
RegEx question (C#/.NET)

Question

Riva    1,109

Hello, I want to switch away from a password manager add-on that exports to the following text format. I suspect I will need to transform it to XML or JSON that another password manager supports. I am terrible at regex, never mind regex over a repeating block of text spanning multiple lines. Any help?

Here is the exported format. I want to get the URL, login and password at a minimum:


 

Websites

Website name: www.url.com
Website URL: https://www.url.com
Login name:
Login: me@host.com
Password: password
Comment:

---

Website name: www.url.com
Website URL: https://www.url.com
Login name:
Login: me@host.com
Password: password
Comment:

---

 

+virtorio    3,100

With such a simple format I would write a simple text parser myself, but if you want to use regex, here's a simple example below:

 

This regex:

Website name: ?(.*)\nWebsite URL: ?(.*)\nLogin name: ?(.*)\nLogin: ?(.*)\nPassword: ?(.*)\nComment: ?(.*)\n\n*---

With this substitution:

<login>\n\t<name>$1</name>\n\t<url>$2</url>\n\t<loginName>$3</loginName>\n\t<login>$4</login>\n\t<password>$5</password>\n\t<comment>$6</comment>\n</login>

Would produce this output:

<login>
	<name>www.url.com</name>
	<url>https://www.url.com</url>
	<loginName></loginName>
	<login>me@host.com</login>
	<password>password</password>
	<comment></comment>
</login>

<login>
	<name>www.url.com</name>
	<url>https://www.url.com</url>
	<loginName></loginName>
	<login>me@host.com</login>
	<password>password</password>
	<comment></comment>
</login>

 

Give it a try here: https://regex101.com/

 

There are plenty of pitfalls you can run into using regex if you're mass converting a bunch of data. Someone more skilled in regex can probably help you out there (e.g. by being more specific than using .*), but that should be enough to get you started.
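For anyone who wants to run the conversion as a script rather than in regex101, here is a minimal sketch in Python. It uses the same pattern as above, with named groups added, and writes JSON instead of XML. The group names and the `export_to_records` helper are made up for illustration. The same pattern should port to .NET's `Regex.Matches`, though a file with Windows line endings would probably need `\r?\n` in place of `\n`.

```python
import json
import re

# The pattern from the post above, with named groups added for readability.
RECORD = re.compile(
    r"Website name: ?(?P<name>.*)\n"
    r"Website URL: ?(?P<url>.*)\n"
    r"Login name: ?(?P<login_name>.*)\n"
    r"Login: ?(?P<login>.*)\n"
    r"Password: ?(?P<password>.*)\n"
    r"Comment: ?(?P<comment>.*)\n\n*---"
)

# Sample export copied from the question.
EXPORT = """Websites

Website name: www.url.com
Website URL: https://www.url.com
Login name:
Login: me@host.com
Password: password
Comment:

---

Website name: www.url.com
Website URL: https://www.url.com
Login name:
Login: me@host.com
Password: password
Comment:

---
"""


def export_to_records(text):
    """Return one dict per record matched by RECORD."""
    return [match.groupdict() for match in RECORD.finditer(text)]


records = export_to_records(EXPORT)
print(json.dumps(records, indent=2))
```

Records that do not fit the pattern exactly are skipped silently, so it is worth comparing `len(records)` with the number of `---` separators in the file.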

DevTech    1,517

Absolutely without a doubt this is NOT a good case for Regex.

 

It needs to be parsed.

 

Any large amount of input will contain "dirty" data that can be caught by parsing...
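As a rough illustration of what "caught by parsing" could look like for this export format, here is a line-by-line sketch in Python. The field list comes from the sample in the question. The `parse_export` name, the problem messages and the second sample record are assumptions made up for the example.

```python
FIELDS = ["Website name", "Website URL", "Login name", "Login", "Password", "Comment"]


def parse_export(text):
    """Parse records separated by '---'; return (records, problems)."""
    records, problems = [], []
    current = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line == "Websites":
            continue  # blank line or the file header
        if line == "---":
            # End of a record: keep it only if every expected field was seen.
            missing = [field for field in FIELDS if field not in current]
            if missing:
                problems.append(f"line {lineno}: record missing {missing}")
            else:
                records.append(current)
            current = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep or key not in FIELDS:
            problems.append(f"line {lineno}: unexpected line {line!r}")
            continue
        if key in current:
            problems.append(f"line {lineno}: duplicate field {key!r}")
            continue
        current[key] = value.strip()
    if current:
        problems.append("end of file: unterminated record")
    return records, problems


# Hypothetical input: one clean record and one damaged record.
sample = """Websites

Website name: www.url.com
Website URL: https://www.url.com
Login name:
Login: me@host.com
Password: pass:word
Comment:

---

Website name: broken.example
oops this line has no field
Password: x

---
"""

records, problems = parse_export(sample)
for problem in problems:
    print(problem)
```

Unlike a single regex over the whole file, this reports which line of a damaged record was the problem instead of dropping the record without comment.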

 

DevTech    1,517

I might have misread the question.

 

If it's just a one-off conversion for say 50 items in a text editor, then Regex is very appropriate.

 

Regex just NEVER belongs in any actual source code. If it is for parsing, use a parser; for something else, use a dedicated library. And if it is some weird thing that is not a one-off, then all Regex is doing is generating a FSM "on the fly", and you are much better off using your own FSM, for which there are some great libs.

 

It is just not professional to use it in code. In all my years of Dev work, I have never seen Regex in code. Even if you are just making a plug-in for some OSS Password Manager, a Regex-based plug-in would be a support nightmare as users discover all the nasty edge cases for you!

 

 

Riva    1,109

Interesting comment. I have used a lot of regex to validate email addresses in ASP.NET over the past 19 years.

And yes, this is a one-off file transformation to save me the trouble of making every entry by hand in the new add-on.
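For reference, the kind of validation meant here can be sketched in a few lines. This is Python rather than ASP.NET, and the pattern is a deliberately loose assumption (something@something.tld), not a full RFC 5322 check and not the pattern actually used in those projects.

```python
import re

# Deliberately loose: one "@", no whitespace, and a dot somewhere after the "@".
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value):
    """Return True when the whole string has a plausible address shape."""
    return EMAIL.fullmatch(value) is not None


for candidate in ["me@host.com", "me@host", "me @host.com", "a@b@c.com"]:
    print(candidate, looks_like_email(candidate))
```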

scumdogmillionaire    242

Yeah I disagree with "no Regex in code", too. I use it frequently in code: validation, as well as scraping HTML and other data formats. To each his own, but I reckon a "no regex in code" rule has probably hurt you over the years more than helped you ;)

DevTech    1,517
4 minutes ago, scumdogmillionaire said:

Yeah I disagree with "no Regex in code", too. I use it frequently in code: validation, as well as scraping HTML and other data formats. To each his own, but I reckon a "no regex in code" rule has probably hurt you over the years more than helped you ;)

Nah. I'm a great advocate of just using a FSM. Very maintainable. Complete control of the Async model. 

 

Dev is what I do, Tech is the hobby.

 

DevTech    1,517
27 minutes ago, Riva said:

Interesting comment. I have used a lot of regex to validate email addresses in ASP.NET over the past 19 years.

And yes, this is a one-off file transformation to save me the trouble of making every entry by hand in the new add-on.

Yeah the old-school ASP.NET forces you into that model. It would take insane contortions to work around it.

 

Which is one of a zillion reasons they redesigned it completely.

scumdogmillionaire    242
6 minutes ago, DevTech said:

Nah. I'm a great advocate of just using a FSM. Very maintainable. Complete control of the Async model. 

 

Dev is what I do, Tech is the hobby.

 

Really? If you were gonna parse HTML you would go FSM? I may need to look more into it-- never done it. But I know Regex well enough and can quickly and comfortably parse something even advanced.

 

Like, imagine you wanted to take the source code of this thread and you wanted each comment, as well as the author and date into an object you'd go FSM vs Regex?

DevTech    1,517
1 hour ago, scumdogmillionaire said:

Really? If you were gonna parse HTML you would go FSM? I may need to look more into it-- never done it. But I know Regex well enough and can quickly and comfortably parse something even advanced.

 

Like, imagine you wanted to take the source code of this thread and you wanted each comment, as well as the author and date into an object you'd go FSM vs Regex?

Hey man, that's out of context.

 

In the original reply, I said to use a parser for parsing. And by extension of that logic, use the appropriate library/algorithm for each domain area.

 

What is left over, that a person might feel Regex is calling their name, probably is better suited to FSM.

 

In the end, design, architecture, the systems being worked with etc., as you know, influence things a lot. In general, for most languages Regex is typically not thread-safe, which can deliver weird gotchas.

 

I was tempted to provide insight on this thread:

 

 

 

And then I thought after that LOL moment, how on earth do you bring people up to speed on such fundamental errors that are endemic in modern multi-tasking, multi-core programming?

 

Once, I was working on a PCI Device Driver and realized that one core was clobbering an instruction on another core right in the middle of the operation of the ASM CPU instruction, because even a basic silicon level instruction is NOT atomic! (Some of them are, Intel has a list...) Just saying how "spooky" multi-tasking can feel...

 

So, this has hit ramble state, but the idea was an "Async First" mindset for modern prog, and that's just another detail to drill down into with Regex. So I might have been a bit too universal/global on the idea of Regex, but it was worth pointing out that using it is kind of a Red Flag warning about the design.

 

Sort of imagine a dialog box - "Are you sure you want to do that?"
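Whether a regex object can be shared between threads depends on the engine, so it is worth checking per language. As one data point, the sketch below shares a single compiled pattern across a thread pool in Python, where each `search` call returns its own match object. The .NET documentation likewise describes the `Regex` class as immutable and thread safe. The record strings and the `extract_login` helper are made up for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# One compiled pattern shared by every worker thread.
LOGIN = re.compile(r"^Login: ?(.*)$", re.MULTILINE)


def extract_login(record):
    """Each call gets its own match object; the shared pattern is not mutated."""
    match = LOGIN.search(record)
    return match.group(1) if match else None


# Hypothetical records in the shape of the export from the question.
records = [f"Login name:\nLogin: user{i}@host.com\nPassword: x" for i in range(100)]

with ThreadPoolExecutor(max_workers=8) as pool:
    logins = list(pool.map(extract_login, records))

print(logins[:3])
```

This says nothing about engines in other languages, which is where the gotchas mentioned above could still apply.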

 

Riva    1,109

Your problem, my friend, is that you are too much of a developer looking for complex answers to simple questions :D

It's OK, we have all been there.

DevTech    1,517
1 minute ago, Riva said:

Your problem, my friend, is that you are too much of a developer looking for complex answers to simple questions :D

It's OK, we have all been there.

It's the other way around with Regex!

 

It is a complex implementation for many many things that could be simpler.

 

Most languages compile Regex into an abstract representation that they then feed into a statically compiled FSM Engine that in the best case is gonna block the code.

 

So, I think it is simpler to take control and know what is going on.

 

For one-off stuff, of course just go wild and crazy with whatever...
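To make the comparison concrete, here is a small sketch in Python of "taking control": a hand-written state machine that accepts the same strings as a short regex for decimal numbers. The pattern, state names and transition table are all made up for illustration, and the sketch makes no claim about how any particular regex engine works internally.

```python
import re

# The regex version: optional minus, digits, optional fractional part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?", re.ASCII)

DIGITS = "0123456789"

# (current state, input class) -> next state; anything missing is a reject.
TRANSITIONS = {
    ("start", "minus"): "sign",
    ("start", "digit"): "int",
    ("sign", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"int", "frac"}


def classify(ch):
    """Map a single character to the input class used by TRANSITIONS."""
    if ch in DIGITS:
        return "digit"
    if ch == "-":
        return "minus"
    if ch == ".":
        return "dot"
    return "other"


def fsm_is_number(text):
    """Hand-written state machine accepting the same strings as NUMBER."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:
            return False
    return state in ACCEPTING


for candidate in ["42", "-3.14", "3.", "-", "1.2.3"]:
    print(candidate, fsm_is_number(candidate), NUMBER.fullmatch(candidate) is not None)
```

The table is longer than the one-line pattern, but every state and transition is visible and can be stepped through in a debugger.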

 

Riva    1,109

Every language/framework has built-in regex that takes a line of code to execute. Your solution needs a lot more work in comparison.

DevTech    1,517
34 minutes ago, Riva said:

Every language/framework has built-in regex that takes a line of code to execute. Your solution needs a lot more work in comparison.

Yeah, that's a major problem when nobody understands what is going on behind the curtain.

 

The problem with programming is that over time nothing is real - literally all the education revolves around huge bodies of work created by humans as a "convenient" abstraction in layer after layer of "just one line of code" which the Silicon does not give a crap about. 

 

My "solution" to a hypothetical undefined problem is maintainable and easily moved down the ladder to bare metal and IOT for maximum efficiency in large data centers etc. But that is what design and architecture are all about. But really we had gotten to the rare case where a FSM vs Regex decision might have been involved. For the real world, for some data in JSON, you would use JSON.NET, i.e. it gets parsed.

 

For one-off it mostly doesn't matter if some ad-hoc approach is used (except for the ridiculously large number of one-offs that get shoe-horned into production code based on the stupid maxim "why fix it, if it ain't broke")

 

Anyways, zero value in continuing this discussion by the looks of it, so nice chatting with you on an off-topic thought or two :)

 

 

 

 

astropheed    2,207
3 hours ago, DevTech said:

Regex just NEVER belongs in any actual source code. If it is for parsing, use a parser

There is a decent enough chance that parser is using regex. I use regex in my code and I have been for well over a decade. I'm befuddled. Although to be fair I wouldn't use regex to parse html.

DevTech    1,517
38 minutes ago, astropheed said:

There is a decent enough chance that parser is using regex. I use regex in my code and I have been for well over a decade. I'm befuddled. Although to be fair I wouldn't use regex to parse html.

A "real" parser is not going to be using Regex because the creators will be acutely aware of how that works behind the scenes... Well I would hope - a whole bunch of test matches would get encoded and wait in turn for eval by an external FSM which seems really inefficient.

 

But that is a huge advantage of OSS code. One can actually check that stuff was designed correctly and select something else if not. Or maybe I'm wrong and some algorithm has come along to make exec of Regex super efficient.

 

I can't decide if we would have fun exchanging examples from well known GitHub projects or if there is simply an infinity of approaches in the thousands of layers that humans have created on top of the Silicon. I don't know if my thought is meaningful, but scientists work hard to discover whatever secrets the Universe will reveal, while programmers work hard to understand the many layers of abstraction created by other humans over the years. There is so much "stuff" that no human could understand a tiny fraction of all the code on GitHub in a lifetime.

 

So if we want to look at some reasonably well used parsers on GitHub, I'm up to that as a fun project.

 

I've written some recursive descent parsers for a few languages in the past, but I think all modern parsers use multi-stages based on AST representation so it might be fun to dig into a few and get up to date! I would be particularly interested in techniques that might be popular for interruptible thread safe Async parser design that would scale-out deterministically.

 

I'm currently killing my brain on GPU Shader code and HLSL vs Vulkan SPIR-V so looking at parsers seems like a nice relaxing thing. Actually that reminds me that I was musing about making up some universal Shading Language that gets parsed to HLSL for DirectX and SPIR-V for mobile, Linux etc...

 

 

astropheed    2,207
18 minutes ago, DevTech said:

I can't decide if we would have fun exchanging examples from well known GitHub projects

I can assure you we would not, you might though; I'd rather go to the dentist lol.

 

17 minutes ago, DevTech said:

There is so much "stuff" that no human could understand a tiny fraction of all the code on GitHub in a lifetime.

Technically any human could understand a tiny fraction of all the code on GitHub because you were not specific on the ratio of the fraction. :D

DevTech    1,517
1 minute ago, astropheed said:

I can assure you we would not, you might though; I'd rather go to the dentist lol.

Ah, not a problem

 

It was the only realistic way to test your unlikely guess: "There is a decent enough chance that parser is using regex"

 

But it jogged my memory about wanting to make a parser for some abstract Shader Language so if I ever get around to that nebulous concept, I'll let you know what I find out about "state of the art" parsers

 

So many things to do, never enough time....

 

astropheed    2,207
8 minutes ago, DevTech said:

It was the only realistic way to test your unlikely guess: "There is a decent enough chance that parser is using regex"

To humour you I decided to look up 'markedjs' which is the (fairly popular) Markdown parser I'm using on my VueJS project. I'm sure you'll find the code very interesting. I'm just happy my 'unlikely guess' wasn't really that unlikely.

DevTech    1,517
16 minutes ago, astropheed said:

To humour you I decided to look up 'markedjs' which is the (fairly popular) Markdown parser I'm using on my VueJS project. I'm sure you'll find the code very interesting. I'm just happy my 'unlikely guess' wasn't really that unlikely.

Ha! I can't consider anything JavaScript related as real code. Using inefficient methods in an inefficient system seems, well, perfectly fitting.

 

Sort of like expecting anything efficient in Python...

 

Actually, I think a popular HTML parser in .NET, the HTML Agility Pack, also uses Regex. And everyone makes fun of that in the industry, but apparently it parses surprisingly well from what I've heard.

 

None of this contradicts my thoughts on the correct way to do this and hence, I'll have to look into it with a "real" parser in a real language using real timed benchmarks. But not tonight.

 

And I really can't stand dentists, so this WILL be better for me!

 

 

Squirrelington    44

Regex is well-suited for unknown, user input handling if written well enough. HTML is well structured with known tags. Regex used against XIVML would be a better use-case since you may not always know all of the tags that are possible if you're dealing with proprietary systems.

 

If the format is well-known, sure a parser is ideal. Just handle the tags or formatting that is explicitly expected. If the format is ambiguous or undefined, Regex can work very well.

+virtorio    3,100

I use regular expressions to process climate data (from various sources) which is often inconsistent and contains all sorts of unexpected characters.

 

Far more robust than keeping a parsing application up-to-date every time something comes in that doesn't follow the rules.
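A rough idea of what that looks like, sketched in Python. The input lines, the `temp` label and the `extract_readings` helper are invented for illustration and are not the actual climate data or patterns.

```python
import re

READING = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})"   # an ISO-style date somewhere in the line
    r".*?temp\w*\s*[:=]?\s*"         # a 'temp' or 'temperature' label, loosely punctuated
    r"(?P<value>-?\d+(?:\.\d+)?)",   # the numeric reading
    re.IGNORECASE,
)


def extract_readings(lines):
    """Return ([(date, value), ...], [skipped lines]) for loosely formatted input."""
    readings, skipped = [], []
    for line in lines:
        match = READING.search(line)
        if match:
            readings.append((match.group("date"), float(match.group("value"))))
        else:
            skipped.append(line)
    return readings, skipped


# Hypothetical messy input with inconsistent separators and stray characters.
lines = [
    "2020-01-01; temp= 12.5C",
    "2020-01-02 | TEMPERATURE: -3 °C",
    "??2020-01-03\ttemp 7.25 \x00",
    "no reading today",
    "2020-01-04 temp: n/a",
]

readings, skipped = extract_readings(lines)
print(readings)
print(skipped)
```

Keeping the skipped lines makes it possible to see what the pattern missed when a new source turns up.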

DevTech    1,517
47 minutes ago, Squirrelington said:

Regex is well-suited for unknown, user input handling if written well enough. HTML is well structured with known tags. Regex used against XIVML would be a better use-case since you may not always know all of the tags that are possible if you're dealing with proprietary systems.

 

If the format is well-known, sure a parser is ideal. Just handle the tags or formatting that is explicitly expected. If the format is ambiguous or undefined, Regex can work very well.

 

32 minutes ago, virtorio said:

I use regular expressions to process climate data (from various sources) which is often inconsistent and contains all sorts of unexpected characters.

 

Far more robust than keeping a parsing application up-to-date every time something comes in that doesn't follow the rules.

Actually, HTML is a special case that confounds the usual thinking about parsing and is actually better suited to Regex, which is odd coming from me.

 

Although the "language" is well formed and well specified, the implementations are not, and EVERY popular browser is very slack about what it will accept. So there is a huge percentage of BAD HTML out there and you need something very loosey-goosey to make sense of it. But that's why I mentioned the .NET HTML Agility Pack HTML parser, which started out with that assumption in mind.

 

This just creates a huge body of stuff that expects that type of input and hence the cycle is unlikely to ever change, especially when you consider that modern web pages these days are pulled together on the fly on BOTH the server side and the client side.
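As a small illustration of the "slack" approach, here is a sketch using Python's standard-library `html.parser`, which reports tags as it meets them without requiring well-formed markup. It is not the HTML Agility Pack, and the `LinkCollector` class and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, however badly the page is nested."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Unclosed <p>, mis-nested <b>, an upper-case tag and a stray </p>.
BAD_HTML = '<p>one<p>two <a href="/a">first<b>bold</a> <A HREF="/b">second</p>'

collector = LinkCollector()
collector.feed(BAD_HTML)
collector.close()
print(collector.links)
```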

 

firey    3,882
15 hours ago, DevTech said:

I was tempted to provide insight on this thread:

Hmm, being as that was my thread, maybe you SHOULD have responded. Seems like you know more than everyone else so you should share your knowledge.

 

Just sayin'

DevTech    1,517
2 hours ago, firey said:

Hmm, being as that was my thread, maybe you SHOULD have responded. Seems like you know more than everyone else so you should share your knowledge.

 

Just sayin'

Everyone in the Neowin forums volunteers their time out of some sense of goodwill toward humanity. I imagine most of us would like to actually be of some help and really help resolve some issues that are blocking people.

 

In the hardware forums, the "hands on" nature of hardware helps keep the discussions grounded (IMO) but there is also a "useless" factor where debates on rather minor variations in optimizing somebody's config will make no large impact in day to day usage, which means precious volunteer time is potentially wasted.

 

In the case of your thread it was just a combo of things:

 

1. I was just looking at it yesterday which was a bit "late to the party" and it might have been resolved by now to your satisfaction

2. None of the participants appeared to have a fundamental understanding of multi-tasking contention

3. I have generally avoided the programming forums as "not much fun" due to endless arguments of the type "how many angels can dance on the head of a pin" combined with "every programmer thinks he is an expert" combined with that hard to define "layers of abstraction" where instead of dealing with real things, programmers are always immersed in layers of sometimes stupid stuff other people have invented. All of these factors have the potential to make any programming discussion become a time sink like walking through molasses...

 

If you would like me to still look at your situation, please let me know.

 
