• 0

what does this code mean in unix?


  • 0
  On 09/04/2013 at 01:58, virtorio said:

That is a regular expression, used to match an HTML/XML tag. e.g.

<h1>This is a heading</h1>
<name>Somebody's name</name>

That use case makes sense. I was going to suggest that it looks suspiciously close to a Perl regular expression (mostly due to the '/\1' slang at the end), but I wasn't sure what it would be used for. Check out the perlre perldoc for Perl's regex documentation.

The following is a test Perl script:

[CODE]
#!/usr/bin/perl -w

$r="<A057 B058> rar <something>";
print "rar == $r\n";

$r =~ s/<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>/;
print "rar == $r\n";
[/CODE]

And this is the script's output:

[CODE]
\1 better written as $1 at ./rar.pl line 6.
rar == <A057 B058> rar <something>
rar == A057>something>
[/CODE]
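A possible reading of that output, sketched in Python rather than Perl: in `s/PATTERN/REPLACEMENT/`, the pattern ends at the first unescaped `/`, so the script above substitutes rather than matches a whole tag pair. The split into `pattern` and `replacement` below is an assumption about how Perl reads the line, and the variable names are only illustrative.

```python
import re

# Assumption: Perl's s/// ends the pattern at the first unescaped "/",
# so the regex from the script is split into these two pieces.
pattern = r"<([A-z][A-z0-9]*)\b[^>]*>(.*)<"
replacement = r"\1>"

text = "<A057 B058> rar <something>"

# count=1 mirrors s/// without the /g flag: only the first match is replaced.
result = re.sub(pattern, replacement, text, count=1)
print("rar ==", result)
```

The greedy `(.*)` runs up to the last `<`, so the matched span is `<A057 B058> rar <`, which is replaced by group 1 followed by `>`.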

  • 0
  On 09/04/2013 at 02:48, LogicalApex said:

Regular Expressions shouldn't be used for parsing HTML...

While I don't completely agree with the author's inflexible strong opinion, the rant is amusingly written. Thanks for the laugh!

  • 0

If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong. One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)
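To make the tag-soup example concrete, here is a minimal sketch in Python (an assumption; the thread itself uses Perl and C#). It runs the regex from the question against that input and then feeds the same input to the standard library's `html.parser.HTMLParser`. Note that `HTMLParser` is a lenient tokenizer, not a full browser-grade HTML parser, and `EventCollector` is a hypothetical helper written for this illustration.

```python
import re
from html.parser import HTMLParser

# The regex from the question, unchanged.
TAG_PAIR = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")

soup = "<table><td>blah<td>blah</tr>"


class EventCollector(HTMLParser):
    """Hypothetical helper: records every tag and text chunk it is handed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


# No opening tag here has a matching closing tag, so the regex finds nothing.
regex_result = TAG_PAIR.search(soup)

collector = EventCollector()
collector.feed(soup)
collector.close()

print(regex_result)
print(collector.events)
```

The regex has nothing to report because it insists on a matching close tag, while the tokenizer still hands back both cells' text.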

  • 0
  On 09/04/2013 at 03:40, The_Decryptor said:

If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong. One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)

Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.

  • 0

'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'

Match "<", followed by any upper- or lower-case letter, then any number of letters or digits. These are put into a match group by the () around them.

(I wasn't sure how to interpret the "\b", as I thought that was backspace; in most regex flavors it is a word boundary when used outside a character class.) After the word boundary, match any number of characters that are not ">", followed by a ">".

Next, match any number of characters (other than a newline) and put them in the second match group. Then, match "</" followed by the text that was captured in group 1.

Finally, match ">".
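The walk-through can be checked with a short sketch. It uses Python's `re` module as a stand-in (an assumption; other regex flavors may differ in details), where `\b` outside a character class is a word boundary. The helper `describe` and the sample strings other than the two from the first reply are made up for illustration. One thing it also shows: `[A-z]` is an ASCII range that covers a few punctuation characters between `Z` and `a`, such as `_`, not just letters.

```python
import re

# The regex from the question, unchanged.
TAG_PAIR = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")


def describe(text):
    """Return (whole match, group 1, group 2), or None if nothing matches."""
    m = TAG_PAIR.search(text)
    return None if m is None else (m.group(0), m.group(1), m.group(2))


samples = [
    "<h1>This is a heading</h1>",    # plain tag pair
    "<name>Somebody's name</name>",  # plain tag pair
    '<a href="x">link</a>',          # [^>]* skips over the attribute
    "<_x>text</_x>",                 # "_" falls inside the A-z range
    "<h1>mismatched</h2>",           # \1 requires the same tag name
]

for sample in samples:
    print(sample, "->", describe(sample))
```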

If it's to parse html it doesn't look like it would work.

"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>" might be closer and take whitespace into account although I did not test this. :)

(Groups are 0-indexed, but group 0 is the entire match, so the tag is in group 1 and the XML text is in group 2.)

EDIT: The edited one appears to work in C# at least.
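For comparison, a rough sketch of how the original and the edited expression behave on a few inputs, again using Python's `re` as an assumed stand-in for the C# test mentioned above. The names `ORIGINAL`, `EDITED` and `groups` and the sample strings are hypothetical.

```python
import re

ORIGINAL = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")
EDITED = re.compile(r"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>")


def groups(regex, text):
    """Return (tag, text) from groups 1 and 2, or None if nothing matches."""
    m = regex.search(text)
    return None if m is None else (m.group(1), m.group(2))


samples = [
    "< name >Somebody</name >",   # whitespace inside the angle brackets
    '<a href="x">link</a>',       # attribute on the opening tag
    "<b>one</b> and <b>two</b>",  # two elements on one line
]

for sample in samples:
    print(sample)
    print("  original:", groups(ORIGINAL, sample))
    print("  edited:  ", groups(EDITED, sample))
```

The trade-off seems to be that the edited version tolerates whitespace and stops at the first closing tag, but it no longer accepts attributes, while the original's greedy `(.*)` swallows everything up to the last matching close tag.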

  • 0
  On 09/04/2013 at 03:50, Xilo said:

Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.

Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish. And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.

  • 0
  On 09/04/2013 at 04:11, The_Decryptor said:

Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish. And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.

This would be fine especially if you're just reading a config file in an embedded setting. XML isn't necessarily the same as HTML. It may not be parsing a site.

  • 0
  On 09/04/2013 at 04:51, luc9 said:

sry guys i havent made myself clear on this

i know its regex, and my question was what this particular command actually meant

thanks

GreyWolf's explanation (post #9) is probably the best so far in this thread. There are different variants of regular expressions implemented by various libraries and languages, so you should probably look up the documentation for your variant if you want more detail than GreyWolf's post (for example, the meaning of '\b', the word boundary matcher found in some regex variants). If you can't find proper documentation or aren't targeting a specific regex variant, I recommend that you read the Perl regex documentation, because it covers the most common regular expression extensions and is very well documented (with plenty of examples). I also like the basic regex documentation on the Mozilla Developer Network because it is very well-formatted and looks appealing.

  • 0
  On 09/04/2013 at 04:11, The_Decryptor said:

Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish. And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.

If the markup changes, the DOM structure will likely change as well, meaning your code is broken either way. So your point is moot. :/

  • 0

The markup can change without the resulting DOM changing (yay tag soup), and even if the DOM did change, there are ways to target specific DOM nodes without relying on the structure of the DOM up until that point (XPath, CSS selectors, IDs, etc.)
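As a rough illustration of targeting a node by ID rather than by position, here is a sketch using Python's standard `xml.etree.ElementTree` and its limited XPath support. This is an assumption-heavy example: ElementTree is an XML parser and needs well-formed input, so real-world HTML would need an actual HTML parser instead, and the two markup strings and the `price_text` helper are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical "before" and "after" markup: the structure around the
# interesting node changes, but its id attribute stays the same.
OLD_MARKUP = "<page><div><span id='price'>42</span></div></page>"
NEW_MARKUP = "<page><section><p>Now only <b id='price'>42</b></p></section></page>"


def price_text(markup):
    """Find the element with id='price' anywhere in the tree."""
    root = ET.fromstring(markup)
    node = root.find(".//*[@id='price']")
    return None if node is None else node.text


print(price_text(OLD_MARKUP))
print(price_text(NEW_MARKUP))
```

The same query works on both versions because it never mentions the parent elements or even the tag name of the node it is looking for.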

  • 0
  On 10/04/2013 at 15:19, Lant said:

A nice tool that I like to use to understand regexes is http://www.regexper.com

You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.

Adding that extra backslash actually changes the meaning of the last component somewhat, but it's perfectly understandable that the regex parser on that website doesn't understand that particular extension. Thanks for the link; regexper is really neat!

  • 0
  On 10/04/2013 at 15:19, Lant said:

A nice tool that I like to use to understand regexes is http://www.regexper.com

You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.

That's awesome! Shamelessly stealing it for my bookmarks.
