• 0

what does this code mean in unix?


Question

I've been trying to understand what this command means/stands for.

Could anyone help me out with this, please?

'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'


18 answers to this question


  • 0

That is a regular expression, used to match an HTML/XML tag. e.g.

<h1>This is a heading</h1>
<name>Somebody's name</name>
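
To see those two examples matched in practice, here is a minimal sketch. It assumes Python's re module (the thread does not say which tool the pattern came from) and a hypothetical helper named match_element. [A-Za-z] is used in place of [A-z] because the latter range also covers a few punctuation characters.

```python
import re

# Same pattern as in the question, with [A-z] written as [A-Za-z].
ELEMENT = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*)</\1>")

def match_element(text):
    """Return (tag name, inner text) for the first element found, else None."""
    m = ELEMENT.search(text)
    return (m.group(1), m.group(2)) if m else None

print(match_element("<h1>This is a heading</h1>"))
print(match_element("<name>Somebody's name</name>"))
# The closing tag must repeat the opening name, so this one is rejected.
print(match_element("<h1>Mismatched closing tag</h2>"))
```

The third call returns None because the backreference \1 only accepts a closing tag with the same name as the opening one.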


  • 0


That use case makes sense. I was going to suggest that it looks suspiciously close to a Perl regular expression (mostly due to the '\1' backreference at the end), but I wasn't sure what it would be used for. See the perlre perldoc for Perl's regex documentation.

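The following is a test Perl script. Note that the unescaped "/" in "</\1>" ends the pattern, so Perl reads this as a substitution: everything up to the second "<" is the pattern and "\1>" is the replacement.

[CODE]
#!/usr/bin/perl -w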

$r="<A057 B058> rar <something>";
print "rar == $r\n";

$r =~ s/<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>/;
print "rar == $r\n";
[/CODE]

And this is the script's output:

[CODE]
\1 better written as $1 at ./rar.pl line 6.
rar == <A057 B058> rar <something>
rar == A057>something>
[/CODE]
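
For comparison, the same test string can be run against the complete pattern. This is a sketch using Python's re module rather than Perl (an assumption made for illustration). The first search uses the whole pattern as a plain match. The second reproduces the substitution that the script above actually performed, where the pattern stops at the "/".

```python
import re

text = "<A057 B058> rar <something>"

# The full pattern, used as a plain match: the input has no closing
# </A057> tag, so nothing matches.
full = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")
print(full.search(text))

# The substitution the Perl script ran: the pattern ends before the "/",
# and "\1>" is the replacement text.
truncated = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)<")
result = truncated.sub(r"\1>", text, count=1)
print(result)
```

The second print shows the same "A057>something>" that appears in the script's output.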


  • 0

If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong. One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with a regex.)
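
A rough sketch of that experiment, assuming Python. It uses the standard-library html.parser, which is a lenient event-based tokenizer rather than a full DOM builder, and a hypothetical CellText class written for this example.

```python
import re
from html.parser import HTMLParser

SOUP = "<table><td>blah<td>blah</tr>"

# The regex from the question needs a matching closing tag, which this
# markup never supplies for <table> or <td>.
pattern = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*)</\1>")
regex_result = pattern.search(SOUP)

class CellText(HTMLParser):
    """Collect the text that directly follows each <td> start tag."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Any start tag resets the flag; only <td> turns it on.
        self.in_cell = (tag == "td")

    def handle_endtag(self, tag):
        self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data)

parser = CellText()
parser.feed(SOUP)
parser.close()

print(regex_result)   # the regex finds nothing
print(parser.cells)   # the tokenizer still recovers both cells
```

The regex returns no match at all, while the tokenizer recovers both cell texts even though no td is ever closed.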


  • 0

Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.


  • 0

'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'

Match "<", followed by one character in the range A-z, then any number of letters or digits. These are put into a match group by the () around them. (Note that [A-z] covers more than letters: in ASCII the range also includes [ \ ] ^ _ and `, so [A-Za-z] was probably intended.)

The "\b" is a word boundary, not a backspace: it matches the position where the tag name ends, without consuming a character. After it, match any number of characters that are not ">", followed by a ">".

Next, match any number of characters (any character except a newline, by default) and put them in the second match group. Then, match "</" followed by the text that was captured in group 1.

Finally, match ">".

If it's to parse html it doesn't look like it would work.

"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>" might be closer and take whitespace into account although I did not test this. :)

(Groups are zero-indexed, but group 0 is the entire match, so the tag name is in group 1 and the element text is in group 2.)

EDIT: The edited one appears to work in C# at least.
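
A quick check of the revised pattern and the group numbering, sketched in Python rather than C# (an assumption made for illustration; the sample input is made up). Note that the pattern allows whitespace after "<" and before ">", but not between "/" and the tag name. The last lines show which characters the [A-z] range really covers.

```python
import re

# The revised pattern from the post, unchanged.
revised = re.compile(r"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>")

m = revised.search("< name >Somebody's name</name >")
whole, tag, text = m.group(0), m.group(1), m.group(2)
print(whole)  # group 0: the entire match
print(tag)    # group 1: the tag name
print(text)   # group 2: the text between the tags

# [A-z] spans ASCII 65..122, which includes [ \ ] ^ _ ` between Z and a.
odd = re.fullmatch(r"[A-z]", "_") is not None
strict = re.fullmatch(r"[A-Za-z]", "_") is not None
print(odd, strict)
```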


  • 0

Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish. And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.


  • 0

This would be fine especially if you're just reading a config file in an embedded setting. XML isn't necessarily the same as HTML. It may not be parsing a site.
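
As a rough illustration of that kind of use, here is a sketch in Python. The config contents and the names CONFIG, ENTRY and settings are hypothetical. It assumes a flat file with one element per line, no nesting and no attributes.

```python
import re

# Hypothetical flat config: one element per line, no nesting, no attributes.
CONFIG = """\
<host>192.168.0.10</host>
<port>8080</port>
<mode>quiet</mode>
"""

# Without re.DOTALL, "." stops at a newline, so each match stays on one line.
ENTRY = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*)</\1>")

settings = {m.group(1): m.group(2) for m in ENTRY.finditer(CONFIG)}
print(settings)
```

This only holds up while the file stays that simple. Two elements on one line, or nested elements with the same name, would make the greedy (.*) capture too much.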


  • 0

Sorry guys, I haven't made myself clear on this.

I know it's a regex; my question was what this particular command actually means.

thanks


  • 0

GreyWolf's explanation (post #9) is probably the best so far in this thread. There are different variants of regular expressions implemented by various libraries and languages, so you should probably look up the documentation for your variant if you want more detail than GreyWolf's post (for example, the meaning of '\b', the word boundary matcher found in some regex variants). If you can't find proper documentation or aren't targeting a specific regex variant, I recommend that you read the Perl regex documentation, because it covers the most common regular expression extensions and is very well documented (with plenty of examples). I also like the basic regex documentation on the Mozilla Developer Network because it is well formatted and easy to read.
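
To show what the word boundary contributes in this particular pattern, here is a small sketch assuming Python's re module. Raw strings are used because in an ordinary Python string "\b" really is a backspace character. The sample input is made up.

```python
import re

with_boundary = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*)</\1>")
without_boundary = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[^>]*>(.*)</\1>")

sample = "<bold>x</b>"

# Without \b the name group can backtrack to "b" and let [^>]* absorb "old",
# so the mismatched closing tag </b> is accepted.
loose = without_boundary.search(sample)
print(loose.group(1), loose.group(2))

# With \b the name must end at a word boundary, so only "bold" is allowed
# and the pattern finds no matching closing tag.
print(with_boundary.search(sample))
```

In other words, \b forces group 1 to hold the whole tag name rather than a prefix of it.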


  • 0

If the markup changes, the DOM structure will likely change as well, meaning your code is broken either way. So your point is moot. :/


  • 0

The markup can change without the resulting DOM changing (yay tag soup), and even if the DOM did change, there are ways to target specific DOM nodes without relying on the structure of the DOM up to that point (XPath, CSS selectors, IDs, etc.)
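
A minimal sketch of targeting a node by ID rather than by position, assuming Python's xml.etree.ElementTree and its limited XPath support. The two markup strings and the price helper are hypothetical, and both inputs are well-formed XML for the sake of the example. Real tag soup would need an HTML parser in front.

```python
import xml.etree.ElementTree as ET

# Two hypothetical revisions of the same page; the second wraps the price
# in extra containers.
OLD = "<page><span id='price'>9.99</span></page>"
NEW = "<page><div><section><span id='price'>9.99</span></section></div></page>"

def price(markup):
    """Find the element with id='price' at any depth and return its text."""
    root = ET.fromstring(markup)
    node = root.find(".//*[@id='price']")
    return node.text if node is not None else None

print(price(OLD))
print(price(NEW))
```

Both calls return the same text, even though the element sits at a different depth in each revision.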


  • 0

A nice tool that I like to use to understand regexes is http://www.regexper.com

You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; the extra backslash is needed for the regex parser they use to understand it.

Adding that extra backslash actually changes the meaning of the last component somewhat, but it's perfectly understandable that the regex parser on that website doesn't understand that particular extension. Thanks for the link; regexper is really neat!
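
Whether the extra backslash matters depends on the tool. Where "/" delimits the pattern (a JavaScript regex literal, or Perl's s/.../.../ form as in the test script earlier in the thread), an unescaped "/" ends the pattern. Elsewhere "\/" is simply a literal slash. A small sketch with Python's re module, an assumption chosen for illustration, where the two spellings behave the same:

```python
import re

plain = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*)</\1>")
escaped = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*)<\/\1>")

sample = "<h1>This is a heading</h1>"

# Compare the captured groups from both spellings of the pattern.
same = plain.search(sample).groups() == escaped.search(sample).groups()
print(same)
```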


  • 0

That's awesome! Shamelessly stealing it for my bookmarks.


This topic is now closed to further replies.