19 posts in this topic

Posted

I've been trying to understand what this command means/stands for.
Could anyone help me out with this, please?


'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'

Posted

That is a [b]regular expression[/b], used to match an HTML/XML tag. e.g.
[code]
<h1>This is a heading</h1>
<name>Somebody's name</name>
[/code]
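
To make that concrete, here is a minimal sketch in Python (my choice of language; the thread does not say which one the pattern came from). The helper name match_element is made up for illustration. It runs the pattern against the two sample lines above and pulls out the tag name and the text between the tags.

```python
import re

# The pattern from the original question, written as a Python raw string.
TAG_PATTERN = re.compile(r'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>')

SAMPLES = [
    "<h1>This is a heading</h1>",
    "<name>Somebody's name</name>",
]


def match_element(text):
    """Return (tag_name, inner_text), or None when the pattern does not match.

    Hypothetical helper, only here to show what the two groups capture.
    """
    m = TAG_PATTERN.search(text)
    return (m.group(1), m.group(2)) if m else None


for sample in SAMPLES:
    print(match_element(sample))
```

Group 1 is the tag name and group 2 is whatever sits between the opening and closing tag.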

Posted

That's a [url="http://en.wikipedia.org/wiki/Regular_expression"]regular expression[/url] (damnit got ninjaed).

Posted

[quote name='virtorio' timestamp='1365472721' post='595625872']
That is a [b]regular expression[/b], used to match an HTML/XML tag. e.g.
[code]
<h1>This is a heading</h1>
<name>Somebody's name</name>
[/code]
[/quote]

That use case makes sense. I was going to suggest that it looks suspiciously close to a Perl regular expression (mostly due to the '/\1' slang at the end), but I wasn't sure what it would be used for. Check out the [url="http://perldoc.perl.org/perlre.html"]perlre perldoc[/url] for Perl's regex documentation.

The following is a test Perl script:
[CODE]
#!/usr/bin/perl -w

$r="<A057 B058> rar <something>";
print "rar == $r\n";

$r =~ s/<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>/;
print "rar == $r\n";
[/CODE]

And this is the script's output:
[CODE]
\1 better written as $1 at ./rar.pl line 6.
rar == <A057 B058> rar <something>
rar == A057>something>
[/CODE]

Posted

Regular Expressions [url="http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html"]shouldn't be used for parsing HTML[/url]...

Posted

[quote name='LogicalApex' timestamp='1365475712' post='595625946']
Regular Expressions [url="http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html"]shouldn't be used for parsing HTML[/url]...
[/quote]

While I don't completely agree with the author's [s]inflexible[/s] strong opinion, the rant [i]is[/i] amusingly written. Thanks for the laugh!

Posted

If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong. One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)
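
A rough sketch of that experiment in Python, under a couple of assumptions: the pattern is the one from the first post, and Python's standard-library html.parser stands in for "a proper HTML parser" (it is an event-based tokenizer, not a full HTML5 tree builder, but it copes with unclosed tags). The CellText class is a made-up helper.

```python
import re
from html.parser import HTMLParser

TAG_SOUP = "<table><td>blah<td>blah</tr>"

# The pattern from the first post.
PATTERN = re.compile(r'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>')


class CellText(HTMLParser):
    """Collect text that directly follows a <td> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Only text seen right after a <td> start tag counts as cell text.
        self.in_cell = (tag == "td")

    def handle_endtag(self, tag):
        self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data)


# No opening tag in the soup has a matching closing tag, so the regex finds nothing.
regex_result = PATTERN.search(TAG_SOUP)

parser = CellText()
parser.feed(TAG_SOUP)
parser.close()

print(regex_result)   # the regex gives up
print(parser.cells)   # the parser still sees both cells
```

The regex needs a closing tag that repeats the opening tag's name, and this input never provides one, while the tokenizer just reports the tags and text it encounters.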

Posted

[quote name='The_Decryptor' timestamp='1365478836' post='595626004']
If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong. One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)
[/quote]
Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.

Posted

[code]'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'[/code]

Match "<", followed by any upper- or lower-case letter, then any number of letters or digits. These are put into a match group by the () around them.
(I'm not sure how to interpret the "\b", as I thought that was backspace; in most regex flavors it is a word-boundary assertion when used outside a character class.) After the "\b", match any characters that are not a ">", followed by a ">".

Next, match any number of characters (by default "." matches anything except a newline) and put them in a second match group. Then, match "</" followed by the text that was captured in group 1.
Finally, match ">".

If it's to parse html it doesn't look like it would work.
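
The walkthrough above can be sketched in Python with re.VERBOSE, which lets each piece of the pattern carry its own comment. This assumes Python's regex flavor; the sample text and variable names are invented for illustration. It also shows two reasons the pattern is shaky for HTML: "(.*)" is greedy, and the range A-z covers a few punctuation characters that sit between "Z" and "a" in ASCII.

```python
import re

# The original pattern, annotated piece by piece.
ANNOTATED = re.compile(r"""
    <                  # literal "<"
    ([A-z][A-z0-9]*)   # group 1: the tag name
    \b                 # word boundary, so the name cannot stop mid-word
    [^>]*              # anything that is not ">", e.g. attributes
    >                  # literal ">"
    (.*)               # group 2: greedy, grabs as much as it can
    </\1>              # closing tag repeating whatever group 1 captured
""", re.VERBOSE)

text = '<a href="x">one</a> and <a>two</a>'

# Greedy version: group 2 runs all the way to the LAST </a>.
greedy = ANNOTATED.search(text)

# A lazy variant (".*?") with an explicit letter range stops at the FIRST </a>.
lazy = re.search(r'<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>', text)

# [A-z] spans ASCII 65..122, which includes "_" and a few other symbols.
underscore_in_range = re.fullmatch(r'[A-z]', '_') is not None

print(repr(greedy.group(2)))
print(repr(lazy.group(2)))
print(underscore_in_range)
```

So on a line with two links, the greedy pattern treats everything between the first opening tag and the last closing tag as the content.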

"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>" might be closer and take whitespace into account although I did not test this. :)

(Groups are 0-indexed, but group 0 is the entire match, so the tag is in group 1 and the XML text is in group 2.)

EDIT: The edited one appears to work in C# at least.
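
For anyone who wants to poke at the edited pattern outside C#, here is a small Python sketch (an assumption on my part that the two flavors agree on these constructs, which they should for something this simple). The helper name try_match and the sample strings are invented.

```python
import re

# The whitespace-tolerant pattern suggested above.
EDITED = re.compile(r"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>")


def try_match(text):
    """Return (tag, text) from groups 1 and 2, or None (hypothetical helper)."""
    m = EDITED.search(text)
    return (m.group(1), m.group(2)) if m else None


# Extra whitespace inside the tags is accepted.
print(try_match("< name >Somebody's name< /name >"))

# Attributes are NOT accepted: only whitespace may follow the tag name.
print(try_match('<a href="x">one</a>'))

# With nested tags, only the innermost element matches, because group 2 excludes "<".
print(try_match("<b><i>x</i></b>"))
```

The trade-off compared with the original is that "[^<]*" avoids the greedy overrun but can no longer see through child elements, and dropping "[^>]*" means tags with attributes are skipped.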

Posted

[quote name='Xilo' timestamp='1365479449' post='595626014']

Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.
[/quote]
Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish. And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.

Posted

[quote name='The_Decryptor' timestamp='1365480672' post='595626038']
Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish. And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.
[/quote]

This would be fine especially if you're just reading a config file in an embedded setting. XML isn't necessarily the same as HTML. It may not be parsing a site.

Posted

I'm using Proxomitron, which relies on user-customized regexes to filter/modify annoying parts of HTML, JavaScript, or CSS.
It works.

Posted

Sorry guys, I haven't made myself clear on this.
I know it's a regex; my question was what this particular command actually means.


thanks

Posted

[quote name='luc9' timestamp='1365483068' post='595626082']
Sorry guys, I haven't made myself clear on this.
I know it's a regex; my question was what this particular command actually means.


thanks
[/quote]

GreyWolf's explanation (post #9) is probably the best so far in this thread. There are different variants of regular expressions implemented by various libraries and languages, so you should probably look up the documentation for your variant if you want more detail than GreyWolf's post (for example, the meaning of '\b' - the word boundary matcher found in some regex variants). If you can't find proper documentation or aren't targeting a specific regex variant, I recommend that you read the [url="http://perldoc.perl.org/perlre.html"]Perl regex documentation[/url] because it contains the most common regular expression extensions and is [i]very[/i] well documented (with gratuitous examples). I also like the [url="https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp"]basic regex documentation on the Mozilla Developer Network[/url] because it is very well-formatted and looks appealing.

Posted

[quote name='The_Decryptor' timestamp='1365480672' post='595626038']
Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish. And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.
[/quote]
If the markup changes, the DOM structure will likely change as well, meaning either way, your code is broken. So your point is moot. :/

Posted

The markup can change without the resulting DOM changing (yay tag soup), and even if the DOM did change, there are ways to target specific DOM nodes without relying on the structure of the DOM up to that point (XPath, CSS selectors, IDs, etc.).
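
As a small illustration of targeting a node by ID rather than by position, here is a sketch using Python's xml.etree.ElementTree, which supports a limited XPath subset. The two markup snippets, the "price" id, and the find_price helper are all invented for the example, and the input has to be well-formed XML here; real-world tag soup would need an actual HTML parser in front of it.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page: the structure around the
# value changes completely, but the id on the interesting node stays put.
OLD_MARKUP = '<div><p>Price: <span id="price">42</span></p></div>'
NEW_MARKUP = '<div><section><ul><li><b id="price">42</b></li></ul></section></div>'


def find_price(markup):
    """Return the text of the element with id="price", wherever it sits."""
    root = ET.fromstring(markup)
    # ".//*[@id='price']" means: any descendant element whose id attribute is "price".
    node = root.find(".//*[@id='price']")
    return node.text if node is not None else None


print(find_price(OLD_MARKUP))
print(find_price(NEW_MARKUP))
```

The lookup does not care how deep the node is or which tag carries the id, so the redesign between the two snippets does not break it.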

Posted

A nice tool that I like to use to understand regexes is [url="http://www.regexper.com"]http://www.regexper.com[/url]

You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.

Posted

[quote name='Lant' timestamp='1365607179' post='595628680']
A nice tool that I like to use to understand regexes is [url="http://www.regexper.com"]http://www.regexper.com[/url]

You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.
[/quote]

Adding that extra backslash actually changes the meaning of the last component somewhat, but it's perfectly understandable that the regex parser on that website doesn't understand that particular extension. Thanks for the link; regexper is really neat!
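
Whether the backslash matters depends on where the pattern lives. In Perl's s/.../.../ form the slash is the delimiter, so an unescaped "/" ends the pattern early, which is what happened in the test script earlier in the thread. For comparison, here is a quick sketch in Python, where patterns are plain strings with no delimiter; the sample strings and the groups helper are made up for the check.

```python
import re

ORIGINAL = r'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'
ESCAPED = r'<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>'

SAMPLES = [
    "<h1>This is a heading</h1>",
    "<A057 B058> rar <something>",
    "<p>a</p><p>b</p>",
]


def groups(pattern, text):
    """Return the captured groups, or None if there is no match (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.groups() if m else None


# In Python, "\/" is just an escaped literal slash, so both patterns behave alike.
same = all(groups(ORIGINAL, s) == groups(ESCAPED, s) for s in SAMPLES)
print(same)
```

So in a delimiter-free flavor the two spellings are interchangeable, and the difference only shows up when "/" also marks the end of the pattern.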

Posted

[quote name='Lant' timestamp='1365607179' post='595628680']
A nice tool that I like to use to understand regexes is [url="http://www.regexper.com"]http://www.regexper.com[/url]

You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.
[/quote]

That's awesome! Shamelessly stealing it for my bookmarks.
