SlayerS_BoxeR Posted April 9, 2013
I've been trying to understand what this command means/stands for. Could anyone help me out with this, please?
'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'
+virtorio MVC Posted April 9, 2013
That is a regular expression, used to match an HTML/XML tag, e.g.
<h1>This is a heading</h1>
<name>Somebody's name</name>
Andre S. Veteran Posted April 9, 2013
That's a regular expression (damn it, got ninja'd).
Karl L. Posted April 9, 2013
[QUOTE=virtorio]
That is a regular expression, used to match an HTML/XML tag, e.g.
<h1>This is a heading</h1>
<name>Somebody's name</name>
[/QUOTE]
That use case makes sense. I was going to suggest that it looks suspiciously close to a Perl regular expression (mostly due to the '/\1' slang at the end), but I wasn't sure what it would be used for. Check out the perlre perldoc for Perl's regex documentation.
The following is a test Perl script:
[CODE]
#!/usr/bin/perl -w

$r="<A057 B058> rar <something>";
print "rar == $r\n";

$r =~ s/<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>/;
print "rar == $r\n";
[/CODE]
And this is the script's output:
[CODE]
\1 better written as $1 at ./rar.pl line 6.
rar == <A057 B058> rar <something>
rar == A057>something>
[/CODE]
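A possible reading of that output, offered as an assumption rather than something stated in the thread: because the substitution uses / as its delimiter, Perl would treat the / inside </\1> as the end of the pattern. That leaves everything up to the last < as the pattern and \1> as the replacement, which would also explain the "\1 better written as $1" warning. The Python sketch below mimics that split with re.sub; the variable names are made up for illustration.

```python
import re

# Input string from the Perl test script above.
text = "<A057 B058> rar <something>"

# Assumed split of s/.../.../ at the first unescaped "/":
# everything before it is the pattern, "\1>" is the replacement.
pattern = r"<([A-z][A-z0-9]*)\b[^>]*>(.*)<"
replacement = r"\1>"

# Perl's s/// without /g replaces only the first match, hence count=1.
result = re.sub(pattern, replacement, text, count=1)
print("rar ==", result)
```

If this reading is right, the script never tested the full expression from the question, only its first part.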
+LogicalApex MVC Posted April 9, 2013
Regular Expressions shouldn't be used for parsing HTML...
Karl L. Posted April 9, 2013
[QUOTE=LogicalApex]
Regular Expressions shouldn't be used for parsing HTML...
[/QUOTE]
While I don't completely agree with the author's inflexible, strong opinion, the rant is amusingly written. Thanks for the laugh!
The_Decryptor Veteran Posted April 9, 2013
If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong.
One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)
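To make the point concrete, here is a small Python sketch (not from the thread) that runs the regex from the question over that tag-soup snippet and then feeds the same snippet to the standard library's html.parser. Note that html.parser is only a lenient tokenizer, not a full tree-building HTML parser, so this illustrates the difference in tolerance and nothing more. The class name TextCollector is hypothetical.

```python
import re
from html.parser import HTMLParser

soup = "<table><td>blah<td>blah</tr>"

# The regex from the question needs a closing tag whose name equals the opening one.
TAG_RE = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")
regex_match = TAG_RE.search(soup)


class TextCollector(HTMLParser):
    """Records start tags and text as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.texts.append(data)


collector = TextCollector()
collector.feed(soup)
collector.close()

print(regex_match)      # no opening tag has a matching close in this input
print(collector.tags)
print(collector.texts)
```

The regex finds nothing because no opening tag in the snippet has a closing tag of the same name, while the tokenizer still reports both cells' text.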
Xilo Posted April 9, 2013
[QUOTE=The_Decryptor]
If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong.
One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)
[/QUOTE]
Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.
Eric Veteran Posted April 9, 2013
'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'
Match "<", followed by any upper- or lower-case letter, then any number of letters or digits. These are put into a match group by the () around them. (The "\b" here is a word boundary rather than a backspace: it matches the position where the tag name ends.) After the word boundary, match any characters that are not a ">", followed by a ">". Next, match any number of characters (other than a newline, by default) and put them in the second match group. Then, match "</" followed by the text that was captured in group 1. Finally, match ">".
If it's meant to parse HTML it doesn't look like it would work. "<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>" might be closer and take whitespace into account, although I did not test this. :) (Groups are 0-indexed but the first one is the entire match, so the tag is in group 1 and the XML text is in group 2.)
EDIT: The edited one appears to work in C# at least.
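The walkthrough can be checked against the sample tags given earlier in the thread. The following is a minimal Python sketch, assuming Python's re module behaves like the engine the pattern was written for on these inputs; the id attribute in the second sample is made up. It also shows that [A-z] accepts more than letters, because the ASCII range from "A" to "z" includes a few punctuation characters such as "_".

```python
import re

ORIGINAL = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")
# Eric's suggested alternative, copied as posted.
ALTERNATIVE = re.compile(r"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>")

m = ORIGINAL.search("<h1>This is a heading</h1>")
print(m.group(0))  # whole match
print(m.group(1))  # tag name
print(m.group(2))  # text between the tags

# Hypothetical sample with an attribute: [^>]* in the original skips it,
# while the alternative only allows whitespace before ">".
with_attr = '<name id="x">Somebody\'s name</name>'
print(ORIGINAL.search(with_attr).group(1, 2))
print(ALTERNATIVE.search(with_attr))

# [A-z] spans ASCII 65..122, which includes "[", "\", "]", "^", "_" and "`".
print(bool(re.fullmatch(r"[A-z]", "_")))
```

Writing [A-Za-z] instead would restrict the class to letters only.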
The_Decryptor Veteran Posted April 9, 2013
[QUOTE=Xilo]
Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.
[/QUOTE]
Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish.
And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.
Eric Veteran Posted April 9, 2013
[QUOTE=The_Decryptor]
Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish.
And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.
[/QUOTE]
This would be fine, especially if you're just reading a config file in an embedded setting. XML isn't necessarily the same as HTML. It may not be parsing a site.
Torolol Posted April 9, 2013
I'm using Proxomitron, which relies on user-customized regexes to filter or modify annoying parts of HTML, JavaScript, or CSS. It works.
SlayerS_BoxeR Author Posted April 9, 2013
Sorry guys, I haven't made myself clear on this. I know it's a regex; my question was what this particular command actually means. Thanks.
Karl L. Posted April 9, 2013
[QUOTE=SlayerS_BoxeR]
Sorry guys, I haven't made myself clear on this. I know it's a regex; my question was what this particular command actually means. Thanks.
[/QUOTE]
GreyWolf's explanation (post #9) is probably the best so far in this thread. There are different variants of regular expressions implemented by various libraries and languages, so you should probably look up the documentation for your variant if you want more detail than GreyWolf's post (for example, the meaning of '\b', the word boundary matcher found in some regex variants).
If you can't find proper documentation or aren't targeting a specific regex variant, I recommend that you read the Perl regex documentation because it contains the most common regular expression extensions and is very well documented (with gratuitous examples). I also like the basic regex documentation on the Mozilla Developer Network because it is very well formatted and looks appealing.
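As a small illustration of what the word boundary does in this particular pattern, here is a Python sketch with a made-up input (not something from the thread). Without \b, the engine may backtrack and treat only part of a tag name as the name.

```python
import re

WITH_B = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")
WITHOUT_B = re.compile(r"<([A-z][A-z0-9]*)[^>]*>(.*)</\1>")

# Hypothetical mismatched input: opened as <bold>, closed as </b>.
sample = "<bold>text</b>"

print(WITH_B.search(sample))              # \b forces the whole name "bold" to be captured
print(WITHOUT_B.search(sample).group(1))  # backtracks to "b" and lets [^>]* swallow "old"
```

With \b in place, the mismatched pair is rejected; without it, the pattern reports a tag named "b" that was never opened.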
Xilo Posted April 9, 2013
[QUOTE=The_Decryptor]
Until the site markup changes for whatever reason and the hand-written parsing code starts returning gibberish.
And I can't imagine that the parsing is so time-sensitive that it's worth maintaining your own parsing code; proper parsing wouldn't take that much time.
[/QUOTE]
If the markup changes, the DOM structure will likely change as well. Either way, your code is broken, so your point is moot. :/
The_Decryptor Veteran Posted April 10, 2013
The markup can change without the resulting DOM changing (yay tag soup), and even if the DOM did change, there are ways to target specific DOM nodes without relying on the structure of the DOM up until that point (XPath, CSS selectors, IDs, etc.).
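A rough Python sketch of that idea, using the limited XPath support in the standard library's xml.etree.ElementTree. The two documents, the function name find_price and the id value "price" are invented for illustration, and the inputs are well-formed XML rather than real-world tag soup.

```python
import xml.etree.ElementTree as ET

# Two hypothetical layouts of the same page; only the nesting differs.
old_layout = "<html><body><span id='price'>42</span></body></html>"
new_layout = "<html><body><div><p><span id='price'>42</span></p></div></body></html>"


def find_price(markup):
    """Look the node up by its id attribute, wherever it sits in the tree."""
    root = ET.fromstring(markup)
    node = root.find(".//*[@id='price']")
    return node.text if node is not None else None


print(find_price(old_layout))
print(find_price(new_layout))
```

The same query works on both layouts because it names the node by its id instead of by the path leading to it.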
Lant Posted April 10, 2013
A nice tool that I like to use to understand regexes is http://www.regexper.com
You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.
Karl L. Posted April 10, 2013
[QUOTE=Lant]
A nice tool that I like to use to understand regexes is http://www.regexper.com
You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.
[/QUOTE]
Adding that extra backslash actually changes the meaning of the last component somewhat, but it's perfectly understandable that the regex parser on that website doesn't understand that particular extension. Thanks for the link; regexper is really neat!
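One way to check whether the extra backslash matters is to compare both spellings directly. The Python sketch below assumes Python's re module, where / is not a delimiter and \/ is just an escaped slash. Tools that wrap the pattern in /.../ may need the escape, and other regex variants may behave differently.

```python
import re

plain = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>")
escaped = re.compile(r"<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>")

# Sample strings taken from earlier posts in the thread.
samples = [
    "<h1>This is a heading</h1>",
    "<name>Somebody's name</name>",
    "<A057 B058> rar <something>",
]

for s in samples:
    a = plain.search(s)
    b = escaped.search(s)
    # Compare the captured groups, or None when there is no match.
    same = (a.groups() if a else None) == (b.groups() if b else None)
    print(s, same)
```

In this engine, at least, both spellings give the same result on every sample.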
+Majesticmerc MVC Posted April 11, 2013
[QUOTE=Lant]
A nice tool that I like to use to understand regexes is http://www.regexper.com
You'll need to enter "<([A-z][A-z0-9]*)\b[^>]*>(.*)<\/\1>"; an extra backslash is needed for the regex parser they use to understand it.
[/QUOTE]
That's awesome! Shamelessly stealing it for my bookmarks.