
what does this code mean in unix?


18 replies to this topic

#1 SlayerS_BoxeR

SlayerS_BoxeR

    Neowinian

  • Joined: 10-December 12
  • OS: Windows 7 Pro
  • Phone: Nexus 5

Posted 09 April 2013 - 01:52

I've been trying to understand what this command means/stands for.
Could anyone help me out with this, please?


'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'


#2 virtorio

virtorio

    Neowinian Senior

  • Tech Issues Solved: 14
  • Joined: 28-April 03
  • Location: New Zealand
  • OS: OSX 10.10, Windows 8.1
  • Phone: LG G3

Posted 09 April 2013 - 01:58

That is a regular expression, used to match an HTML/XML tag. e.g.
<h1>This is a heading</h1>
<name>Somebody's name</name>
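
A quick way to see this is to run the pattern against those two example lines. This is only a sketch in Python, which is an assumption (the thread never says which tool the pattern came from), and the names `TAG_PATTERN` and `describe` are made up for illustration.

```python
import re

# The pattern from post #1, unchanged. Using Python's re module is an assumption.
TAG_PATTERN = re.compile(r'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>')

def describe(line):
    """Return (tag name, inner text) for the first match, or None if nothing matches."""
    m = TAG_PATTERN.search(line)
    return (m.group(1), m.group(2)) if m else None

print(describe("<h1>This is a heading</h1>"))       # ('h1', 'This is a heading')
print(describe("<name>Somebody's name</name>"))     # ('name', "Somebody's name")
print(describe("<h1>Mismatched closing tag</h2>"))  # None
```

The last line shows what the `\1` at the end is for: the closing tag has to repeat whatever name was captured in the opening tag.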


#3 Andre S.

Andre S.

    Asik

  • Tech Issues Solved: 14
  • Joined: 26-October 05

Posted 09 April 2013 - 01:58

That's a regular expression (damnit got ninjaed).

#4 +Karl L.

Karl L.

    xorangekiller

  • Tech Issues Solved: 15
  • Joined: 24-January 09
  • Location: Virginia, USA
  • OS: Debian Testing

Posted 09 April 2013 - 02:32

That is a regular expression, used to match an HTML/XML tag. e.g.

<h1>This is a heading</h1>
<name>Somebody's name</name>


That use case makes sense. I was going to suggest that it looks suspiciously close to a Perl regular expression (mostly due to the '/\1' slang at the end), but I wasn't sure what it would be used for. Check out the perlre perldoc for Perl's regex documentation.

The following is a test Perl script:
#!/usr/bin/perl -w

$r="<A057 B058>   rar   <something>";
print "rar == $r\n";

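$r =~ s/<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>/;
# Note: the unescaped "/" in "</\1>" ends the pattern, so Perl reads this line as
# s/PATTERN/REPLACEMENT/ with "\1>" as the replacement text.
print "rar == $r\n";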

And this is the script's output:
\1 better written as $1 at ./rar.pl line 6.
rar == <A057 B058>   rar   <something>
rar == A057>something>
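
That output makes sense once you notice that Perl treats the `/` inside `</\1>` as the end of the pattern, so everything up to the second `<` is replaced with group 1 followed by `>`. A rough Python equivalent, assuming that reading of the statement (the variable names here are illustrative only):

```python
import re

text = "<A057 B058>   rar   <something>"

# What the Perl line effectively ran: the pattern stops at the unescaped "/",
# and "\1>" becomes the replacement. count=1 mirrors s/// without the /g flag.
truncated = re.sub(r'<([A-z][A-z0-9]*)\b[^>]*>(.*)<', r'\1>', text, count=1)
print(truncated)  # A057>something>

# The full pattern from post #1 needs a matching closing tag,
# which this input does not have, so it finds nothing.
full = re.search(r'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>', text)
print(full)  # None
```

To test the whole expression in Perl, the slash would need escaping or a different delimiter for the match.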


#5 +LogicalApex

LogicalApex

    Software Engineer

  • Tech Issues Solved: 8
  • Joined: 14-August 02
  • Location: Philadelphia, PA
  • OS: Windows 7 Ultimate x64
  • Phone: Nexus 5

Posted 09 April 2013 - 02:48

Regular Expressions shouldn't be used for parsing HTML...

#6 +Karl L.

Karl L.

    xorangekiller

  • Tech Issues Solved: 15
  • Joined: 24-January 09
  • Location: Virginia, USA
  • OS: Debian Testing

Posted 09 April 2013 - 03:24

Regular Expressions shouldn't be used for parsing HTML...


While I don't completely agree with the author's inflexible strong opinion, the rant is amusingly written. Thanks for the laugh!

#7 The_Decryptor

The_Decryptor

    STEAL THE DECLARATION OF INDEPENDENCE

  • Tech Issues Solved: 5
  • Joined: 28-September 02
  • Location: Sol System
  • OS: iSymbian 9.2 SP24.8 Mars Bar

Posted 09 April 2013 - 03:40

If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong. One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)
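
To make that concrete, here is a small sketch that feeds the messy snippet to Python's standard-library `html.parser` and to the regex from post #1. Python is an assumption, `TagLogger` is a hypothetical helper, and `html.parser` is only a lenient tokenizer: it reports the tags it sees but does not rebuild the tree the way a browser-grade parser would.

```python
import re
from html.parser import HTMLParser

MESSY = "<table><td>blah<td>blah</tr>"

class TagLogger(HTMLParser):
    """Hypothetical helper: records start and end tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

logger = TagLogger()
logger.feed(MESSY)
logger.close()
print(logger.start_tags)  # ['table', 'td', 'td']
print(logger.end_tags)    # ['tr']

# The regex needs a closing tag that repeats an opening tag's name.
# The only closing tag here is </tr>, and there is no <tr>, so nothing matches.
regex_match = re.search(r'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>', MESSY)
print(regex_match)  # None
```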

#8 Xilo

Xilo

    Neowinian Senior

  • Joined: 28-May 04
  • Location: Austin, TX

Posted 09 April 2013 - 03:50

If you're parsing HTML with anything other than a proper HTML parser, you're doing it wrong. One of the "benefits" of HTML is that you can write it like an absolute mess and it'll still parse properly, but that only applies to an actual HTML parser. Home-grown solutions that try to pick out tags are just going to break on that type of input. (Try "parsing" <table><td>blah<td>blah</tr> with regex.)

Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.

#9 Eric

Eric

    Neowinian Senior

  • Tech Issues Solved: 13
  • Joined: 02-August 06
  • Location: Greenville, SC

Posted 09 April 2013 - 04:06

'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>'

Match "<", followed by any upper- or lower-case letter, then any number of letters or digits. These are put into a match group by the () around them. (Strictly speaking, [A-z] also matches the few ASCII punctuation characters that sit between "Z" and "a"; [A-Za-z] is the tighter form.)
The "\b" is a word boundary in most regex flavors, not a backspace, so the tag name has to end at that point. After the word boundary, match any characters that are not a ">", followed by a ">".

Next, match any number of characters (".*" is greedy and normally stops at a newline) and put them in the second match group. Then, match "</" followed by the text that was captured in group 1.
Finally, match ">".

If it's to parse html it doesn't look like it would work.

"<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>" might be closer and take whitespace into account although I did not test this. :)

(Groups are 0-indexed, but group 0 is the entire match, so the tag name is in group 1 and the text inside the tag is in group 2.)

EDIT: The edited one appears to work in C# at least.
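
For anyone who wants to compare the two expressions side by side, here is a rough Python sketch (Python rather than C# is an assumption, and the names `ORIGINAL` and `SUGGESTED` are just labels for this example). It shows the group numbering and one trade-off of each pattern.

```python
import re

ORIGINAL = re.compile(r'<([A-z][A-z0-9]*)\b[^>]*>(.*)</\1>')
SUGGESTED = re.compile(r'<\s*([A-z][A-z0-9]*)\s*>([^<]*)<\s*/\1\s*>')

sample = "<b>one</b> and <b>two</b>"
m1 = ORIGINAL.search(sample)
m2 = SUGGESTED.search(sample)

print(m1.group(0))  # <b>one</b> and <b>two</b>   (group 0 is the whole match)
print(m1.group(1))  # b
print(m1.group(2))  # one</b> and <b>two          (greedy ".*" runs to the last </b>)
print(m2.group(2))  # one                         ("[^<]*" stops at the next "<")

# The suggested pattern drops "[^>]*", so it no longer accepts attributes.
with_attr = '<a href="x">link</a>'
print(ORIGINAL.search(with_attr) is not None)   # True
print(SUGGESTED.search(with_attr) is not None)  # False
```

So the original is more permissive about what sits inside the opening tag, while the suggested one is safer when several tags share a line but only handles bare tags with plain text between them.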

#10 The_Decryptor

The_Decryptor

    STEAL THE DECLARATION OF INDEPENDENCE

  • Tech Issues Solved: 5
  • Joined: 28-September 02
  • Location: Sol System
  • OS: iSymbian 9.2 SP24.8 Mars Bar

Posted 09 April 2013 - 04:11

Sometimes using a full DOM parser isn't the best approach. If you're tailoring for a specific site to only pull out specific data, a full parser is often slower and overkill.

Until the site markup changes for whatever reason and the hand written parsing code starts returning gibberish. And I can't imagine that the parsing is so time sensitive that it's worth maintaining your own parsing code, proper parsing wouldn't take that much time.

#11 Eric

Eric

    Neowinian Senior

  • Tech Issues Solved: 13
  • Joined: 02-August 06
  • Location: Greenville, SC

Posted 09 April 2013 - 04:14

Until the site markup changes for whatever reason and the hand written parsing code starts returning gibberish. And I can't imagine that the parsing is so time sensitive that it's worth maintaining your own parsing code, proper parsing wouldn't take that much time.


This would be fine especially if you're just reading a config file in an embedded setting. XML isn't necessarily the same as HTML. It may not be parsing a site.

#12 Torolol

Torolol

  • Joined: 24-November 12

Posted 09 April 2013 - 04:41

I'm using Proxomitron, which relies on user-customized regexes to filter or modify annoying parts of HTML, JavaScript, or CSS.
It works.

#13 OP SlayerS_BoxeR

SlayerS_BoxeR

    Neowinian

  • Joined: 10-December 12
  • OS: Windows 7 Pro
  • Phone: Nexus 5

Posted 09 April 2013 - 04:51

Sorry guys, I haven't made myself clear on this.
I know it's a regex; my question was what this particular command actually means.


thanks

#14 +Karl L.

Karl L.

    xorangekiller

  • Tech Issues Solved: 15
  • Joined: 24-January 09
  • Location: Virginia, USA
  • OS: Debian Testing

Posted 09 April 2013 - 15:50

Sorry guys, I haven't made myself clear on this.
I know it's a regex; my question was what this particular command actually means.


thanks


GreyWolf's explanation (post #9) is probably the best so far in this thread. There are different variants of regular expressions implemented by various libraries and languages, so you should probably look up the documentation for your variant if you want more detail than GreyWolf's post (for example, the meaning of '\b', the word boundary matcher found in some regex variants). If you can't find proper documentation or aren't targeting a specific regex variant, I recommend that you read the Perl regex documentation because it covers the most common regular expression extensions and is very well documented (with gratuitous examples). I also like the basic regex documentation on the Mozilla Developer Network because it is very well formatted and looks appealing.
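
The '\b' confusion is easy to reproduce, because the same two characters mean different things depending on who interprets them. A small sketch in Python (chosen only for illustration; other languages and regex variants have their own escaping rules):

```python
import re

# Raw string: the regex engine receives backslash-b, which it reads as a word boundary.
word_boundary = r'<a\b'

# Ordinary string: Python itself turns "\b" into a backspace character (code 8)
# before the regex engine ever sees the pattern.
backspace = '<a\b'

print(re.search(word_boundary, '<a href="x">') is not None)  # True: "a" ends a word here
print(re.search(word_boundary, '<abbr>') is not None)        # False: "a" is followed by "b"
print(ord(backspace[-1]))                                    # 8
print(re.search(backspace, '<a href="x">') is not None)      # False: no backspace in the text
```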

#15 Xilo

Xilo

    Neowinian Senior

  • Joined: 28-May 04
  • Location: Austin, TX

Posted 09 April 2013 - 22:22

Until the site markup changes for whatever reason and the hand written parsing code starts returning gibberish. And I can't imagine that the parsing is so time sensitive that it's worth maintaining your own parsing code, proper parsing wouldn't take that much time.

If the markup changes, the DOM structure will likely change as well, meaning your code is broken either way. So your point is moot. :/