• 0

Totally confused on using HTML Agility Pack


Question

I have no idea how to use this thing. I need to scrape data off a page, specifically http://e-juice-recipes.com/recipe/reign-drops-throne-clone-banana-nut-bread/, from the <tbody id="results_table"> element and all the tr and td elements under it.

 

I have no idea where to start - documentation seems nonexistent.

Can anyone help?

 

Many thanks :-)


8 answers to this question

  • 0

A quick Google search yields a lot of relevant info here: http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack , have you checked that out?

 

Note that you don't necessarily need to use it. Often you can get the data you want from a web page just by looking for lines that contain a particular string. Depending on the complexity of what you want to do, learning a library might be more conceptual overhead than writing the parsing code yourself.
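As a rough illustration of the "look for lines that contain a particular string" idea, here is a minimal sketch. It is written in Python only to keep it short; the sample markup, the class names and the helper names (lines_containing, text_between) are all made up for the example and are not taken from the real page. The same approach ports to any language with basic string functions.

```python
# Hypothetical sample markup; the real page's HTML may look quite different.
SAMPLE_HTML = """<html>
<body>
<table>
<tr><td class="name">Banana</td><td class="pct">4.0</td></tr>
<tr><td class="name">Walnut</td><td class="pct">1.5</td></tr>
</table>
</body>
</html>"""


def lines_containing(text, marker):
    """Return the stripped lines of `text` that contain `marker`."""
    return [line.strip() for line in text.splitlines() if marker in line]


def text_between(line, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = line.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = line.find(end, begin)
    if finish == -1:
        return None
    return line[begin:finish]


# Keep only the lines that mention the marker, then cut the value out of each.
names = [text_between(line, '<td class="name">', "</td>")
         for line in lines_containing(SAMPLE_HTML, 'class="name"')]
print(names)
```

This is fragile by design: it breaks as soon as the markup is reformatted, which is the trade-off against learning a proper parser.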


  • 0

I'm not designing with PHP, so the DOM parser won't work.

I'll take a look at the link Andre S. posted - thanks :-)


  • 0

What are you using then? ASP? Node.js?

 

If you use Node.js, or even do it client-side, it's even easier, since you can then use jQuery to filter the HTML object.

 

Edit: seems you're writing for .NET, so I suppose an application or an ASP website :P


  • 0

HTML Agility Pack is a .NET library that's designed for this sort of thing, similar to Python's BeautifulSoup if you're familiar with it.

 

That said, yeah, the documentation for it is rather weak, but there are plenty of examples and answered questions on the web (Andre's link, for example). It's a pretty easy library to use; the trickiest part is getting the node path just right so that it scrapes the data properly. Once you get that right, the rest is really simple. I've used it a couple of times.
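To show what "getting the node path right" means, here is a small sketch of selecting table rows by path. It uses Python's standard xml.etree.ElementTree on a made-up, well-formed snippet, not HTML Agility Pack itself and not the real page. As far as I know, the rough HTML Agility Pack equivalent is calling SelectNodes on the document's DocumentNode with an XPath such as //tbody[@id='results_table']/tr, but treat that as something to check against the library rather than a tested recipe.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page. Real-world HTML is often not
# valid XML, which is exactly why tolerant parsers exist.
SAMPLE = """<html><body>
<table>
  <tbody id="results_table">
    <tr><td>Banana</td><td>4.0</td></tr>
    <tr><td>Walnut</td><td>1.5</td></tr>
  </tbody>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)

# The node path: any <tbody> with id="results_table", then its direct <tr> children.
row_path = ".//tbody[@id='results_table']/tr"

# One list of cell texts per matching row.
rows = [[(td.text or "").strip() for td in tr.findall("td")]
        for tr in root.findall(row_path)]
print(rows)
```

If the path is slightly off (wrong id, or a missing level between tbody and tr), the query simply matches nothing, so an empty result usually means the path needs another look.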


  • 0

I see a bit of a problem, by the way: the #results_table is generated with JS after the page has loaded...

 

 

Which means you have to scrape the JS at the bottom of the page instead and interpret that  :/

 

I'm currently looking for a simpler possibility.

 

Edit: nope, the values are put into the JS block at the bottom of the page and then used to calculate the actual content for the HTML table with Recipe() (recipe.js).

So your only option is scraping that JS block, then interpreting the JS code and doing the calculations done by Recipe() yourself...
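For what it's worth, pulling values out of an inline script block can be sketched like this. Everything specific here is an assumption for illustration: the variable name ingredients, the JSON-like array it holds, and the helper extract_js_array are invented, and the real page's script may be laid out very differently. It also stops at extracting the raw values and does not attempt to reproduce whatever Recipe() calculates.

```python
import json
import re

# Hypothetical page: the table body is empty in the delivered HTML and the
# data lives in a script block. The variable name and its shape are assumptions.
SAMPLE_PAGE = """<html><body>
<table><tbody id="results_table"></tbody></table>
<script>
var ingredients = [{"name": "Banana", "percent": 4.0}, {"name": "Walnut", "percent": 1.5}];
new Recipe(ingredients);
</script>
</body></html>"""


def extract_js_array(html, var_name):
    """Find `var <var_name> = [...];` and parse the array literal as JSON.

    Only works when the literal happens to be valid JSON; returns None when
    no such assignment is found.
    """
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\]);"
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_js_array(SAMPLE_PAGE, "ingredients")
print(data)
```

If the script stores its values as plain JS rather than JSON-compatible literals, json.loads will fail and a more forgiving approach would be needed.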


  • 0

Interesting you found that. Thank you.

If it's that difficult, I'll just abandon the idea; it's not worth that much hassle.

 

Oh, and I'm only developing a VB.NET application; it was just going to pull data for a textbox in the program, so the DOM and PHP stuff wouldn't apply.


This topic is now closed to further replies.