Grab data from other websites

10,368 · November 23, 2007

title explains what i want, but i'll give some examples.

how do i grab data from other websites, such as an image from something like a daily comic, a post count from a forum, or the number of comments that have been posted on another website?

this would be without owning the other website as well, so no forwarding the data to a page and then grabbing it from there.

can it be done? and how?

2,142 · November 23, 2007

Grabbing an image daily wouldn't be too difficult: just link to the image, as long as you can tell what the image will be called. For example, if the daily image is named after the date, like 11-23-2007.jpg, then you have your PHP look for the file whatever.com/11-23-2007.jpg, and tomorrow the PHP/ASP will use tomorrow's date, like 11-24-2007.jpg.

Post count or number of comments will most likely be more difficult. You would have to see if that information is rendered somewhere on one of the pages, then have PHP/ASP parse out that information so that you can display it. For example, on Neowin the profile box shows a person's post count. In theory you can have your site find that info and copy it.

You wouldn't be able to get raw data from the database unless the site allows it, but since you are asking about a site that you don't own, I assume you won't have access to the db.

P.S. Some people may not like your idea, and it could even be illegal if you don't have permission; I'm not positive. I know you're not allowed to constantly load an image for all your viewers off another host unless you have permission.
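A minimal sketch of the date-based approach, assuming the remote site really does name its files month-day-year. The helper name dailyImageUrl and the whatever.com base URL are placeholders for illustration, not anything a real site provides.

```php
<?php
// Hypothetical helper: build the URL of a daily image named after the date.
// Assumes the file name format is MM-DD-YYYY.jpg, as in the example above.
function dailyImageUrl(string $baseUrl, DateTimeInterface $date): string
{
    // 'm-d-Y' gives a zero-padded month and day plus a four-digit year
    return rtrim($baseUrl, '/') . '/' . $date->format('m-d-Y') . '.jpg';
}

// Today's image; the same call tomorrow yields tomorrow's file name.
$url = dailyImageUrl('http://whatever.com', new DateTimeImmutable());
echo '<img src="' . htmlspecialchars($url) . '">';
```

If the site uses a different naming scheme, only the format string needs to change.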

2,244 · November 23, 2007

you could use php and curl along with regular expressions to get all of the information you need.

10,368 · November 23, 2007

i was only using the image as an example, i'm interested in using it for simple things like post count off a forum.

2,201 · November 23, 2007

Zend_Http to connect to the website then use regular expressions to find the data you're after and extract it.

44,329 · November 23, 2007

You really want to be looking at sites/webapps with a nice API.

Example: I can grab the number (and content) of things from Flickr.

Or do you mean something else?

10,368 · November 23, 2007

ok...let's make the goal to grab my post count.

Zend_Http to connect to the website then use regular expressions to find the data you're after and extract it.

hmmm i'll look into that.

2,244 · November 26, 2007

ok...let's make the goal to grab my post count.

ok, so the first thing you need to know is a reliable place to get your post count from. i chose your profile page since each user has one and it uniquely identifies the user. your profile page is https://www.neowin.net/forum/index.php?showuser=114394. so first get the contents of that page. (i'm using php for all of this).

$page_contents = file_get_contents("https://www.neowin.net/forum/index.php?showuser=114394");

now parse the information for the post count. if we look at the html that was returned the part we're trying to search for is:

<td>3,274 posts (4 per day)</td>

so posts can have numbers and commas as their formatting. let's write a regular expression to pull only the post count back. i'm going to search for a td that first contains numbers and commas followed by a space and then the string "posts". it's a possibility that this pattern may match elsewhere on the page but i'm not going to worry about it. if this were a real example i would probably look for an even more unique pattern (ie. include the column above the post count that has the string "Local Time").

$matches = array();
// capture group 1 holds only the digits and commas, e.g. "3,274"
if (preg_match('/<td>([0-9,]+) post/', $page_contents, $matches)) {
    echo $matches[1];
}

this should match the pattern and echo it to the screen. it's important to note that this example will only work for sites that don't require you to log in and where all of the information is transferred when the original page is requested (if the page loaded the information with javascript then this wouldn't work). in the case of neowin, you have to be logged in to actually view that page, and since php is not logged into your account it can't actually get that profile page. if you need to get to pages that require you to be logged in, i would suggest using something like curl, which allows php to simulate being a web browser and does all of the dirty work of storing and sending the cookie information.
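a rough sketch of the curl approach, assuming the site uses a plain form login with a session cookie. the function name fetchAsLoggedInUser, the URLs and the form field names are all made up for illustration; a real forum will have its own field names and may add hidden tokens to the login form that this sketch doesn't handle.

```php
<?php
// sketch only: the login URL, form field names and page URL are hypothetical
// and depend entirely on the site you're logging in to.
function fetchAsLoggedInUser(string $loginUrl, array $loginFields, string $pageUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the response body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect most login forms send back
        CURLOPT_COOKIEFILE     => '',   // empty string: keep cookies in memory for this handle
    ));

    // step 1: post the login form so the site sets its session cookie
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($loginFields));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('login request failed: ' . curl_error($ch));
    }

    // step 2: fetch the protected page on the same handle, which resends the cookie
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back from POST to GET
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('page request failed: ' . curl_error($ch));
    }
    return $html;
}

// hypothetical usage (field names are made up):
// $html = fetchAsLoggedInUser('https://example.com/login',
//     array('user' => 'me', 'pass' => 'secret'), 'https://example.com/profile');
```

the returned html can then go through the same preg_match step as before.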

May 24, 2010

Use Automation Anywhere. It's a data extraction tool that can grab data from sites. I think berserk needs different types of data, like an image URL, posts from forums, counts, etc. You can extract data of any type from the web using AA. I do it personally, hence I'm advising you to use this software. Go through this web data grabber's details.

3,529 · May 24, 2010

Using regular expressions for parsing HTML is bad practice. You should be using DOMXPath.
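A minimal sketch of the DOMXPath route, assuming the markup looks like the `<td>` quoted earlier in the thread. The function name extractPostCount and the sample HTML are made up for illustration; a small regex is still used, but only on the text of a single cell rather than on the raw HTML.

```php
<?php
// Sketch: pull the post count out of markup like "<td>3,274 posts (4 per day)</td>".
function extractPostCount(string $html): ?int
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Every <td> whose text mentions "posts"
    foreach ($xpath->query('//td[contains(., "posts")]') as $cell) {
        // Leading digits and commas, e.g. "3,274" in "3,274 posts (4 per day)"
        if (preg_match('/^\s*([0-9,]+)\s+posts/', $cell->textContent, $m)) {
            return (int) str_replace(',', '', $m[1]);
        }
    }
    return null; // no matching cell found
}

$sample = '<table><tr><td>3,274 posts (4 per day)</td></tr></table>';
var_dump(extractPostCount($sample)); // int(3274)
```

Because the query works on the parsed tree, changes in attributes or whitespace around the cell are less likely to break it than a pattern matched against the raw page source.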

There's a gorgeous library that uses syntax like jQuery's selectors built on DOMXPath called CSSlib.

10,368 · May 24, 2010

WOW this thread is old!

Use Automation Anywhere. It's a data extraction tool that can grab data from sites. I think berserk needs different types of data, like an image URL, posts from forums, counts, etc. You can extract data of any type from the web using AA. I do it personally, hence I'm advising you to use this software. Go through this web data grabber's details.

thanks for the suggestion, unless you're a spam bot.

3,529 · May 24, 2010

WOW this thread is old!

Haha, I should pay more attention :p

May 25, 2010

WOW this thread is old!

thanks for the suggestion, unless you're a spam bot.

No berserk, I'm not a spam bot, and neither is the software. I'm just a proactive user. I use it regularly dude and believe me it works great! That's why I posted it, and I'm sure it will be of help to you also :)

2,244 · May 25, 2010

Haha, I should pay more attention :p

lol you're telling me. i was reading my response above thinking "ew don't use regular expressions for that" then i realized i was the person who made the post. :laugh:

Posters, in thread order: Berserk87 (question), Ash, nvme, Berserk87, Popcorned1, Dick Montage, Berserk87, nvme, techsmart, Kudos Veteran, Berserk87, Kudos Veteran, techsmart, nvme