Grab data from other websites


Question

the title explains what i want, but i'll give some examples.

how do i grab data from other websites, such as an image from something like a daily comic, a post count from a forum, or the number of comments that have been posted on another website?

this would be without owning the other website, so no forwarding the data to a page and then grabbing it from there.

can it be done? and how?

13 answers to this question

  • 0

Grabbing an image daily wouldn't be too difficult: just link to the image, as long as you can tell what the image will be called. For example, if the daily image is named after the date, like 11-23-2007.jpg, then you have your php look for the file whatever.com/11-23-2007.jpg, and tomorrow the php/asp will use tomorrow's date, like 11-24-2007.jpg.
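
A rough sketch of that idea in PHP, assuming the site really does name its files MM-DD-YYYY.jpg. The dailyImageUrl helper and the http://whatever.com address are placeholders made up for illustration, not a real comic site:

```php
<?php
// Hypothetical helper: build the image URL for a given date.
// Assumes the remote site names its files MM-DD-YYYY.jpg.
function dailyImageUrl(string $baseUrl, DateTimeInterface $date): string
{
    // 'm-d-Y' gives a zero-padded month and day plus a four-digit year.
    return rtrim($baseUrl, '/') . '/' . $date->format('m-d-Y') . '.jpg';
}

// A fixed date is used here; pass new DateTimeImmutable() for "today".
$url = dailyImageUrl('http://whatever.com', new DateTimeImmutable('2007-11-23'));

// Escape the URL before dropping it into the page.
echo '<img src="' . htmlspecialchars($url) . '" alt="Daily comic">';
```

If the site ever changes its naming scheme the link simply breaks, so this only works while the pattern stays predictable.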

Post count or number of comments will most likely be more difficult. You would have to see if that information is rendered somewhere on one of the pages, then make php/asp parse out that information so that you can display it. For example, on neowin the profile box shows a person's post count. In theory you can have your site find that info and copy it.

You wouldn't be able to get raw data from the database unless the site allows it, and since you are asking about a site that you don't own, I would assume you won't have access to the db.

P.S. some people may not like your idea, and it could even be illegal if you don't have permission, i'm not positive. I know you're not allowed to constantly load an image for all your viewers off another host unless you have permission.

  • 0
ok...let's make the goal to grab my post count.

ok, so the first thing you need to know is a reliable place to get your post count from. i chose your profile page since each user has one and it uniquely identifies the user. your profile page is https://www.neowin.net/forum/index.php?showuser=114394. so first get the contents of that page. (i'm using php for all of this).

$page_contents = file_get_contents("https://www.neowin.net/forum/index.php?showuser=114394");

now parse the information for the post count. if we look at the html that was returned the part we're trying to search for is:

<td>3,274 posts (4 per day)</td>

so the post count can contain digits and commas. let's write a regular expression to pull back only the post count. i'm going to search for a td that starts with digits and commas, followed by a space and then the string "posts". it's possible that this pattern matches elsewhere on the page, but i'm not going to worry about it. if this were a real example i would probably look for an even more unique pattern (i.e. include the column above the post count that has the string "Local Time").

$matches = array();
// $matches[0] is the whole match; $matches[1] is just the number, e.g. "3,274"
if (preg_match('/<td>([0-9,]+) post/', $page_contents, $matches)) {
    echo $matches[1];
}

this should match the pattern and echo it to the screen. it's important to note that this example will only work for sites that don't require you to log in and where all of the information is transferred when the original page is requested (if the page loaded the information with javascript then this wouldn't work).

in the case of neowin, you have to be logged in to actually view that page, and since php is not logged into your account it can't actually get that profile page. if you need to get to pages that require you to be logged in, i would suggest using something like curl, which allows php to simulate being a web browser and does all of the dirty work of storing and sending the cookie information.
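
for the logged-in case, here is a minimal sketch of the curl approach. it assumes php's curl extension is installed. the fetchWithCookies helper is a name made up for this example, and the login url and form field names in the usage comment are placeholders, since every site's login form is different:

```php
<?php
// Hypothetical helper: fetch a URL while keeping cookies in $cookieFile,
// so a later request can reuse the session from an earlier login request.
function fetchWithCookies(string $url, string $cookieFile, ?array $postFields = null): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects (common after a login)
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies saved earlier are sent back
    ]);
    if ($postFields !== null) {
        // Send the fields as a normal form submission.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $body;
}

// Usage (placeholders only, nothing here is a real endpoint or field name):
// $jar = tempnam(sys_get_temp_dir(), 'cookies');
// fetchWithCookies('https://example.com/login', $jar, ['user' => '...', 'pass' => '...']);
// $page_contents = fetchWithCookies('https://example.com/profile', $jar);
```

the first call logs in and stores the session cookie, the second call sends it back. check the site's terms before automating a login like this.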

  • 0

Use Automation Anywhere. It's a data extraction tool that can grab data from sites. I think berserk needs different types of data, like an image URL, posts from forums, counts, etc. You can extract web data of any type using AA. I do it personally, hence I'm advising you to use this software. Take a look at this web data grabber's details.

  • 0

Using regular expressions for parsing HTML is bad practice. You should be using DOMXPath.

There's a gorgeous library that uses syntax like jQuery's selectors built on DOMXPath called CSSlib.
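
For anyone wanting to see the DOMXPath route, here is a minimal sketch using only PHP's built-in DOM extension. It assumes the count sits in a plain td like the "3,274 posts (4 per day)" cell quoted earlier in the thread; the extractPostCount name is made up for this example. It still uses a small regular expression, but only on the text of one cell, not on the raw HTML:

```php
<?php
// Hypothetical helper: return the post count as an integer, or null if not found.
function extractPostCount(string $html): ?int
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string on some PHP versions
    }

    $doc = new DOMDocument();
    // Real-world pages are rarely valid markup, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Look at every <td> whose text mentions "posts".
    foreach ($xpath->query('//td[contains(., "posts")]') as $cell) {
        // Pull the leading number out of the cell text, e.g. "3,274".
        if (preg_match('/^\s*([0-9,]+) posts/', $cell->textContent, $m)) {
            return (int) str_replace(',', '', $m[1]);
        }
    }
    return null;
}

$sample = '<table><tr><td>3,274 posts (4 per day)</td></tr></table>';
echo extractPostCount($sample); // prints 3274
```

The XPath expression is the part to tighten for a real page, for example by anchoring on a nearby cell or an id, since "any td mentioning posts" can easily match more than one place.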

  • 0

WOW this thread is old!

thanks for the suggestion, unless you're a spam bot.

  • 0

No berserk, I'm not a spam bot, and neither is the software. I'm just a proactive user. I use it regularly, dude, and believe me, it works great! That's why I posted it, and I'm sure it will be of help to you also :)

  • 0

Haha, I should pay more attention :p

lol you're telling me. i was reading my response above thinking "ew, don't use regular expressions for that", then i realized i was the person who made the post. :laugh:

This topic is now closed to further replies.