Berserk87

Grab data from other websites

14 posts in this topic

title explains what i want, but i'll give some examples.

how do i grab data from other websites, such as an image from something like a daily comic, the post count from a forum, or the number of comments that have been posted on another website?

this would be without owning the other website, so no forwarding the data to a page and then grabbing it from there.

can it be done? and how?


Grabbing an image daily wouldn't be too difficult: just link to the image, as long as you can tell what the image will be called. If the daily image is named after the date, like 11-23-2007.jpg, then your PHP looks for the file whatever.com/11-23-2007.jpg, and tomorrow the PHP/ASP will use tomorrow's date, like 11-24-2007.jpg.

Post count or number of comments will most likely be more difficult. You would have to see whether that information is rendered somewhere on one of the pages, then make PHP/ASP parse out that information so that you can display it. For example, on Neowin the profile box shows a person's post count. In theory you can have your site find that info and copy it.

You wouldn't be able to get raw data from the database unless the site allows you to, but since you are asking about a site that you don't own, I assume you won't have access to the DB.

P.S. Some people may not like your idea, and it could even be illegal if you don't have permission; I'm not positive. I know you're not allowed to constantly load an image for all your viewers off another host unless you have permission.
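
A minimal sketch of the date-based image idea, in PHP. The helper name daily_image_url, the base URL and the month-day-year .jpg naming scheme are all assumptions taken from the example above; a real site may name its files differently.

```php
<?php
// Hypothetical helper: builds the URL of a daily image that is named after its date.
// The m-d-Y.jpg naming scheme is an assumption, not something every site follows.
function daily_image_url(string $baseUrl, ?int $timestamp = null): string
{
    // Default to "now" when no timestamp is given.
    $timestamp = $timestamp ?? time();

    // 'm-d-Y' gives a zero-padded month and day plus a four-digit year, e.g. 11-23-2007.
    return rtrim($baseUrl, '/') . '/' . date('m-d-Y', $timestamp) . '.jpg';
}

// Usage: print an <img> tag pointing at today's image.
$url = daily_image_url('http://whatever.com');
echo '<img src="' . htmlspecialchars($url) . '" alt="Daily image">';
```

Passing an explicit timestamp makes it easy to check what the URL would be for any given day.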


you could use php and curl along with regular expressions to get all of the information you need.
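
A rough sketch of what that could look like, assuming the curl extension is installed. The function names fetch_page and extract_post_count are made up for illustration, and the sample markup is the "3,274 posts" cell quoted later in this thread; other sites will need a different pattern.

```php
<?php
// Hypothetical helper: fetch a page body with curl; returns null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Hypothetical helper: pull the first "<number> posts" figure out of the HTML.
// Returns null when nothing matches.
function extract_post_count(string $html): ?int
{
    if (preg_match('/([0-9][0-9,]*)\s+posts?\b/i', $html, $matches) !== 1) {
        return null;
    }

    // Strip the thousands separators before converting to an integer.
    return (int) str_replace(',', '', $matches[1]);
}

// Usage with an inline sample, so it runs without network access.
$sample = '<td>3,274 posts (4 per day)</td>';
var_dump(extract_post_count($sample));
```

In real use you would pass the result of fetch_page() to extract_post_count() and handle the null cases.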


i was only using the image as an example. i'm interested in using it for simple things like the post count off a forum.

Zend_Http to connect to the website then use regular expressions to find the data you're after and extract it.


You really want to be looking at sites/webapps with a nice API.

Example: I can grab the number (and content) of things from Flickr.

Or do you mean something else?


ok...lets make the goal to grab my post count.

Zend_Http to connect to the website then use regular expressions to find the data you're after and extract it.

hmmm ill look into that.

ok...lets make the goal to grab my post count.

ok, so the first thing you need to know is a reliable place to get your post count from. i chose your profile page since each user has one and it uniquely identifies the user. your profile page is http://www.neowin.net/forum/index.php?showuser=114394. so first get the contents of that page. (i'm using php for all of this).

$page_contents = file_get_contents("http://www.neowin.net/forum/index.php?showuser=114394");

now parse the information for the post count. if we look at the html that was returned the part we're trying to search for is:

<td>3,274 posts (4 per day)</td>

so the post count is made of digits and commas. let's write a regular expression to pull back only the post count. i'm going to search for a td that contains digits and commas followed by a space and then the string "post". it's possible that this pattern matches elsewhere on the page, but i'm not going to worry about it. if this were a real example i would probably look for a more distinctive pattern (e.g. include the column above the post count that has the string "Local Time").

$matches = array();
// $matches[0] is the whole match; $matches[1] is the captured post count.
if (preg_match('/<td>([0-9,]+) post/', $page_contents, $matches)) {
    echo $matches[1];
}

this should match the pattern and echo it to the screen. it's important to note that this example will only work for sites that don't require you to log in and where all of the information is transferred when the original page is requested (if the page loaded the information with javascript then this wouldn't work). in the case of neowin, you have to be logged in to actually view that page, and since php is not logged into your account it can't actually get that profile page. if you need to get to pages that require you to be logged in, i would suggest using something like curl, which allows php to simulate being a web browser and does all of the dirty work of storing and sending the cookie information.
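
a rough sketch of the curl-with-cookies idea, assuming the curl extension is available. the function names, the login URL and the form field names (username, password) are all hypothetical; the real login form's HTML tells you the actual field names, and some sites add hidden tokens that this sketch does not handle.

```php
<?php
// Hypothetical field names; check the real login form for the actual ones.
function build_login_body(string $username, string $password): string
{
    return http_build_query(['username' => $username, 'password' => $password]);
}

// Log in once, keep the session cookie, then fetch a page that needs the login.
// Returns the page body, or null if either request fails.
function fetch_logged_in_page(string $loginUrl, string $pageUrl, string $loginBody, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // enables curl's cookie handling and loads saved cookies
    ]);

    // Step 1: POST the credentials to the login URL.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $loginBody);
    if (curl_exec($ch) === false) {
        return null;
    }

    // Step 2: GET the protected page on the same handle, so the cookies carry over.
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}
```

reusing one handle for both requests is the design choice here: curl keeps the cookies from the login response in memory and sends them with the second request.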


Use Automation Anywhere. It's a data extraction tool that can grab data from sites. I think berserk needs different types of data, like an image URL, posts from forums, counts, etc. You can extract web data of any type using AA. I use it personally, hence I'm advising you to use this software. Go through this web data grabber's details.


Using regular expressions for parsing HTML is bad practice. You should be using DOMXPath.

There's a gorgeous library that uses syntax like jQuery's selectors built on DOMXPath called CSSlib.
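
For comparison, here is a small sketch of the DOMXPath route using only PHP's bundled DOM extension (not the CSS-selector library mentioned above). The function name and the sample table are made up for illustration; the XPath expression assumes the count sits in a td whose text contains "posts", as in the example earlier in the thread. A tiny regex is still used, but only on the text of the cell that was found, not on the HTML itself.

```php
<?php
// Hypothetical helper: find the post count by walking the DOM instead of regex-matching raw HTML.
function extract_post_count_xpath(string $html): ?int
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely well-formed, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Look at every <td> whose text mentions "posts".
    foreach ($xpath->query('//td[contains(., "posts")]') as $cell) {
        // Read the leading number from the cell's plain text.
        if (preg_match('/^\s*([0-9][0-9,]*)\s+posts/', $cell->textContent, $m) === 1) {
            return (int) str_replace(',', '', $m[1]);
        }
    }

    return null;
}

// Usage with an inline sample, so it runs without network access.
$html = '<table><tr><td>Local Time</td></tr><tr><td>3,274 posts (4 per day)</td></tr></table>';
var_dump(extract_post_count_xpath($html));
```

The advantage over the pure regex version is that attributes, extra whitespace or nested tags around the cell no longer break the match.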


WOW this thread is old!

Use Automation Anywhere. It's a data extraction tool that can grab data from sites. I think berserk needs different type of like an image url, post from forums, count, etc. You can extract data from web of any type using AA. I do it personally, hence I m advising you to use this software. Go through this web data grabber 's detail.

thanks for the suggestion, unless you're a spam bot.


WOW this thread is old!

Haha, I should pay more attention :p


WOW this thread is old!

thanks for the suggestion, unless you're a spam bot.

No berserk, I'm not a spam bot and neither is the software. I'm just a proactive user. I use it regularly, dude, and believe me it works great! That's why I posted it, and I'm sure it will be of help to you also :)


Haha, I should pay more attention :p

lol you're telling me. i was reading my response above thinking "ew, don't use regular expressions for that" and then i realized i was the person who made the post. :laugh:
