
Grab data from other websites


13 replies to this topic

#1 Berserk87

Berserk87

    Neowinian Senior

  • Joined: 14-June 05
  • Location: BC, Canada

Posted 23 November 2007 - 09:25

title explains what i want, but i'll give some examples.

how do i grab data from other websites, such as an image from something like a daily comic, the post count from a forum, or the number of comments that have been posted on another website?

this would be without you owning the other website as well...so no forwarding the data to a page, then grabbing it from there.

can it be done? and how?


#2 Ash

Ash

    Neowinian Fanatic

  • Joined: 01-September 01

Posted 23 November 2007 - 09:40

Grabbing an image daily wouldn't be too difficult: just link to the image, as long as you can tell what the image will be called. For example, if the daily image is named after the date, like 11-23-2007.jpg, then you have your PHP look for the file whatever.com/11-23-2007.jpg, and tomorrow the PHP/ASP will use tomorrow's date, like 11-24-2007.jpg.
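A minimal PHP sketch of the date-based idea above. The host whatever.com and the MM-DD-YYYY.jpg naming scheme are the made-up ones from this post, and the function name daily_image_url is hypothetical; a real site may name its files quite differently.

```php
<?php
// Sketch only: the host and the MM-DD-YYYY.jpg naming scheme are the
// hypothetical ones from the post above, not a real site's layout.
function daily_image_url(DateTimeInterface $date, string $baseUrl = 'http://whatever.com/'): string
{
    // 'm-d-Y' gives a zero-padded month and day plus a four-digit year.
    return $baseUrl . $date->format('m-d-Y') . '.jpg';
}

$today = new DateTimeImmutable('now');
// Escape the URL before dropping it into HTML.
echo '<img src="' . htmlspecialchars(daily_image_url($today)) . '" alt="Daily comic">', "\n";
```

This only builds a link, so it works without fetching anything, but it breaks silently if the other site changes its naming or skips a day.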

Post count or number of comments will most likely be more difficult. You would have to see if that information is rendered somewhere on one of the pages, then have PHP/ASP parse out that information so that you can display it. For example, on Neowin the profile box shows a person's post count. In theory you can have your site find that info and copy it.

You wouldn't be able to get raw data from the database unless the site allows it, but since you are asking about a site that you don't own, I would assume you won't have access to the db.

P.S. Some people may not like your idea, and it could even be illegal if you don't have permission; I'm not positive. I know you're not allowed to constantly load an image for all your viewers off another host unless you have permission.

#3 nvme

nvme

    Disgruntled Poster

  • Joined: 28-April 03

Posted 23 November 2007 - 09:54

you could use php and curl along with regular expressions to get all of the information you need.

#4 OP Berserk87

Berserk87

    Neowinian Senior

  • Joined: 14-June 05
  • Location: BC, Canada

Posted 23 November 2007 - 15:55

i was only using the image as an example, i'm interested in using it for simple things like post count off a forum.

#5 Popcorned1

Popcorned1

    Neowinian Senior

  • Joined: 03-October 04

Posted 23 November 2007 - 15:58

Zend_Http to connect to the website then use regular expressions to find the data you're after and extract it.

#6 +Nik L

Nik L

    Where's my pants?

  • Tech Issues Solved: 2
  • Joined: 14-January 03

Posted 23 November 2007 - 16:02

You really want to be looking at sites/webapps with a nice API.

Example: I can grab the number (and content) of things from Flickr.

Or do you mean something else?

#7 OP Berserk87

Berserk87

    Neowinian Senior

  • Joined: 14-June 05
  • Location: BC, Canada

Posted 23 November 2007 - 19:33

ok...lets make the goal to grab my post count.



Zend_Http to connect to the website then use regular expressions to find the data you're after and extract it.


hmmm ill look into that.

#8 nvme

nvme

    Disgruntled Poster

  • Joined: 28-April 03

Posted 26 November 2007 - 13:02

ok...lets make the goal to grab my post count.


ok, so the first thing you need to know is a reliable place to get your post count from. i chose your profile page since each user has one and it uniquely identifies the user. your profile page is http://www.neowin.ne...showuser=114394. so first get the contents of that page. (i'm using php for all of this).

$page_contents = file_get_contents("http://www.neowin.net/forum/index.php?showuser=114394");

now parse the information for the post count. if we look at the html that was returned the part we're trying to search for is:

<td>3,274 posts (4 per day)</td>


so posts can have numbers and commas as their formatting. let's write a regular expression to pull only the post count back. i'm going to search for a td that first contains numbers and commas followed by a space and then the string "posts". it's a possibility that this pattern may match elsewhere on the page but i'm not going to worry about it. if this were a real example i would probably look for an even more unique pattern (ie. include the column above the post count that has the string "Local Time").

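$matches = array();
// $matches[0] is the whole match ("<td>3,274 post"); $matches[1] is just the captured count
if (preg_match('/<td>([0-9,]+) post/', $page_contents, $matches)) {
    echo $matches[1];
}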

this should match the pattern and echo it to the screen. it's important to note that this example will only work for sites that don't require you to log in and where all of the information is transferred when the original page is requested (if the page loaded the information with javascript then this wouldn't work). in the case of neowin, you have to be logged in to actually view that page, and since php is not logged into your account it can't actually get that profile page. if you need to get to pages that require you to be logged in, i would suggest using something like curl, which allows php to simulate being a web browser and does all of the dirty work of creating and sending the cookie information.
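A rough sketch of the curl suggestion, assuming PHP's curl extension is installed. The function name fetch_with_cookies, the example URLs, and the form field names are all hypothetical; a real forum's login form will have its own URL and fields (and often a hidden token), so inspect it first and make sure the site permits automated access.

```php
<?php
// Sketch only: the function name, URLs and form fields used with it are
// placeholders; check the target site's actual login form before using this.
function fetch_with_cookies(string $url, string $cookieFile, ?array $postFields = null): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects, e.g. after a login
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies from earlier requests are sent from here
        CURLOPT_TIMEOUT        => 15,
    ]);
    if ($postFields !== null) {
        // Send the fields as a normal urlencoded form POST.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    // The handle is freed when the function returns, which writes the cookie jar.
    return $body;
}

// Hypothetical usage (made-up URLs and field names):
// $jar = tempnam(sys_get_temp_dir(), 'cookies');
// fetch_with_cookies('http://forum.example/login.php', $jar, ['username' => 'me', 'password' => 'secret']);
// $page_contents = fetch_with_cookies('http://forum.example/profile.php?id=1', $jar);
```

The first call logs in and stores the session cookie in the jar file; the second call reuses that file, so the site sees it as the same logged-in session.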

#9 techsmart

techsmart

    Neowinian

  • Joined: 24-May 10

Posted 24 May 2010 - 09:56

Use Automation Anywhere. It's a data extraction tool that can grab data from sites. I think berserk needs different types of data, like an image URL, forum posts, counts, etc. You can extract web data of any type using AA. I do it personally, hence I'm advising you to use this software. Go through this web data grabber's details.

#10 Kudos

Kudos

    Neowinian Senior

  • Joined: 19-October 04
  • Location: Ireland

Posted 24 May 2010 - 10:17

Using regular expressions for parsing HTML is bad practice. You should be using DOMXPath.

There's a gorgeous library that uses syntax like jQuery's selectors built on DOMXPath called CSSlib.
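For comparison, here is a sketch of the DOMXPath approach using PHP's built-in DOM extension. It assumes the count sits in a td whose text looks like "3,274 posts (4 per day)", as in the snippet quoted earlier in the thread; the function name extract_post_count is hypothetical, and the real page layout may differ.

```php
<?php
// Sketch only: assumes the count sits in a <td> whose text looks like
// "3,274 posts (4 per day)", as in the snippet quoted earlier in the thread.
function extract_post_count(string $html): ?int
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath finds the candidate cells; the small regex only reads the number
    // out of the cell's plain text, not out of the markup.
    foreach ($xpath->query("//td[contains(., ' posts')]") as $cell) {
        if (preg_match('/^\s*([0-9,]+) posts/', $cell->textContent, $m)) {
            return (int) str_replace(',', '', $m[1]);
        }
    }
    return null; // no matching cell found
}

var_dump(extract_post_count('<table><tr><td>3,274 posts (4 per day)</td></tr></table>'));
```

Because the parser deals with attributes, whitespace and nesting, a cell like `<td class="x">` would still be found here, whereas the earlier `/<td>.../` pattern would miss it.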

#11 OP Berserk87

Berserk87

    Neowinian Senior

  • Joined: 14-June 05
  • Location: BC, Canada

Posted 24 May 2010 - 10:28

WOW this thread is old!

Use Automation Anywhere. It's a data extraction tool that can grab data from sites. I think berserk needs different types of data, like an image URL, forum posts, counts, etc. You can extract web data of any type using AA. I do it personally, hence I'm advising you to use this software. Go through this web data grabber's details.


thanks for the suggestion, unless you're a spam bot.

#12 Kudos

Kudos

    Neowinian Senior

  • Joined: 19-October 04
  • Location: Ireland

Posted 24 May 2010 - 10:32

WOW this thread is old!


Haha, I should pay more attention :p

#13 techsmart

techsmart

    Neowinian

  • Joined: 24-May 10

Posted 25 May 2010 - 04:16

WOW this thread is old!



thanks for the suggestion, unless you're a spam bot.


No berserk, I'm not a spam bot, and neither is the software. I'm just a proactive user. I use it regularly dude, and believe me it works great! That's why I posted it, and I'm sure it will be of help to you also :)

#14 nvme

nvme

    Disgruntled Poster

  • Joined: 28-April 03

Posted 25 May 2010 - 22:19

Haha, I should pay more attention :p

lol you're telling me. i was reading my response above thinking "ew don't use regular expressions for that" then i realized i was the person who made the post. :laugh: