
PHP Scraping


Question

Hi All,

For a project at work I'm trying to use PHP scraping to get the recently added titles from a library catalog system. (See here)

The only problem is that the recently added section seems to load after everything else, which I guess means some other script is requesting this information, hence the delay. When I use file_get_contents to retrieve the page from http://encore.exeter.ac.uk/iii/encore/search/C%7CSlaw%7COrightresult%7CU1?lang=eng&suite=def, all I get is "loading suggestions".

I'm new to PHP scraping, so any help is welcome.

Thanks


3 answers to this question


That particular data is loaded via an AJAX request. If you use Firefox and Firebug you can see the requests that get fired after the initial page content is loaded. One of them returns the Recently Added listings. The URL for this data is:

http://encore.exeter...e=1293032634108

However, clicking that link simply returns an error. For it to work, it looks like the server requires a valid Referer header, along with a dojo-ajax-request header and a valid session id.

We can send the required headers using the third argument of PHP's file_get_contents, which takes a stream context. Build an array of options, pass it to stream_context_create, and pass the resulting context to file_get_contents. That's the easy bit, but to get a session ID you first need to make a request to the main page and grab the session ID from the response headers. Fortunately PHP makes this relatively easy too (Hurrah!). Below is what I've managed to knock together in about 30 minutes. Hopefully it will give you what you need:

Had to attach the script as Neowin wouldn't let me post with it inserted into the Post Reply field. Odd.

exeter.php
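
The attachment itself is not reproduced here, so what follows is not the original exeter.php, only a minimal sketch of the approach described above: fetch the main page, read the session cookie from the response headers, then request the AJAX URL with the extra headers. Several things are assumptions: the cookie name (JSESSIONID is a guess), the value sent with the dojo-ajax-request header, and the function names, which are made up for illustration. The AJAX URL is truncated in the post, so it is left as a parameter you would fill in from Firebug.

```php
<?php
// Sketch only. The cookie name, the dojo-ajax-request value and all
// function names are assumptions, not taken from the original script.

/**
 * Find a cookie value in raw response header lines, i.e. the format
 * PHP puts into $http_response_header after an HTTP request.
 */
function extractCookie(array $headerLines, string $cookieName): ?string
{
    // Matches e.g. "Set-Cookie: JSESSIONID=abc123; Path=/"
    $pattern = '/^Set-Cookie:\s*' . preg_quote($cookieName, '/') . '=([^;\s]+)/i';
    foreach ($headerLines as $line) {
        if (preg_match($pattern, $line, $m)) {
            return $m[1];
        }
    }
    return null;
}

/**
 * Build the options array for stream_context_create with the headers
 * the server appears to require.
 */
function buildAjaxOptions(string $referer, string $cookieName, string $sessionId): array
{
    $headers = [
        'Referer: ' . $referer,
        'dojo-ajax-request: true', // assumed value
        'Cookie: ' . $cookieName . '=' . $sessionId,
    ];
    return [
        'http' => [
            'method' => 'GET',
            'header' => implode("\r\n", $headers) . "\r\n",
        ],
    ];
}

/**
 * Two-step fetch: main page for the session cookie, then the AJAX URL.
 * Returns the response body, or null if either step fails.
 */
function fetchRecentlyAdded(string $mainUrl, string $ajaxUrl, string $cookieName = 'JSESSIONID'): ?string
{
    // Step 1: PHP fills $http_response_header in this scope after the request.
    if (@file_get_contents($mainUrl) === false) {
        return null;
    }
    $sessionId = extractCookie($http_response_header ?? [], $cookieName);
    if ($sessionId === null) {
        return null;
    }

    // Step 2: the context is the third argument of file_get_contents.
    $context = stream_context_create(buildAjaxOptions($mainUrl, $cookieName, $sessionId));
    $body = @file_get_contents($ajaxUrl, false, $context);
    return $body === false ? null : $body;
}
```

To try it, call fetchRecentlyAdded() with the catalogue URL from the question as the first argument and the full AJAX URL copied from Firebug as the second. If it returns null, check the response headers of the main page for the real cookie name and pass that as the third argument.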
