Can't PHP screen scrape


Question

Hey guys! On my website you all helped me display the text from another website. When the source website changed, you all helped me fix it so it would work again. It's broken once again, so I decided I would try to learn it on my own, but I can't figure it out. Could someone post the code and then explain what's going on by any chance? :cry:

This is the site, and I would like to display the download count from that page. I would like it to load the count every time someone loads MY webpage so the number is always current.

https://addons.mozilla.org/en-US/firefox/addon/strike-49896/

Thanks again Neowinians!! :yes:

https://www.neowin.net/forum/topic/973286-cant-php-screen-scrape/

16 answers to this question

  • 0

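It is very easy using PHP's object-oriented DOM classes.

$data = file_get_contents('https://addons.mozilla.org/en-US/firefox/addon/strike-49896/');

$html = new DOMDocument();

// Real-world pages are rarely valid HTML, so suppress the parser warnings.
@$html->loadHTML($data);

// Print the text of every <strong class="downloads"> element.
foreach($html->getElementsByTagName('strong') as $strong):
        if ($strong->getAttribute('class') === 'downloads'):
                // textContent also works when the element has no child nodes.
                echo $strong->textContent;
        endif;
endforeach;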
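
If it helps, here is a minimal sketch of the same lookup done with DOMXPath, run against a made-up HTML string instead of the live page so the parsing step can be tried on its own. The function name `extract_download_count` and the sample markup are assumptions for illustration only.

```php
<?php
// Hypothetical helper: returns the text of the first <strong> element whose
// class attribute is exactly "downloads", or null if there is none.
function extract_download_count(string $htmlSource): ?string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($htmlSource);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//strong[@class="downloads"]');

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

// Made-up sample markup standing in for the real add-on page.
$sample = '<html><body><p><strong class="downloads">1,234</strong> downloads</p></body></html>';

echo extract_download_count($sample), "\n";
```

Returning null when nothing matches makes it easy to tell "the page layout changed" apart from a genuine count.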

  • 0

<?php
$data = file_get_contents('https://addons.mozilla.org/en-US/firefox/addon/strike-49896/');

$html = new DOMDocument();

@$html->loadHTML($data);

foreach($html->getElementsByTagName('strong') as $strong):
        if ($strong->getAttribute('class') === 'downloads'):
                echo $strong->childNodes[0]->nodeValue;
        endif;
endforeach;
?>

I added this but it didn't work :(

http://firefox.thechillroom.com/strike.php

  • 0

YQL can make this easier...

Check out this query.

<?php
try{
  $response = new SimpleXMLElement(
    "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22https%3A%2F%2Faddons.mozilla.org%2Fen-US%2Ffirefox%2Faddon%2Fstrike-49896%2F%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fstrong%5B%40class%20%3D%20%22downloads%22%5D'",
    null,
    true
  );
}catch(Exception $e){
  $response = null;
}

$count = '0';

if(null !== $response && 1 === count($response->results)){
  $count = (string)$response->results->strong;
}

echo $count;

It's untested, so you may have to tweak.
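
Since the hand-encoded URL is easy to break when copying, here is a rough sketch of building it from its readable parts instead. The endpoint and the query shape come from the code above; the helper name `build_yql_url` is an assumption of mine, and the sketch only builds the string without contacting the service.

```php
<?php
// Hypothetical helper: builds the YQL request URL for a page and an XPath.
function build_yql_url(string $pageUrl, string $xpath): string
{
    // Same query shape as above: the page URL goes in double quotes,
    // the XPath expression in single quotes.
    $query = sprintf("select * from html where url=\"%s\" and xpath='%s'", $pageUrl, $xpath);

    // http_build_query() percent-encodes the value, so nothing is typed by hand.
    return 'http://query.yahooapis.com/v1/public/yql?'
        . http_build_query(['q' => $query], '', '&', PHP_QUERY_RFC3986);
}

echo build_yql_url(
    'https://addons.mozilla.org/en-US/firefox/addon/strike-49896/',
    '//strong[@class = "downloads"]'
), "\n";
```

The result is one long line with no spaces or line breaks in it, which is exactly what has to survive a copy and paste.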

  • 0

Yeah, I tested it after seeing your post and indeed it does not work. file_get_contents does not support https unless PHP's openssl extension is enabled. There are some tweaks to get it to work, but at this point it is not worth the effort. Anthony's approach is much better.
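
For anyone who wants to check their own host, this is a small sketch of how one might test whether https URLs can be opened at all. The function name `can_fetch_https` is made up for illustration; the three checks (openssl loaded, https wrapper registered, allow_url_fopen on) are the usual preconditions as far as I know.

```php
<?php
// Hypothetical helper: reports whether this PHP setup can open https:// URLs
// with file_get_contents() and similar stream functions.
function can_fetch_https(): bool
{
    // The https stream wrapper is only registered when the openssl
    // extension is loaded, and remote URLs need allow_url_fopen.
    return extension_loaded('openssl')
        && in_array('https', stream_get_wrappers(), true)
        && filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

echo can_fetch_https()
    ? "https fetching should work\n"
    : "https fetching is not available on this setup\n";
```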

  • 0
  On 04/02/2011 at 11:54, AnthonySterling said:

YQL can make this easier...

Check out this query.


$count = '0';

It's untested, so you may have to tweak.

I appreciate your help very much. I tried it out, but that line is what sets the number that appears, instead of the number from the Mozilla site, and I'm not sure how to fix it :blush:

  • 0
  On 04/02/2011 at 22:04, sweetsam said:

I tried what Anthony posted and it works perfectly, and the output matches the number on Mozilla's website. You might wanna check the code and make sure there are no line breaks where there shouldn't be any.

You were right! Copying it from Neowin caused some line breaks! Thanks for all the help guys!!! :D

  • 0
  On 05/02/2011 at 02:40, sweetsam said:

I would recommend storing the data locally and updating it once a day. Do not load the data remotely every time your page loads because it might get you banned.

I see. I don't know how to do that though. :unsure:

  • 0

Try this...

		$file = 'count.txt';

		if ( file_exists($file) and time() - filemtime($file) < 3600):
			$count = file_get_contents($file);
		else:
			try{
				$response = new SimpleXMLElement("http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22https%3A%2F%2Faddons.mozilla.org%2Fen-US%2Ffirefox%2Faddon%2Fstrike-49896%2F%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fstrong%5B%40class%20%3D%20%22downloads%22%5D'", null, true);
			}catch(Exception $e){
				$response = null;
			}

			$count = '0';

			if(null !== $response && 1 === count($response->results)):
			  $count = (string)$response->results->strong;
			endif;

			file_put_contents($file, $count);
		endif;

		echo $count;
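
One thing the snippet above does is write '0' into the cache whenever the remote lookup fails, so a brief outage shows as zero downloads for an hour. A possible variation, sketched here under my own assumptions, keeps serving the old number in that case. The name `cached_value` and the convention that the fetcher returns null on failure are hypothetical, and the fetcher in the usage line is a stand-in rather than a real remote call.

```php
<?php
// Hypothetical helper: returns a cached value, refreshing it through $fetch
// when the cache file is missing or older than $lifetime seconds.
function cached_value(string $file, int $lifetime, callable $fetch): string
{
    $hasCache = is_readable($file);

    if ($hasCache && time() - filemtime($file) < $lifetime) {
        return file_get_contents($file);
    }

    // $fetch is assumed to return the new value, or null when the lookup failed.
    $fresh = $fetch();

    if ($fresh === null) {
        // Keep serving the stale number instead of caching a bogus '0'.
        return $hasCache ? file_get_contents($file) : '0';
    }

    // LOCK_EX stops two simultaneous page loads from interleaving their writes.
    file_put_contents($file, $fresh, LOCK_EX);

    return $fresh;
}

// Usage with a stand-in fetcher; a real one would do the remote lookup.
$file = sys_get_temp_dir() . '/count_demo.txt';
echo cached_value($file, 3600, function () { return '1,234'; }), "\n";
```

Passing 86400 as the lifetime would give the once-a-day refresh suggested earlier.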

  • 0
Now I would like to use the code on multiple pages of my website. Should I change count.txt to countstrike.txt, and then have countsky.txt, countroyalblue.txt, etc. on my other pages?

  • 0

Elaborating on SweetSam's example, you could create two handy-dandy functions to help.

function get_stats($id){
  try{
    $response = new SimpleXMLElement(
      "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22https%3A%2F%2Faddons.mozilla.org%2Fen-US%2Ffirefox%2Faddon%2F" . $id . "%2F%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fstrong%5B%40class%20%3D%20%22downloads%22%5D'",
      null,
      true
    );
  }catch(Exception $e){
    $response = null;
  }

  $count = '0';

  if(null !== $response && 1 === count($response->results)){
    $count = (string)$response->results->strong;
  }

  return $count;
}

function get_cached_stats($id, $lifetime = 3600){
  $file = $id . '_stats_cache.txt';
  if(is_readable($file) && time() - filemtime($file) < (int)$lifetime){
    return file_get_contents($file);
  }
  $data = get_stats($id);
  file_put_contents($file, $data);
  return $data;
}
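
With these, each page only needs its own add-on id, for example get_cached_stats('strike-49896'), and the cache file name follows from the id. Because the id ends up in a file name, it may be worth checking it first if it could ever come from user input. A minimal sketch of that idea follows; the helper name `stats_cache_path`, the allowed character set and the `/tmp/cache` directory are all assumptions of mine.

```php
<?php
// Hypothetical helper: maps an add-on id to a cache file inside $dir,
// rejecting ids that contain anything but letters, digits, '-' and '_'.
function stats_cache_path(string $id, string $dir): string
{
    if (preg_match('/\A[A-Za-z0-9_-]+\z/', $id) !== 1) {
        throw new InvalidArgumentException("Unexpected add-on id: $id");
    }

    return rtrim($dir, '/') . '/' . $id . '_stats_cache.txt';
}

echo stats_cache_path('strike-49896', '/tmp/cache'), "\n";
```

An id such as '../something' is refused instead of quietly writing outside the cache directory.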

This topic is now closed to further replies.