• 0

Can't PHP screen scrape


Question

Hey guys! On my website you all helped me display the text from another website. When the source website changed, you all helped me fix it so it would work again. It's broken once again so I decided I would try to learn it on my own but I can't figure it out. Could someone post the code and then explain what's going on by any chance? :cry:

So this is the site and I would like to display the downloads from that webpage. I would like it to load the downloads everytime someone loads MY webpage so the downloads are always current.

https://addons.mozilla.org/en-US/firefox/addon/strike-49896/

Thanks again Neowinians!! :yes:

Link to comment
https://www.neowin.net/forum/topic/973286-cant-php-screen-scrape/
Share on other sites

16 answers to this question

Recommended Posts

  • 0

It is very easy using php oop.

$data = file_get_contents('https://addons.mozilla.org/en-US/firefox/addon/strike-49896/');

$html = new DOMDocument();

@$html->loadHTML($data);

foreach($html->getElementsByTagName('strong') as $strong):
        if ($strong->getAttribute('class') === 'downloads'):
                echo $strong->childNodes[0]->nodeValue;
        endif;
endforeach;

  • 0

<?php
$data = file_get_contents('https://addons.mozilla.org/en-US/firefox/addon/strike-49896/');

$html = new DOMDocument();

@$html->loadHTML($data);

foreach($html->getElementsByTagName('strong') as $strong):
        if ($strong->getAttribute('class') === 'downloads'):
                echo $strong->childNodes[0]->nodeValue;
        endif;
endforeach;
?>

I added this but it didn't work :(

http://firefox.thechillroom.com/strike.php

  • 0

YQL can make this easier...

Check out %27"]this query.

<?php
try{
  $response = new SimpleXMLElement(
    "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22https%3A%2F%2Faddons.mozilla.org%2Fen-US%2Ffirefox%2Faddon%2Fstrike-49896%2F%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fstrong%5B%40class%20%3D%20%22downloads%22%5D    null,
    true
  );
}catch(Exception $e){
  $response = null;
}

$count = '0';

if(null !== $response && 1 === count($response->results)){
  $count = (string)$response->results->strong;
}

echo $count;

It's untested, so you may have to tweak.

  • 0

Yeah I tested it after seeing your post and indeed it does not work. file_get_contents does not support https. There is some tweaks to get it to work but at this point it is not worth the effort. Anthony's approach is much better.

  On 04/02/2011 at 03:48, thatguyandrew1992 said:

<?php
$data = file_get_contents('https://addons.mozilla.org/en-US/firefox/addon/strike-49896/');

$html = new DOMDocument();

@$html->loadHTML($data);

foreach($html->getElementsByTagName('strong') as $strong):
        if ($strong->getAttribute('class') === 'downloads'):
                echo $strong->childNodes[0]->nodeValue;
        endif;
endforeach;
?>

I added this but it didn't work :(

http://firefox.thechillroom.com/strike.php

  • 0
  On 04/02/2011 at 11:54, AnthonySterling said:

YQL can make this easier...

Check out %27"]this query.


$count = '0';

It's untested, so you may have to tweak.

I appreciate your help very much. I tried it out. But that line is what defines the number that appears instead of what is from the mozilla site, and I'm not sure how to fix it :blush:

  • 0
  On 04/02/2011 at 22:04, sweetsam said:

I tried what Anthony posted and it works perfectly and the output matches the number on Mozilla's website. You might wanna check the code and make sure there are not line breaks where there shouldn't be any.

You were right! Copying it from neowin caused some line breaks! Thanks for all the help guys!!! :D

  • 0
  On 05/02/2011 at 02:40, sweetsam said:

I would recommend storing the data locally and updating it once a day. Do not load the data remotely every time your page loads because it might get you banned.

I see. I don't know how to do that though. :unsure:

  • 0

Try this...

		$file = 'count.txt';

		if ( file_exists($file) and time() - filemtime($file) < 3600):
			$count = file_get_contents($file);
		else:
			try{
				$response = new SimpleXMLElement("http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22https%3A%2F%2Faddons.mozilla.org%2Fen-US%2Ffirefox%2Faddon%2Fstrike-49896%2F%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fstrong%5B%40class%20%3D%20%22downloads%22%5Dnull, true);
			}catch(Exception $e){
				$response = null;
			}

			$count = '0';

			if(null !== $response && 1 === count($response->results)):
			  $count = (string)$response->results->strong;
			endif;

			file_put_contents($file, $count);
		endif;

		echo $count;

  • 0
  On 05/02/2011 at 17:23, sweetsam said:

Try this...

		$file = 'count.txt';

		if ( file_exists($file) and time() - filemtime($file) < 3600):
			$count = file_get_contents($file);
		else:
			try{
				$response = new SimpleXMLElement("http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22https%3A%2F%2Faddons.mozilla.org%2Fen-US%2Ffirefox%2Faddon%2Fstrike-49896%2F%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fstrong%5B%40class%20%3D%20%22downloads%22%5Dnull, true);
			}catch(Exception $e){
				$response = null;
			}

			$count = '0';

			if(null !== $response && 1 === count($response->results)):
			  $count = (string)$response->results->strong;
			endif;

			file_put_contents($file, $count);
		endif;

		echo $count;

Now I would use the code for multiple pages on my website. Should I change count.txt to countstrike.txt. Then have countsky.txt and countroyalblue.txt etc on my other pages?

  • 0

Elaborating on SweetSam's example, you could create two handy-dandy functions to help.

function get_stats($id){
  try{
    $response = new SimpleXMLElement(
      "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22https%3A%2F%2Faddons.mozilla.org%2Fen-US%2Ffirefox%2Faddon%2F" . $id . "%2F%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fstrong%5B%40class%20%3D%20%22downloads%22%5D'",
      null,
      true
    );
  }catch(Exception $e){
    $response = null;
  }

  $count = '0';

  if(null !== $response && 1 === count($response->results)){
    $count = (string)$response->results->strong;
  }

  return $count;
}

function get_cached_stats($id, $lifetime = 3600){
  $file = $id . '_stats_cache.txt';
  if(is_readable($file) && time() - filemtime($file) < (int)$lifetime){
    return file_get_contents($file);
  }
  $data = get_stats($id);
  file_put_contents($file, $data);
  return $data;
}

This topic is now closed to further replies.
  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Posts

    • It's likely going to be the on-board models that have been finetuned to perform all the Galaxy AI processing needs, thus eliminating any need for the cloud/subscription...
    • They did remove the Start button with Windows 8, but everyone had their pitchforks out. That OS had its many issues, but you could work on a desktop environment without the Start. Safe to say not many liked it.
    • I always just assumed that how I start programs is pretty close to how most people do it, which is... 1)Either start a program from desktop icon. or... 2)Windows key (or mouse click on start menu) and when start menu appears type in a little of what program you are trying to find, it will highlight, then press enter (or find it with mouse pointer and click it) is the very basic idea. this is very basic core functionality I would never change since it's quick and to the point and just works and has been this way a long time now. NOTE: I am on Linux Mint 22.1-Xfce (I am using the 'dark' menu since the default light grey is a bit out of place with the rest of Mint's darker look), but the very basics like this are the same as Windows in this regard. I like how Mint tends to pretty much stay the same (minor tweaks from point release to point release that are slight but overall it's pretty much the same) instead of change for the sake of change like Windows does and ends up making some stuff potentially worse as a result. I say screw all of those trends where something has to 'look current' as I am more of the mindset once something looks good enough, which Mint does (along with say Windows 7's core interface and the like), you don't screw with it as if someone does want to mess with it, they can do their own custom tweaks on the side, but the base install should be like that 'old faithful' type of interface that everyone has been familiar with for decades now. so by that standard the 'Start Menu' is still useful. I would NEVER get rid of that core functionality as Win8 pretty much tried that upon release and it made doing VERY basic stuff a chore which is why after I briefly tried Win8 in a VM, I never bothered with that OS again as that was easily Microsoft's biggest mess up with interface changes and I have been using Windows since v3.11 in mid-1990's and that Win8 interface change was by far the biggest mess up from Microsoft (how that made it to the final product is beyond me). I realize they supposedly fixed it in Win 8.1, but by then no one really cared as Win7 was the standard and those moving on from that went to Win10.
    • i click a few things on the start menu, other wise I do still use the run box daily.
  • Recent Achievements

    • Week One Done
      Wayne Robinson earned a badge
      Week One Done
    • One Month Later
      Karan Khanna earned a badge
      One Month Later
    • Week One Done
      Karan Khanna earned a badge
      Week One Done
    • First Post
      MikeK13 earned a badge
      First Post
    • Week One Done
      OHI Accounting earned a badge
      Week One Done
  • Popular Contributors

    1. 1
      +primortal
      679
    2. 2
      ATLien_0
      275
    3. 3
      Michael Scrip
      207
    4. 4
      +FloatingFatMan
      171
    5. 5
      Steven P.
      148
  • Tell a friend

    Love Neowin? Tell a friend!