Use web browser engine to extract URLs from a page?

5 replies to this topic

#1 kingneil

kingneil

    Neowinian

  • Joined: 17-April 14

Posted 17 April 2014 - 12:28

In PHP, I've tried using simple_html_dom to extract URLs from web pages.

It works a lot of the time, but not all of the time.

For example, it doesn't work on ArsTechnica.com, because that site seems to mark up its links differently.

One thing I do know is that Firefox reliably finds every link on a page, which is why all the links are clickable when you load a page in it.

So I was wondering: is it possible to download the open source Firefox browser engine (or Chrome's, or whatever), pass it some parameters, and get back a list of all the URLs on the page?

I could then feed that into PHP by whatever means, whether it's shell_exec() or something else.

Is this possible? How do I do it?






#2 Haggis

Haggis

    Neowinian Senior

  • Tech Issues Solved: 14
  • Joined: 13-June 07
  • Location: Near Stirling, Scotland
  • OS: Debian 7
  • Phone: Samsung Galaxy S3 LTE (i9305)

Posted 17 April 2014 - 12:33

Not sure about the web browser engine, but with PHP you can use this:

<?php

$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       echo $url.'<br />';
}


?>
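
For anyone who wants to try the approach above without fetching a live page, here is a minimal sketch of the same DOMDocument/DOMXPath idea wrapped in a function. The function name `extract_links` and the sample markup are made up for illustration and are not from the thread. It also uses `libxml_use_internal_errors()` instead of the `@` operator to keep parse warnings quiet, which is a swapped-in choice rather than what the post does.

```php
<?php
// Hypothetical helper (not from the thread): returns every href found in an HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $urls = [];

    // //a[@href] matches anchors anywhere in the document that carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $urls[] = $anchor->getAttribute('href');
    }

    return $urls;
}

// Inline sample markup (an assumption), so the sketch runs without a network request.
$sample = '<html><body>'
    . '<a href="/one">One</a>'
    . '<p><a href="http://example.com/two">Two</a></p>'
    . '<a name="top">No href here</a>'
    . '</body></html>';

print_r(extract_links($sample));
```

To use it on a real page, pass the result of `file_get_contents()` in place of `$sample`. Relative URLs such as `/one` come back exactly as written in the page, so they would still need to be resolved against the page address.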


#3 OP kingneil


Posted 17 April 2014 - 14:47

That's some nice code, and it gathered all the arstechnica.com links.

How do I use this code to grab the link text as well? As in, the text that you click to follow the link?



#4 Haggis


Posted 17 April 2014 - 15:58

Not able to test here, but try either

<?php

$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       // nodeValue is a property, not a method, so it takes no arguments
       $title = $href->nodeValue;
       echo $url.'<br />';
       echo $title.'<br />';
}


?>

or

<?php

$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       // nodeValue is a property, not a method, so it takes no arguments
       $title = $href->firstChild->nodeValue;
       echo $url.'<br />';
       echo $title.'<br />';
}


?>

Note the slight difference: the first reads nodeValue from the link itself, the second from the link's first child.



#5 Haggis


Posted 17 April 2014 - 20:10   Best Answer

Managed to test it once I got home.

Ignore the above.

This works:

<?php

$html = file_get_contents('http://arstechnica.com/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       // firstChild is null for an empty <a></a>, so guard before reading it
       $title = $href->firstChild ? $href->firstChild->nodeValue : '';
       echo $url.'<br />';
       echo $title.'<br />';
}


?>
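
One thing to watch with `firstChild->nodeValue` is a link whose text is split across nested tags, since only the first child gets read. A minimal sketch of an alternative, assuming you want the whole visible text of each link: read `textContent` on the `<a>` element itself. The function name `extract_link_text` and the sample markup are invented for illustration and are not from the thread.

```php
<?php
// Hypothetical helper (not from the thread): pairs each link's href with its visible text.
function extract_link_text(string $html): array
{
    $dom = new DOMDocument();

    // Keep libxml parse warnings out of the output without using @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $links[] = [
            'url'  => $anchor->getAttribute('href'),
            // textContent joins the text of all descendants, so it also
            // covers text that sits inside <span>, <b> and similar tags.
            'text' => trim($anchor->textContent),
        ];
    }

    return $links;
}

// Inline sample markup (an assumption): plain text, nested text, and an image-only link.
$sample = '<html><body>'
    . '<a href="/plain">Plain text</a>'
    . '<a href="/nested"><span>Nested</span> text</a>'
    . '<a href="/image"><img src="x.png" alt="pic"></a>'
    . '</body></html>';

foreach (extract_link_text($sample) as $link) {
    echo $link['url'] . ' => ' . $link['text'] . "\n";
}
```

The image-only link comes back with an empty text string, because an `<img>` has no text content; reading its `alt` attribute would be a separate step.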



#6 Haggis


Posted 18 April 2014 - 17:59

Did it work? :)