Use web browser engine to extract URLs from a page?



In PHP, I've tried using simple_html_dom to extract URLs from web pages.

It works a lot of the time, but not all of the time.

For example, it doesn't work on ArsTechnica.com, because that site marks up its links differently.
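For context, the approach I've been using looks roughly like this (a sketch, assuming the library's usual file_get_html()/find() API and that simple_html_dom.php is on the include path):

<?php

// a sketch, assuming the standard simple_html_dom distribution
include('simple_html_dom.php');

// download and parse the page
$html = file_get_html('http://www.example.com');

// find every anchor and print its href attribute
foreach ($html->find('a') as $element) {
       echo $element->href.'<br />';
}

?>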

One thing I do know is that Firefox gets a complete list of all the links on a page; that's how every link ends up clickable when you load the page in the browser.

So I was wondering: is it possible to download the open source Firefox browser engine, or Chrome, or whatever, pass it some parameters somehow, and have it hand me a list of all the URLs on the page?

I could then feed that into PHP by whatever means, whether it's shell_exec() or something else.

Is this possible? How do I do it?
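Something like this is what I'm imagining (an untested sketch: it assumes a headless-capable Chromium binary called chromium is on the PATH, and uses its --headless --dump-dom flags to print the rendered page, which PHP then parses):

<?php

// untested sketch: assumes a Chromium binary named "chromium" is on the PATH;
// --headless --dump-dom prints the rendered DOM to stdout
$url = 'http://www.example.com';
$cmd = 'chromium --headless --dump-dom ' . escapeshellarg($url);
$html = shell_exec($cmd);

// parse whatever the browser rendered and pull out the anchors
$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       echo $hrefs->item($i)->getAttribute('href').'<br />';
}

?>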


Not sure about the web browser engine, but with PHP you can use this:

<?php

$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

// print the href attribute of each anchor
for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       echo $url.'<br />';
}


?>

That's some nice code, and it gathered all the arstechnica.com links.

How do I use this code to also grab the link text? As in, the text that you click on to make the link go to its destination?


Not able to test here, but try either:

<?php

$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       $title = $href->nodeValue($i);
       echo $url.'<br />';
       echo $title.'<br />';
}


?>

or

<?php

$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       $title = $href->firstChild->nodeValue($i);
       echo $url.'<br />';
       echo $title.'<br />';
}


?>

Note the slight difference between the two: one reads the text from the anchor node itself, the other from its first child node.


Managed to test it once I got home.

Ignore the above; this works:

<?php

$html = file_get_contents('http://arstechnica.com/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

// print each anchor's href attribute and its link text
for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       $title = $href->firstChild->nodeValue;
       echo $url.'<br />';
       echo $title.'<br />';
}


?>
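
One caveat with the version above: $href->firstChild->nodeValue assumes every anchor starts with a text node, so image-only or empty links can trip it up, and it misses text nested inside child tags. A slightly more defensive sketch of the same loop (my variation, not the posted code) reads the textContent property instead:

<?php

$html = file_get_contents('http://arstechnica.com/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the anchor elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');
       // textContent concatenates all text inside the anchor, including text
       // nested in child tags, and is simply '' for an empty anchor
       $title = trim($href->textContent);
       echo $url.'<br />';
       echo $title.'<br />';
}

?>

textContent is usually closer to the clickable label you actually see in the browser.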
