kingneil Posted April 17, 2014

In PHP, I've tried using simple_html_dom to extract URLs from web pages. It works a lot of the time, but not all of the time. For example, it doesn't work on ArsTechnica.com, because that site marks up its links differently.

One thing I do know is that Firefox gets a complete list of all links on a page; that's how every link on a loaded page ends up clickable. So I was wondering: is it possible to download the open-source Firefox browser engine, or Chrome, or whatever, pass it some parameters somehow, and have it give me a list of all URLs on the page? I could then feed that into PHP by whatever means, whether it's shell_exec() or something else. Is this possible? How do I do it?
Haggis Veteran Posted April 17, 2014

Not sure about the web browser engine, but with PHP you can use this:

<?php
$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url.'<br />';
}
?>
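A side note on the @ in front of loadHTML() above: real-world HTML is rarely well-formed, so loadHTML() emits warnings for pages like Ars Technica's. Rather than suppressing them with @, libxml's own error collection can be switched on. This sketch parses an inline HTML string (a stand-in for the fetched page) so it runs without network access:

```php
<?php
// Collect libxml parse warnings instead of suppressing them with @
libxml_use_internal_errors(true);

// Stand-in for the HTML fetched with file_get_contents()
$html = '<html><body><a href="/one">First</a> <a href="/two">Second</a></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Same XPath approach as above: grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate('/html/body//a');

$urls = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $urls[] = $hrefs->item($i)->getAttribute('href');
}

print_r($urls);

// Any warnings loadHTML() produced are available here instead of
// being printed into the page output
foreach (libxml_get_errors() as $error) {
    // e.g. write $error->message to a log
}
libxml_clear_errors();
```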
kingneil (Author) Posted April 17, 2014

That's some nice code, and it gathered all the arstechnica.com links. How do I use this code to grab the text overlay for the links? As in, the text that you click to make the link go to its destination?
Haggis Veteran Posted April 17, 2014

Not able to test here, but try either:

<?php
$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $title = $href->nodeValue;
    echo $url.'<br />';
    echo $title.'<br />';
}
?>

or:

<?php
$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $title = $href->firstChild->nodeValue;
    echo $url.'<br />';
    echo $title.'<br />';
}
?>

Note the slight difference: the first reads $href->nodeValue, the second $href->firstChild->nodeValue.
Haggis Veteran Posted April 17, 2014

Managed to test it once I got home. Ignore the above; this works:

<?php
$html = file_get_contents('http://arstechnica.com/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $title = $href->firstChild->nodeValue;
    echo $url.'<br />';
    echo $title.'<br />';
}
?>
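One caveat with $href->firstChild->nodeValue: if the link's first child is an element rather than a text node (say the anchor wraps a <span> or an <img>), it only returns that first child's text, and it fails outright on an empty anchor, since firstChild is then null. The textContent property, which concatenates all descendant text of the node, may be the safer choice. A small sketch against an inline HTML string (not the Ars Technica markup itself) showing the difference:

```php
<?php
libxml_use_internal_errors(true);

// A link whose first child is an element, not a text node
$html = '<html><body><a href="/a"><span>Nested</span> text</a></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$link = $xpath->evaluate('/html/body//a')->item(0);

// firstChild is the <span>, so this yields only the span's text
$first = $link->firstChild->nodeValue;

// textContent concatenates all descendant text of the <a>
$all = $link->textContent;

echo $first, "\n";
echo $all, "\n";
```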
Haggis Veteran Posted April 18, 2014

Did it work? :)