kingneil Posted April 17, 2014

In PHP, I've tried using simple_html_dom to extract URLs from web pages. It works a lot of the time, but not all of the time. For example, it doesn't work on ArsTechnica.com, because that site marks up its links differently.

One thing I do know is that Firefox gets a complete list of all links on a page; that's how every link on a loaded page ends up clickable. So I was wondering: is it possible to download the open-source Firefox browser engine, or Chrome, or whatever, pass it some parameters somehow, and have it give me a list of all URLs on the page? I could then feed that into PHP by whatever means, whether it's shell_exec() or something else. Is this possible? How do I do it?
Haggis Veteran Posted April 17, 2014

Not sure about the web browser engine, but with PHP you can use this:

<?php
$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    echo $url.'<br />';
}
?>
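A side note on the @ in front of loadHTML() above: real-world HTML is rarely well-formed, so loadHTML() emits warnings for pages like Ars Technica's. Rather than suppressing them with @, libxml's own error collection can be switched on. This sketch parses an inline HTML string (a stand-in for the fetched page) so it runs without network access:

```php
<?php
// Collect libxml parse warnings instead of suppressing them with @
libxml_use_internal_errors(true);

// Stand-in for the HTML fetched with file_get_contents()
$html = '<html><body><a href="/one">First</a> <a href="/two">Second</a></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Same XPath approach as above: grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate('/html/body//a');

$urls = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $urls[] = $hrefs->item($i)->getAttribute('href');
}

print_r($urls);

// Any warnings loadHTML() produced are available here instead of
// being printed into the page output
foreach (libxml_get_errors() as $error) {
    // e.g. write $error->message to a log
}
libxml_clear_errors();
```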
kingneil (Author) Posted April 17, 2014

That's some nice code, and it gathered all the arstechnica.com links. How do I use this code to grab the text overlay for the links? As in, the text that you click to make the link go to its destination?
Haggis Veteran Posted April 17, 2014

Not able to test here, but try either:

<?php
$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $title = $href->nodeValue;
    echo $url.'<br />';
    echo $title.'<br />';
}
?>

or:

<?php
$html = file_get_contents('http://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $title = $href->firstChild->nodeValue;
    echo $url.'<br />';
    echo $title.'<br />';
}
?>

Note the slight difference: the first reads $href->nodeValue, the second $href->firstChild->nodeValue.
Haggis Veteran Posted April 17, 2014

Managed to test it once I got home. Ignore the above; this works:

<?php
$html = file_get_contents('http://arstechnica.com/');

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $title = $href->firstChild->nodeValue;
    echo $url.'<br />';
    echo $title.'<br />';
}
?>
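One caveat with $href->firstChild->nodeValue: if the link's first child is an element rather than a text node (say the anchor wraps a <span> or an <img>), it only returns that first child's text, and it fails outright on an empty anchor, since firstChild is then null. The textContent property, which concatenates all descendant text of the node, may be the safer choice. A small sketch against an inline HTML string (not the Ars Technica markup itself) showing the difference:

```php
<?php
libxml_use_internal_errors(true);

// A link whose first child is an element, not a text node
$html = '<html><body><a href="/a"><span>Nested</span> text</a></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$link = $xpath->evaluate('/html/body//a')->item(0);

// firstChild is the <span>, so this yields only the span's text
$first = $link->firstChild->nodeValue;

// textContent concatenates all descendant text of the <a>
$all = $link->textContent;

echo $first, "\n";
echo $all, "\n";
```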
Haggis Veteran Posted April 18, 2014

Did it work? :)