simple_html_dom: simple use-case - to get back data for storing in SQLite db

March 1, 2020

Hello dear experts and friends on Neowin,

I am fairly new to simple_html_dom and its methods; I only know the parser a little.

I want to gather some information from this site:

https://europa.eu/youth/volunteering/organisations_en#open

Is it possible to get the content of, let us say, the last 10 or 20 records on that page, and subsequently store it in my MySQL database?

<?php
// Report all PHP errors (see changelog)
error_reporting(E_ALL);

include('inc/simple_html_dom.php');

//base url
$base = 'https://europa.eu/youth/volunteering/organisations_en#open';

//home page HTML
$html_base = file_get_html( $base );

//file_get_html() returns false when the page could not be loaded
if ( $html_base === false ) {
    exit('Could not load ' . $base);
}

//print the href of every link on the page
foreach($html_base->find('a') as $element) {
    echo "<pre>";
    print_r( $element->href );
    echo "</pre>";
}

//free the memory used by the parsed document
$html_base->clear();
unset($html_base);

?>

I have the above code and I'm trying to get certain elements of the page but it isn't returning anything.

Is it possible that certain PHP functions might be disabled on the server to stop that?

The above code works perfectly on other sites.

Is there any workaround?

By the way, I have created a small snippet as a proof of concept that does this with Python and BeautifulSoup:


import requests
from bs4 import BeautifulSoup
 
url = 'https://europa.eu/youth/volunteering/organisations_en#open'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
block = soup.find('div', class_="eyp-card block-is-flex")

and this is what I get back:

European Youth Portal
>>> block.a
<a href="/youth/volunteering/organisation/48592_en" target="_blank">"Academy for Peace and Development" Union</a>
>>> block.a.text
'"Academy for Peace and Development" Union'
 
>>> block.select_one('div > div > p:nth-child(9)')
<p><strong>PIC:</strong> 948417016</p>
>>> block.select_one('div > div > p:nth-child(9)').text
'PIC: 948417016'

What I am aiming for in the end: I want to gather the first 20 results of the page and put them into an SQL database, or alternatively show the information in a little widget.
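A minimal sketch of that last step, in Python to stay with the proof of concept above. It uses only the standard library (html.parser instead of BeautifulSoup, sqlite3 for storage). The class name OrgLinkParser, the function store_records and the table name organisations are made up for illustration, and the link prefix is an assumption taken from the single link printed in the session above.

```python
import sqlite3
from html.parser import HTMLParser

# Assumption: every organisation link starts with this path, as the one
# link printed in the BeautifulSoup session does.
ORG_PREFIX = "/youth/volunteering/organisation/"


class OrgLinkParser(HTMLParser):
    """Collects (href, link text) pairs for organisation links."""

    def __init__(self, limit=20):
        super().__init__()
        self.limit = limit
        self.records = []   # finished (href, name) tuples
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith(ORG_PREFIX):
            self._href = href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if len(self.records) < self.limit:
                self.records.append((self._href, "".join(self._text).strip()))
            self._href = None


def store_records(conn, records):
    """Create the table if needed and insert the records, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS organisations ("
        "href TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO organisations (href, name) VALUES (?, ?)", records
    )
    conn.commit()


# Sample markup shaped like the card shown above; in real use this string
# would be response.text from the requests call.
sample_html = '''
<div class="eyp-card block-is-flex">
  <a href="/youth/volunteering/organisation/48592_en" target="_blank">"Academy for Peace and Development" Union</a>
  <p><strong>PIC:</strong> 948417016</p>
</div>
<a href="/youth/other">not an organisation</a>
'''

parser = OrgLinkParser(limit=20)
parser.feed(sample_html)

conn = sqlite3.connect(":memory:")  # pass a file name instead to keep the data
store_records(conn, parser.records)
rows = conn.execute("SELECT href, name FROM organisations").fetchall()
print(rows)
```

Using the href as primary key together with INSERT OR IGNORE means the script can be run again without creating duplicate rows.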

March 2, 2020

Correct, file_get_html is not a native PHP function. It is defined in your included file, which cannot be found, so you need to fix that first.

You should immediately be able to see whether or not the file exists by checking the inc folder in your project root.

March 2, 2020

As the error states, simple_html_dom.php either doesn't exist or isn't in the right location.

March 5, 2020

Simply change the include line (line 5) to

include('includes/simple_html_dom.php');

March 2, 2020

Hi there, good day.

After the first try I made another one; I still get the following errors:

Warning: include(inc/simple_html_dom.php): failed to open stream: No such file or directory in [...][...] on line 5
Warning: include(): Failed opening 'inc/simple_html_dom.php' for inclusion (include_path='.:') in [...][...] on line 5
Fatal error: Uncaught Error: Call to undefined function file_get_html() in [...][...]:11
Stack trace:
#0 {main}
  thrown in [...][...] on line 11

and this one:


FATAL ERROR syntax error, unexpected '<', expecting end of file on line number 1

Hmm, I think I have to make some corrections. I have to investigate the target to find out what is missing and what I have to correct in my test code.

I will come back later today.

regards

March 2, 2020

I received the annoying errors about including files. I need more insight into parsing a DOM, and I have to check whether the included file exists.

March 2, 2020

Hi there, good day dear @+virtorio, and good day dear @Rix,

First of all, many thanks for the input. I am very glad to hear from you both.

Agreed: file_get_html is not a native PHP function. It is defined in my included file, which cannot be found, so I need to fix that first.
I checked the inc folder in the project root and the file does not exist there. I am going to fix it.

Note: I made the first test runs on online systems like the following:
- PHP Sandbox, test PHP online, PHP tester: sandbox.onlinephpfunctions.com
- PHPTESTER - Test PHP code online: phptester.net
- Online PHP Editor | Online editor and compiler: paiza.io

So yes, this could not work there. It had to go wrong and fail.

Now I am happy that you encouraged me to go in the right direction.

On my machine (Windows 7) I have installed Atom with the PHP IDE.

So here is the question. I run Atom, and this is my project folder:

/project
/project/includes/

Where do I put the above-mentioned file (the one from the thread start)?

Question: does the above-mentioned file belong in that folder, the includes folder?

Many thanks for any and all help and for hints with this.

Love to hear from you.

regards

Edited March 2, 2020 by tarifa

March 2, 2020

It looks like this:

C:\Users\Kasper\Documents\_mk_\_dev_\php\ -> here is my project file
C:\Users\Kasper\Documents\_mk_\_dev_\php\includes -> here the "simplehtmldom-parser" from https://sourceforge.net/projects/simplehtmldom/ goes in

I am now going to test-run the whole thing on my machine, using Atom.

I will come back later today.
Love to hear from you.

regards

Edited March 2, 2020 by tarifa

March 6, 2020

Hi there Rix, hello dear +virtorio,

First of all, many thanks for the reply and the hint. I am very glad to hear from you. Thanks for encouraging me to go this way. I appreciate every bit of help.

I now have a first approach. The "semantic" class is supposed to be "eyp-card".


function get_eyp_cards_data(){
  $dom = new DOMDocument();
  $my_cards = array();

  // real-world HTML is rarely valid: collect parser warnings instead of printing them
  libxml_use_internal_errors(true);

  // loadHTMLFile() parses HTML (load() expects well-formed XML); true or false https://www.php.net/manual/en/domdocument.loadhtmlfile.php
  if ( $dom->loadHTMLFile('https://europa.eu/youth/volunteering/organisations_en') ) {
    $domx = new DOMXPath($dom);
    // the leading // searches the whole document; a bare "div" only matches children of the document node
    $eyp_cards = $domx->query('//div[contains(@class,"eyp-card")]'); // returns DOMNodeList https://www.php.net/manual/en/class.domnodelist.php

    if ( $eyp_cards->length > 0 ) { // length IS a property of DOMNodeList. works but looks a bit JSy
      foreach ( $eyp_cards as $eyp_card ) {
        // Debug: echo '<pre>', var_dump($eyp_card), '</pre>';
        $title = $eyp_card->getElementsByTagName('h5')->item(0); // null if the card has no h5
        $my_cards[] = array(
          'title' => $title ? $title->nodeValue : '',
          'content' => $eyp_card->textContent, // all text in the card, includes title
        );
      }
    }
  }
  return !empty($my_cards) ? $my_cards : false;
}

$my_cards = get_eyp_cards_data();

The reference for the selector:

https://stackoverflow.com/questions/1390568/how-can-i-match-on-an-attribute-that-contains-a-certain-string
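One detail of such a query is easy to miss: a path that starts with a bare div only looks at direct children of the context node, while //div searches the whole tree. Below is a small sketch of that difference in Python, using the standard library's ElementTree. The markup is a made-up, well-formed stand-in for the real page, and because ElementTree has no contains() function the class test is done in plain Python (as a whole-word match, which is slightly stricter than contains()).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page; the real markup is larger.
doc = ET.fromstring(
    "<html><body>"
    '<div class="wrapper">'
    '<div class="eyp-card block-is-flex"><h5>First card</h5></div>'
    '<div class="eyp-card"><h5>Second card</h5></div>'
    "</div>"
    "</body></html>"
)

# "div" only looks at direct children of the context node (<html> here),
# so nothing is found: the divs sit deeper in the tree.
direct = doc.findall("div")

# ".//div" walks all descendants, like "//div" in a full XPath engine.
cards = [
    div for div in doc.findall(".//div")
    if "eyp-card" in div.get("class", "").split()
]

titles = [card.findtext("h5") for card in cards]
print(len(direct), titles)
```

The wrapper div is found by the descendant search as well, but the class filter drops it, so only the two cards remain.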

Regarding the target page, I want to gather some information from this site:

https://europa.eu/youth/volunteering/organisations_en#open

Note: there are approximately 200 pages or more.

I guess that I will rework this and enhance it to get some data stored in an SQL database.

Many thanks for all your feedback and your hints.

March 6, 2020

To find out more about how to work with DOMDocument, I carry on with a smaller example. I have this HTML code:

<html>
<head>
    ...
</head>
<body>
    <div>
        <div class="foo" data-type="bar">
            SOMECONTENTWITHMORETAGS
        </div>
    </div>
</body>
</html>

Now I would like to return the complete HTML of a DOMElement, including its own tag and attributes. How can I achieve that?



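// Returns the inner HTML of $node: the markup of its children, without the node's own tag
private function get_html_from_node($node){
  $html = '';
  $children = $node->childNodes;

  foreach ($children as $child) {
    // copy each child into a temporary document and serialise it
    $tmp_doc = new DOMDocument();
    $tmp_doc->appendChild($tmp_doc->importNode($child,true));
    $html .= $tmp_doc->saveHTML();
  }
  return $html;
}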

Okay, so far so good: with the function above I can already get the "foo" element, but only its content.

Furthermore, I guess it is well worth taking a closer look at the optional argument to DOMDocument::saveHTML, which says "output this element only":

return $node->ownerDocument->saveHTML($node);

Note that this argument is available in PHP 7; it has been there since PHP 5.3.6. Before that, you would need to use DOMDocument::saveXML instead. The result can be very helpful. Also, if we already have a reference to the document, we can just do this:

$doc->saveHTML($node);
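To make the difference between "only the content" and "the element itself" concrete, here is the same distinction sketched in Python with the standard library's ElementTree. This is an analogy to the PHP calls above, not a translation of them. The helper names inner_html and outer_html are made up for illustration, and the sample markup is a shortened, well-formed variant of the snippet above with one extra tag added inside "foo".

```python
import copy
import xml.etree.ElementTree as ET

# Shortened variant of the sample markup; the <b> tag is added for illustration.
root = ET.fromstring(
    '<div><div class="foo" data-type="bar">SOME<b>CONTENT</b>WITHMORETAGS</div></div>'
)
node = root.find("div")


def inner_html(el):
    """Markup of the children only, without the element's own tag."""
    parts = [el.text or ""]
    for child in el:
        # tostring() also emits the text that follows each child (its tail),
        # which is exactly what belongs to the parent's content.
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


def outer_html(el):
    """Markup of the element itself, including its tag and attributes."""
    clone = copy.copy(el)  # shallow copy, so the original tree is untouched
    clone.tail = None      # text after the element belongs to its parent
    return ET.tostring(clone, encoding="unicode", method="html")


print(inner_html(node))
print(outer_html(node))
```

The first call corresponds to what the loop over childNodes produces, the second to what saveHTML($node) is asked to produce.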

Okay, and now I will work on the above-mentioned example, the europa.eu page.

If anybody has some ideas or hints, I appreciate any and all help.

Edited March 6, 2020 by tarifa
more insights
