When brainstorming for this project, I knew I'd eventually need to write a web scraper to gather text content for the screaming.computer. I did quite a bit of Googling, and none of the scraping options looked promising. The available PHP libraries tended to be wildly overcomplicated for my needs, or very out-of-date, or just unnecessarily difficult to use. There were also a bunch of cloud-based scrapers and a few browser plugins, but these have dependencies I'd rather avoid.
Discovering the PHP Simple HTML DOM Parser library was like striking gold. It's up to date, easy to use, and still quite fully featured.
Install SimpleHtmlDom using composer
Having previously installed composer to manage another library, I already have it available on my LAMP server.
Create a directory for the SimpleHtmlDom library:
mkdir /var/www/_simplehtmldom
cd /var/www/_simplehtmldom
Attempt to install the stable version of the library:
composer require simplehtmldom/simplehtmldom
Could not find a version of package simplehtmldom/simplehtmldom matching your minimum-stability (stable).
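That makes sense in hindsight: the package apparently has no stable release, so composer's default minimum-stability of stable rules every version out. A quick way to see which versions composer can actually find (a diagnostic step I'd add here, using the standard composer show command):
composer show simplehtmldom/simplehtmldom --all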
Attempt to install any available version of the library:
composer require simplehtmldom/simplehtmldom:*
Your requirements could not be resolved to an installable set of packages.
The requested package simplehtmldom/simplehtmldom * is satisfiable by simplehtmldom/simplehtmldom
Okay, fine, we'll install the Release Candidate by name:
composer require simplehtmldom/simplehtmldom:2.0-RC2
Clearly I don't know anything about how to use composer, but we have success nonetheless!
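For what it's worth, the composer docs suggest a stability flag in the version constraint would have done the same thing without naming the release. Something like this should work, though I haven't tried it myself:
composer require simplehtmldom/simplehtmldom:@RC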
Scrape the web using PHP
This PHP fragment shows one basic use of SimpleHtmlDom, loading the CBC website and printing a list of all links on the page:
require_once ('/var/www/_simplehtmldom/vendor/autoload.php');
use simplehtmldom\HtmlWeb;

// fetch the page and parse it into a DOM tree
$webParser = new HtmlWeb();
$htmlDoc = $webParser->load ('https://www.cbc.ca/news/');
if ($htmlDoc === null)
    die ('failed to load page');

// find() takes CSS-style selectors; 'a' matches every anchor element
foreach ($htmlDoc->find ('a') as $anchor)
    echo $anchor->href, '<br>';
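Since my real goal is gathering text content rather than links, here's a minimal sketch of the same pattern pulling headline text instead. The h3 selector is my guess at where CBC keeps its headlines, not something the site guarantees; the plaintext property returns an element's text with any nested tags stripped:
require_once ('/var/www/_simplehtmldom/vendor/autoload.php');
use simplehtmldom\HtmlWeb;

$webParser = new HtmlWeb();
$htmlDoc = $webParser->load ('https://www.cbc.ca/news/');
if ($htmlDoc === null)
    die ('failed to load page');

// 'h3' is an assumption about CBC's markup; adjust as the site changes
foreach ($htmlDoc->find ('h3') as $headline)
    echo trim ($headline->plaintext), '<br>';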
There are also some very helpful examples at the SimpleHtmlDom site.