When brainstorming for this project, I knew I'd eventually need to write a web scraper to gather text content for the screaming.computer. I did quite a bit of Googling, and none of the scraping options looked promising. The available PHP libraries tended to be wildly overcomplicated for my needs, or very out-of-date, or just unnecessarily difficult to use. There were also a bunch of cloud-based scrapers and a few browser plugins, but these have dependencies I'd rather avoid.
Discovering the PHP Simple HTML DOM Parser library was like striking gold. It's up to date, easy to use, and still quite fully featured.
Install SimpleHtmlDom using composer
Having previously installed composer to manage another library, I already have it available on my LAMP server.
Create a directory for the SimpleHtmlDom library:
mkdir /var/www/_simplehtmldom
cd /var/www/_simplehtmldom
Attempt to install the stable version of the library:
composer require simplehtmldom/simplehtmldom
Could not find a version of package simplehtmldom/simplehtmldom matching your minimum-stability (stable).
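That makes sense in hindsight: the package apparently has no stable release, so composer's default minimum-stability of stable rules every version out. A quick way to see which versions composer can actually find (a diagnostic step I'd add here, using the standard composer show command):
composer show simplehtmldom/simplehtmldom --all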
Attempt to install any available version of the library:
composer require simplehtmldom/simplehtmldom:*
Your requirements could not be resolved to an installable set of packages.
The requested package simplehtmldom/simplehtmldom * is satisfiable by simplehtmldom/simplehtmldom
Okay, fine, we'll install the Release Candidate by name:
composer require simplehtmldom/simplehtmldom:2.0-RC2
Clearly I don't know anything about how to use composer, but we have success nonetheless!
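For what it's worth, the composer docs suggest a stability flag in the version constraint would have done the same thing without naming the release. Something like this should work, though I haven't tried it myself:
composer require simplehtmldom/simplehtmldom:@RC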
Scrape the web using PHP
This PHP fragment shows one basic use of SimpleHtmlDom, loading the CBC website and printing a list of all links on the page:
require_once ('/var/www/_simplehtmldom/vendor/autoload.php');
use simplehtmldom\HtmlWeb;

// fetch the page and parse it into a DOM tree
$webParser = new HtmlWeb();
$htmlDoc = $webParser->load ('https://www.cbc.ca/news/');
if ($htmlDoc === null)
    die ('failed to load page');

// find() takes CSS-style selectors; 'a' matches every anchor element
foreach ($htmlDoc->find ('a') as $anchor)
    echo $anchor->href, '<br>';
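Since my real goal is gathering text content rather than links, here's a minimal sketch of the same pattern pulling headline text instead. The h3 selector is my guess at where CBC keeps its headlines, not something the site guarantees; the plaintext property returns an element's text with any nested tags stripped:
require_once ('/var/www/_simplehtmldom/vendor/autoload.php');
use simplehtmldom\HtmlWeb;

$webParser = new HtmlWeb();
$htmlDoc = $webParser->load ('https://www.cbc.ca/news/');
if ($htmlDoc === null)
    die ('failed to load page');

// 'h3' is an assumption about CBC's markup; adjust as the site changes
foreach ($htmlDoc->find ('h3') as $headline)
    echo trim ($headline->plaintext), '<br>';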
There are also some very helpful examples at the SimpleHtmlDom site.