When brainstorming for this project, I knew I'd eventually need to write a web scraper to gather text content for the screaming.computer. I did quite a bit of Googling, and none of the scraping options looked promising. The available PHP libraries tended to be wildly overcomplicated for my needs, or very out-of-date, or just unnecessarily difficult to use. There were also a bunch of cloud-based scrapers and a few browser plugins, but these have dependencies I'd rather avoid.
Discovering the PHP Simple HTML DOM Parser library was like striking gold. It's up to date, easy to use, and quite full-featured.
Install SimpleHtmlDom using composer
Create a directory for the project, then attempt to install the stable version of the library:
composer require simplehtmldom/simplehtmldom
Could not find a version of package simplehtmldom/simplehtmldom matching your minimum-stability (stable).
Attempt to install any available version of the library:
composer require simplehtmldom/simplehtmldom:*
Your requirements could not be resolved to an installable set of packages.
The requested package simplehtmldom/simplehtmldom * is satisfiable by simplehtmldom/simplehtmldom
Okay, fine, we'll install the Release Candidate by name:
composer require simplehtmldom/simplehtmldom:2.0-RC2
Clearly I don't know anything about how to use composer, but we have success nonetheless!
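For reference, the first install failed because only 2.0 release candidates of the library have been published, and Composer defaults to stable packages only. An alternative to pinning the RC by name (a sketch using standard Composer settings; the exact version constraint here is an assumption) is to allow pre-release packages in composer.json:

```json
{
    "minimum-stability": "RC",
    "prefer-stable": true,
    "require": {
        "simplehtmldom/simplehtmldom": "^2.0@RC"
    }
}
```

With minimum-stability lowered like this, a plain `composer require simplehtmldom/simplehtmldom` can resolve the release candidate on its own.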
Scrape the web using PHP
This PHP fragment shows one basic use of SimpleHtmlDom: loading the CBC website and printing a list of all links on the page:
// Pull in the Composer autoloader and the SimpleHtmlDom 2.0 namespace.
require 'vendor/autoload.php';
use simplehtmldom\HtmlWeb;

$webParser = new HtmlWeb();
$htmlDoc = $webParser->load ('https://www.cbc.ca/news/');

// Print the href attribute of every anchor tag on the page.
foreach ($htmlDoc->find ('a') as $anchor)
    echo $anchor->href, '<br>';
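SimpleHtmlDom can also parse HTML you already have in hand, which is handy for testing without a network connection. A minimal offline sketch (assuming the same Composer autoload setup and the 2.0 `simplehtmldom` namespace; the markup string is made up for illustration):

```php
<?php
require 'vendor/autoload.php';
use simplehtmldom\HtmlDocument;

// Parse a literal HTML string instead of fetching a URL.
$html = '<ul><li><a href="/one">First</a></li><li><a href="/two">Second</a></li></ul>';
$htmlDoc = new HtmlDocument($html);

// plaintext strips the tags; href reads the attribute.
foreach ($htmlDoc->find('a') as $anchor)
    echo $anchor->plaintext, ' => ', $anchor->href, "\n";
```

The same find()/attribute interface works whether the document came from HtmlWeb or from a string, so parsing logic can be developed against fixed fixtures before pointing it at a live site.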
There are also some very helpful examples in the SimpleHtmlDom documentation.