Over the past five weeks I've been busy working on a web scraper to gather daily news content from the web. The scraper has been running for about a month, and has so far collected over 10,000 unique articles. It continues to run, gathering new content in real time.

Scraper activity summary

Source        Total indexed URLs   Total parsed URLs   Daily scraped URLs (7-day average)
News Site A   2899                 2668                105
News Site B   3202                 2904                99
News Site C   5617                 4649                206

At this time the article text is simply being stored in a database along with some minimal metadata (date of publication, etc.). The eventual goal is to use this text as source material for the screaming.computer's generative algorithms. There is also great potential to run various statistical analyses on the text. All of that is still to come.

Scraping process

Never having written a web scraper before, I stuck to a straightforward approach using PHP and the Simple HTML DOM Parser library. The resulting indexer/scraper is only as sophisticated as it needs to be to get the job done. It follows a three-stage process:

  1. Indexing
    Gather links to potential news articles from a news site's main page.
  2. Scraping
    Screen out unwanted links; download the page; verify it conforms to article format.
  3. Parsing
    Parse the article page; strip out unwanted content (related links sections, pullquotes, embedded multimedia, etc.); reduce to plain text; store headline, publication date, and article text in database.
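
To make the first two stages concrete, here is a minimal sketch of indexing and link screening. It is written in Python using only the standard library, not the PHP and Simple HTML DOM Parser code described above, and the host name, URL paths, and skip_prefixes list are invented for illustration. Real screening rules would be specific to each site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))


def index_links(html, base_url):
    """Stage 1: gather candidate article links from a front page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def looks_like_article(url, site_host, skip_prefixes=("/video/", "/tag/")):
    """Stage 2 screening: keep same-site links outside unwanted sections.

    skip_prefixes is a made-up example; it is not taken from any real site.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or parts.netloc != site_host:
        return False
    if parts.path in ("", "/"):
        return False
    return not parts.path.startswith(skip_prefixes)


# Hypothetical front page with one article link and three links to reject.
front_page = """
<a href="/news/storm-hits-coast">Storm</a>
<a href="/video/live">Live</a>
<a href="https://ads.example.net/click">Ad</a>
<a href="/">Home</a>
"""
candidates = index_links(front_page, "https://news.example.com/")
articles = [u for u in candidates if looks_like_article(u, "news.example.com")]
print(articles)
```

Downloading each surviving URL and checking that the page conforms to the article format would follow from here; that part is left out because it depends entirely on the site's markup.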

All this is scheduled using cron jobs.

There is a bunch of logic to filter out duplicate articles (news sites love to provide the same content under multiple headlines and URLs). The code to normalize the article text (weeding out unwanted bits of the web page) is customized for each site. This custom code is necessarily brittle and threatens to break at any moment, but such is the nature of web scraping!
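
One simple way to combine the two ideas above (per-site cleanup and duplicate filtering) is to strip the unwanted elements, reduce the rest to plain text, and compare a hash of that text against the articles already stored. The sketch below is in Python rather than the PHP used here, and it is an assumption about how such a filter could work, not the scraper's actual logic. The SITE_RULES entries, class names, and URLs are hypothetical. It also expects well-formed markup and only catches articles whose cleaned text is identical.

```python
import hashlib
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Reduce HTML to text, skipping everything inside unwanted elements."""

    def __init__(self, skip_tags=(), skip_classes=()):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.skip_classes = set(skip_classes)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if tag in self.skip_tags or classes & self.skip_classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def article_text(html, skip_tags, skip_classes):
    extractor = TextExtractor(skip_tags, skip_classes)
    extractor.feed(html)
    extractor.close()
    # Collapse whitespace so layout differences do not change the text.
    return re.sub(r"\s+", " ", " ".join(extractor.chunks)).strip()


def fingerprint(text):
    """Hash of the case-folded text; identical text gives an identical hash."""
    return hashlib.sha256(text.casefold().encode("utf-8")).hexdigest()


# Hypothetical per-site rules; the class names are invented for this example.
SITE_RULES = {
    "news-site-a": {"skip_tags": ["script", "style", "figure"],
                    "skip_classes": ["related-links", "pullquote"]},
}

# The same article body served under two different URLs.
pages = [
    ("https://news.example.com/storm-hits-coast",
     '<p>Heavy rain fell overnight.</p><div class="related-links">'
     '<a href="/x">More storms</a></div><p>Roads are closed.</p>'),
    ("https://news.example.com/coast-battered",
     '<p>Heavy  rain fell overnight.</p>\n<aside class="pullquote">'
     '<p>Roads are closed.</p></aside>\n<p>Roads are closed.</p>'),
]

seen = {}    # fingerprint -> first URL stored with that text
stored = []  # (url, text) pairs that would be written to the database
for url, body_html in pages:
    text = article_text(body_html, **SITE_RULES["news-site-a"])
    key = fingerprint(text)
    if key in seen:
        continue  # same text already stored under another URL
    seen[key] = url
    stored.append((url, text))

print(stored)
```

Here the second page is dropped because its cleaned text matches the first. Catching articles that were lightly edited between versions would need a fuzzier comparison than an exact hash.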

Next steps

The indexer/scraper/parser framework is robust enough that I can add more news sources in the future if desired. Each new source takes a few days to customize, including many rounds of reviewing and correcting the results to handle edge cases.

With the current sources combined, I'm getting about 350 successfully parsed articles per day. This should be sufficient to move on to the next stage of the project: breaking the articles into component parts and performing basic text analysis.