Over the past five weeks I've been working on a web scraper that gathers daily news content from the web. The scraper has been running for about a month and has so far collected over 10,000 unique articles. It continues to run, gathering new content in real time.
Scraper activity summary
| Source | Total indexed URLs | Total parsed URLs | Daily scraped URLs (7-day average) |
| --- | --- | --- | --- |
| News Site A | 2899 | 2668 | 105 |
| News Site B | 3202 | 2904 | 99 |
| News Site C | 5617 | 4649 | 206 |
| Total | 11718 | 10221 | 410 |
At this time the article text is simply being stored in a database along with some minimal metadata (date of publication, etc.). The eventual goal is to use this text as source material for the screaming.computer's generative algorithms. There is also great potential to run various statistical analyses on the text. All of that is still to come.
Scraping process
Never having written a web scraper before, I stuck to a straightforward approach using PHP and the Simple HTML DOM Parser library. The resulting indexer/scraper works in three stages:
- Indexing: gather links to potential news articles from a news site's main page.
- Scraping: screen out unwanted links; download the page; verify it conforms to article format.
- Parsing: parse the article page; strip out unwanted content (related links sections, pullquotes, embedded multimedia, etc.); reduce to plain text; store the headline, publication date, and article text in the database.
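The scraper itself is PHP, but the indexing stage can be sketched in Python using only the standard library. Everything here is illustrative: the `/news/` path filter and the example URLs are invented stand-ins for each site's real link-screening rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def index_front_page(html, base_url, article_hint="/news/"):
    """Return candidate article URLs: links whose path suggests an article.
    article_hint is a made-up placeholder; each real site needs its own rule."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    # A set drops repeated links to the same story on the front page.
    return sorted({url for url in collector.links if article_hint in url})

front_page = '''<html><body>
  <a href="/news/2024/big-story">Big story</a>
  <a href="/about">About us</a>
  <a href="/news/2024/big-story">Big story (again)</a>
</body></html>'''
print(index_front_page(front_page, "https://example.com"))
```

Each site would get its own `article_hint`-style rule, which is exactly where the per-site brittleness creeps in.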
All this is scheduled using cron jobs.
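A crontab for the three stages might look something like this; the times, paths, and script names are hypothetical, not taken from the post:

```cron
# Index front pages every hour, on the hour.
0 * * * *   php /home/scraper/index.php
# Scrape newly indexed URLs 15 minutes later.
15 * * * *  php /home/scraper/scrape.php
# Parse downloaded pages at half past the hour.
30 * * * *  php /home/scraper/parse.php
```

Staggering the stages gives each step time to finish before the next one picks up its output.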
There is a bunch of logic to filter out duplicate articles (news sites love to provide the same content under multiple headlines and URLs). The code to normalize the article text (weeding out unwanted bits of the web page) is customized for each site. This custom code is necessarily brittle and threatens to break at any moment, but such is the nature of web scraping!
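One simple way to catch the same article republished under a new headline or URL (a sketch of the general idea, not the post's actual duplicate-filtering logic) is to fingerprint the normalized article text and skip any fingerprint already seen:

```python
import hashlib

def fingerprint(text):
    """Collapse whitespace and case before hashing, so trivially
    reissued copies of the same article produce one fingerprint."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(text):
    """True if an article with this (normalized) text was already stored."""
    fp = fingerprint(text)
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(is_duplicate("Breaking: markets rise."))   # first sighting
print(is_duplicate("Breaking:  markets RISE."))  # same text, different spacing/case
```

In a real pipeline the fingerprints would live in the database alongside the articles rather than in an in-memory set.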
Next steps
The indexer/scraper continues to run on its cron schedule, steadily accumulating new articles.
With the current sources combined, I'm getting about 350 successfully parsed articles per day. This should be sufficient to move on to the next stage of the project: breaking the articles into component parts and performing basic text analysis.
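As a taste of what basic text analysis might involve (purely illustrative; no analysis code exists yet), counting word frequencies across the stored article text is a natural first step:

```python
import re
from collections import Counter

def word_counts(articles):
    """Tokenize each article into lowercase words and tally them."""
    counts = Counter()
    for text in articles:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Invented stand-ins for rows pulled from the article database.
articles = [
    "The markets rose sharply on Monday.",
    "Markets fell on Tuesday after the rally.",
]
print(word_counts(articles).most_common(3))
```

From a tally like this it's a short hop to per-source vocabulary comparisons and other statistics the post alludes to.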