Over the past five weeks I've been busy working on a web scraper to gather daily news content from the web. The scraper has been running for about a month and has so far collected over 10,000 unique articles. It continues to run, gathering new content in real time.
Scraper activity summary
|Source|Total indexed URLs|Total parsed URLs|Daily scraped URLs|
|---|---|---|---|
|News Site A|2,899|2,668|105|
|News Site B|3,202|2,904|99|
|News Site C|5,617|4,649|206|
At this time the article text is simply being stored in a database along with some minimal metadata (date of publication, etc.). The eventual goal is to use this text as source material for the screaming.computer's generative algorithms. There is also great potential to run various statistical analyses on the text. All of that is still to come.
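The storage step can be sketched roughly as follows. This uses PDO with an in-memory SQLite database for illustration; the table and column names are hypothetical, and the real schema may differ:

```php
<?php
// Sketch of storing plain article text plus minimal metadata.
// Schema is a hypothetical example, not the scraper's actual one.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE articles (
    url        TEXT PRIMARY KEY,
    headline   TEXT,
    published  TEXT,   -- publication date
    body       TEXT    -- plain article text
)');

$insert = $db->prepare(
    'INSERT OR IGNORE INTO articles (url, headline, published, body)
     VALUES (?, ?, ?, ?)'
);
$insert->execute([
    'https://example.com/news/2019/some-headline',
    'Some Headline',
    '2019-05-01',
    'Plain text of the article…',
]);
```

Keying on the URL with `INSERT OR IGNORE` makes re-runs idempotent, so re-scraping a page already in the database is harmless.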
Never having written a web scraper before, I stuck to a straightforward approach using PHP and the Simple HTML DOM Parser library. The resulting indexer/scraper works in three stages:
1. Gather links to potential news articles from a news site's main page.
2. Screen out unwanted links; download the page; verify it conforms to article format.
3. Parse the article page; strip out unwanted content (related links sections, pullquotes, embedded multimedia, etc.); reduce to plain text; store headline, publication date, and article text in database.
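The indexing stage looks something like the sketch below. For a self-contained example it uses PHP's built-in DOM extension rather than Simple HTML DOM, and the URL pattern is a made-up placeholder; in practice each site gets its own filter rules:

```php
<?php
// Sketch of stage 1: gather candidate article links from a main page.
// Uses the built-in DOM extension; the /news/YYYY/ pattern is hypothetical.
function gather_article_links(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);   // suppress warnings from messy real-world markup

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Screen out unwanted links: keep only article-shaped paths.
        if (preg_match('#^/news/\d{4}/#', $href)) {
            $links[$baseUrl . $href] = true;   // array keys de-duplicate for free
        }
    }
    return array_keys($links);
}

$sample = '<html><body>
    <a href="/news/2019/some-headline">Story</a>
    <a href="/news/2019/some-headline">Story (repeated link)</a>
    <a href="/video/clip">Clip</a>
</body></html>';

print_r(gather_article_links($sample, 'https://example.com'));
```

With Simple HTML DOM the loop body is much the same, just written as `$html->find('a')` and `$a->href`.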
All this is scheduled using cron jobs.
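A crontab along these lines drives the pipeline (script names, paths, and timings are hypothetical examples):

```
# Index each source's main page hourly; parse newly indexed articles shortly after.
0 * * * *   php /home/scraper/index_sources.php
15 * * * *  php /home/scraper/parse_articles.php
```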
There is a bunch of logic to filter out duplicate articles (news sites love to provide the same content under multiple headlines and URLs). The code to normalize the article text (weeding out unwanted bits of the web page) is customized for each site. This custom code is necessarily brittle and threatens to break at any moment, but such is the nature of web scraping!
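One simple way to catch the same article republished under a different headline and URL is to fingerprint the normalized body text. This is an illustrative sketch, not the scraper's actual duplicate-detection logic:

```php
<?php
// Sketch of duplicate detection: two articles count as duplicates when
// their normalized body text hashes to the same value, regardless of
// headline or URL. (Illustrative only; real logic is customized per site.)
function article_fingerprint(string $text): string
{
    $text = strtolower($text);
    $text = preg_replace('/\s+/', ' ', trim($text));   // collapse whitespace
    return sha1($text);
}

$a = "PM announces new policy.\nDetails to follow.";
$b = "PM announces  new policy. Details to follow.";   // same story, new URL

var_dump(article_fingerprint($a) === article_fingerprint($b)); // bool(true)
```

An exact-match hash like this only catches verbatim re-posts; near-duplicates with small edits would need fuzzier comparison.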
With the current sources combined, I'm getting about 350 successfully parsed articles per day. This should be sufficient to move on to the next stage of the project: breaking the articles into component parts and performing basic text analysis.