screaming.computer


Mount a Network Share in Linux


To access a shared folder on your network from Linux Mint, it must be mounted into the local Linux filesystem. Getting the read/write permissions right is a bit tricky, and it's a step often glossed over in other tutorials.

This guide, as usual, assumes your system is configured per the LAMP server setup guide. Furthermore, I assume your network share is wide open (permitting anonymous access) and you want to allow full read/write access from Linux. If your use case is different, you'll need to adjust the mount parameters.

Since I want to access this share from PHP and Apache, I will be setting its group to the www-data usergroup and enabling group read/write permissions.

Create a directory for your mount point

Create a directory, then set owner and permissions:

sudo mkdir /media/my-network-share
sudo chown $USER:www-data -R /media/my-network-share
sudo chmod 02775 /media/my-network-share
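
To double-check that the owner, group, and setgid bit took effect, list the directory; the mode should show as drwxrwsr-x, owned by your user and the www-data group:

ls -ld /media/my-network-share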

Mount a network share from terminal

Mount the folder using the mount command (all one line, no spaces after commas):

sudo mount -t cifs -o username=$USER,password=,uid=$(id -u),gid=$(id -g www-data),file_mode=0664,dir_mode=02775 //SHARE-IP/SHARE-PATH /media/my-network-share/

SHARE-IP is the IP address of the system that has the shared folder. (You should also be able to use its hostname; however, on my system the name wouldn't resolve, so rather than fixing that issue I just used the IP instead.) SHARE-PATH is the shared folder name or path.

The uid and gid parameters mean the mounted directory will be owned by you (the current Linux user) and the www-data Linux usergroup.

The file_mode and dir_mode parameters set the permissions on the mounted directory so that both owner and group can read and write.
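
Once the mount succeeds, you can confirm the share is visible and writable (the test filename below is just an example):

findmnt -t cifs
touch /media/my-network-share/write-test.txt
rm /media/my-network-share/write-test.txt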

This network share will remain mounted until the next reboot. See below for persistent mounts.

Unmount all shares

sudo umount -a -t cifs -l
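
To unmount only this one share instead:

sudo umount /media/my-network-share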

Auto-mount a network share on every boot

To have the network share automatically mounted on every boot, add it to the fstab file. This requires slightly different syntax, and you must manually determine your user and group ID numbers in advance.

Find your USERID number:

id -u

Find the GROUPID number for the www-data usergroup:

id -g www-data

Use nano to append a line to the fstab file (all one line, no spaces after commas):

sudo nano -w /etc/fstab

//SHARE-IP/SHARE-PATH /media/my-network-share cifs username=USERNAME,password=,uid=USERID,gid=GROUPID,file_mode=0664,dir_mode=02775

Ctrl+O, Enter (to save)
Ctrl+X (to exit nano)

To immediately mount via the fstab file, run:

sudo mount -a

To ensure everything works as expected, reboot your system, then check the share:

sudo reboot
ls -l /media/my-network-share/


Scraping the News


Over the past five weeks I've been busy working on a web scraper to gather daily news content from the web. The scraper has been running for about a month, and has so far collected over 10,000 unique articles. It continues to run, gathering new content in real time.

Scraper activity summary

Source        Total indexed URLs   Total parsed URLs   Daily scraped URLs (7-day average)
News Site A                 2899                2668                   105
News Site B                 3202                2904                    99
News Site C                 5617                4649                   206
Total                      11718               10221                   410

At this time the article text is simply being stored in a database along with some minimal metadata (date of publication, etc.). The eventual goal is to use this text as source material for the screaming.computer's generative algorithms. There is also great potential to run various statistical analyses on the text. All of that is still to come.

Scraping process

Never having written a web scraper before, I stuck to a straightforward approach using PHP and the Simple HTML DOM Parser library. The resulting indexer/scraper is only as sophisticated as it needs to be to get the job done. It follows a three-stage process:

  1. Indexing
    Gather links to potential news articles from a news site's main page.
  2. Scraping
    Screen out unwanted links; download the page; verify it conforms to article format.
  3. Parsing
    Parse the article page; strip out unwanted content (related links sections, pullquotes, embedded multimedia, etc.); reduce to plain text; store headline, publication date, and article text in database.
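
To give a rough idea of what stage 1 looks like in practice, here is a minimal sketch of an indexer built on Simple HTML DOM Parser. The site URL, the "/article/" link filter, and the output handling are hypothetical placeholders rather than the actual scraper code:

<?php
// Rough sketch of the indexing stage using Simple HTML DOM Parser.
// The URL and the "/article/" filter are placeholders; every real
// source needs its own link-screening rules.
include 'simple_html_dom.php';

$index = file_get_html('https://news-site-a.example/');
if (!$index) {
    exit("Could not fetch the main page\n");
}

$candidates = [];
foreach ($index->find('a') as $link) {
    $url = $link->href;
    if (is_string($url) && preg_match('#/article/#', $url)) {
        $candidates[$url] = true;   // keyed by URL, so repeated links collapse
    }
}

// The real indexer stores candidate URLs in a database for the scraping
// stage; here they are simply printed.
foreach (array_keys($candidates) as $url) {
    echo $url, "\n";
}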

All this is scheduled using cron jobs.
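
For completeness, the scheduling itself is just ordinary crontab entries along these lines (the timings and script paths shown here are made-up placeholders, not the actual configuration):

# m    h  dom mon dow  command
*/30   *  *   *   *    /usr/bin/php /path/to/indexer.php
*/10   *  *   *   *    /usr/bin/php /path/to/scraper.php
15     *  *   *   *    /usr/bin/php /path/to/parser.php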

There is a bunch of logic to filter out duplicate articles (news sites love to provide the same content under multiple headlines and URLs). The code to normalize the article text (weeding out unwanted bits of the web page) is customized for each site. This custom code is necessarily brittle and threatens to break at any moment, but such is the nature of web scraping!
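
As an illustration of the general idea behind the duplicate filtering (not the actual code), one simple approach is to fingerprint each article by hashing its normalized headline and body, then discard anything whose fingerprint is already in the database:

<?php
// Sketch of a simple duplicate check; not the scraper's actual logic.
// Lower-casing and collapsing whitespace means trivially re-published
// copies of the same article produce the same hash.
function article_fingerprint(string $headline, string $body): string
{
    $normalized = strtolower(preg_replace('/\s+/', ' ', trim($headline . ' ' . $body)));
    return sha1($normalized);
}

// Before inserting a new article, look up its fingerprint (for example
// against a UNIQUE column in the database) and skip it on a match.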

[Image: stack of newspapers]

Next steps

The indexer/scraper/parser framework is robust enough that I can add additional news sources in the future if desired. Each new source takes a few days to customize, including many rounds of reviewing and correcting the results to handle edge cases.

With the current sources combined, I'm getting about 350 successfully parsed articles per day. This should be sufficient to move on to the next stage of the project: breaking the articles into component parts and performing basic text analysis.


Saturday Night Coding Fuel


YouTube randomly suggested this live DJ set by Belgium-based Amelie Lens, and it makes the perfect background for some weekend evening coding.

Been a while since I've listened to some proper techno.

As one commenter put it, “come for the music, stay for the cat.”