We had our first real snowfall of the season in Maine today, and that usually means school delays/closings. Our “local” station – @WCSH6 – has a cancellations/closings page as well as an SMS notification service. I decided this morning that I needed a command-line version (and, eventually, a version that sends me a Twitter DM), but I was also tight on time (a lunchtime meeting ending early is responsible for this blog post).
While I’ve consumed my share of Beautiful Soup and can throw down some mechanize with the best of them, it occurred to me that there might be an even easier way, and one that may also help with the eventual blocking of such a scraping service: a Google Spreadsheet using importHTML to pull in the station’s closings table.
I then did a File→Publish to the web, set Sheet 1 to “Automatically republish when changes are made”, and had the link point to the CSV version of the data.
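The spreadsheet side is just an importHTML() formula in a cell; a sketch along these lines (the station URL and table index here are placeholders, not the exact ones I used):

```
=IMPORTHTML("http://www.wcsh6.com/closings", "table", 1)
```

Google re-evaluates that formula on its end, which is what makes the published CSV stay current without any scraping on my side.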
The raw output looks a bit like:
```
Name,Status,Last Updated
,,
Westbook Seniors,Luncheon PPD to January 7th,12/17/2012 5:22:51
,,
Allied Wheelchair Van Services,Closed,12/17/2012 6:49:47
,,
American Legion - Dixfield,Bingo cancelled,12/17/2012 11:44:12
,,
American Legion Post 155 - Naples,Closed,12/17/2012 12:49:00
```
The conversion produces some “blank” lines, but those are easy enough to filter out with a quick grep:

```
curl --silent "https://docs.google.com/spreadsheet/pub?key=0AlCY1qfmPPZVdFBsX3kzLUVHZl9Mdmw3bS1POWNsWnc&single=true&gid=0&output=csv" | grep -v "^,,"
```
And looking for our kids’ specific school(s) is an easy grep as well.
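A minimal sketch of the whole filter chain, using a sample of the CSV in place of the live curl call so the steps are visible (the school/organization name here is a hypothetical placeholder):

```shell
#!/bin/sh
# In real use, replace the $csv variable with the live feed:
#   curl --silent "https://docs.google.com/spreadsheet/pub?key=...&output=csv"
SCHOOL="American Legion"   # hypothetical name to watch for

csv='Name,Status,Last Updated
,,
Westbook Seniors,Luncheon PPD to January 7th,12/17/2012 5:22:51
,,
American Legion - Dixfield,Bingo cancelled,12/17/2012 11:44:12'

# drop the "blank" ,, rows, then keep only the rows we care about
echo "$csv" | grep -v "^,," | grep "$SCHOOL"
```

An empty result means nothing matching is on the list, which makes the exit status of the final grep a handy signal for scripting.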
The reason this is interesting is that importHTML is dynamic: Google re-converts the HTML table each time the code retrieves the CSV URL, so the data is always fresh. Couple that with the fact that Google is far less likely to be blocked than my IP address(es), and this seems to be a pretty nice alternative to traditional parsing.
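Since the output is plain text, pairing the pipeline with cron gets most of the way to the notification version; a hypothetical crontab entry (script name, path, and schedule are all placeholders):

```
# check every 15 minutes on weekday mornings; path is hypothetical
*/15 5-9 * * 1-5 /path/to/check_closings.sh
```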
If I get some time over the break, I’ll do a quick benchmark of this method against some Python and Perl scraping/parsing approaches.