Development Archives

Category Archives: Development

An OS X R Task Runner for—and a Mini-R-centric review of—Microsoft’s Visual Studio Code Editor

Microsoft’s newfound desire to make themselves desirable to the hipster development community has caused them to make many things [open](https://github.com/Microsoft/) and/or free of late. One of these manifestations is [Visual Studio Code](https://code.visualstudio.com/), an [Atom](https://atom.io/)-ish editor for us code jockeys. I have friends at Microsoft and the Revolution R folks are there now, so I try to give things from Redmond a shot more than I previously would, especially when they make things for Mac.

VS code is so much like Atom (or even [Sublime Text](http://www.sublimetext.com/)) that I won’t go into a full review of it. Suffice it to say it has a file selector pane, editor panes, output panes, snippets, theme support and is pretty extensible. One requirement I appreciate is that it forces you to think of code in terms of projects (you select a directory to edit in) and I also appreciate that they made git a first-class citizen.

Since I do not spend much time building large, compiled applications (this—along with web apps—seems to be VS Code’s sweet spot) there isn’t much initial appeal for me. It also lacks the “intellisense” support for the main language I use (R) so I’m left with basic syntax highlighting (the 90’s called and want their basic editor capabilities back).

None of that would initially drive me away from using something like VS Code and I may end up using it for HTML/CSS/JavaScript projects or even fire it up when I need to do some work in Python or Go. But I won’t be using it for R any time soon. While the aforementioned lack of “intellisense” for R is an issue, I don’t rely on the auto-completion for R but it does occasionally speed up typing and definitely helps with the more esoteric function definitions in equally esoteric packages.

The biggest show-stopper for VS Code is the lack of REPL (a read-eval-print loop) for R. I can fire up an R script in Sublime Text or even Atom and run individual lines of code that are executed in an R session that runs in the background and outputs in an editor pane. It works well but it is (unsurprisingly) a far cry from the tight integration of similar functionality in RStudio. VS Code can run R scripts (it just runs the code through R as you would at the command-line) but has no REPL for R, which means you end up executing the entire script as you go along. No saved state (more on that in a second) means that the beautiful data frame your code created that took 10 minutes to build will take 10 minutes to build every time you tweak model parameters or ggplot2 aesthetics. Granted, you could call R with `–save` but then you have to check for the presence of data structures in your code (so you might as well be programming in non-interactive Python).

An offshoot of the details behind this show-stopper is that you do not get graphics output in a window. You get a single PDF of all plots, just as you would if you ran the R script at the command-line. If you’ve been spoiled by RStudio or even cutting and pasting code from an editor into the R GUI, you will immediately miss the graphics viewer pane.

Unless Microsoft (or some community contributors who desperately want to use R in VS Code for some reason) add some of this functionality to VS Code (including support for seamlessly spinning R scripts and knitting R markdown documents), I cannot recommend using to anyone in the R community.

Having said that, here’s the `tasks.json` configuration if you want to be able to hit `Command-Shift-B` in an R script in VS Code and have it execute and display the output. This configuration is for the official R Project build of R and should work even after a R version upgrade).

{
	"version": "0.1.0",
	"command": "/Library/Frameworks/R.framework/Resources/bin/R",
	"showOutput": "always",
	"args": [
		"--no-restore",
		"--no-save",
		"--quiet",
		"--file=${file}"
	]
}

If you are using VS Code for R (on any platform) your comments would be especially most welcome. It’d be great to hear why you’re using it and how you’ve configured it to help make you as productive as RStudio or ESS has for others.

Rforecastio 1.3.0 Update

2014-05-06 – 20:56
Posted in Development, Open Source, Weather
Tagged post, R, Rforecastio
Comments (2)

Thanks to a comment suggestion, the Rforecastio package is now up to version 1.3.0 and has a new parameter which lets you specify which time conversion function you want to use. Details are up on [github](https://github.com/hrbrmstr/Rforecastio).

Rforecastio 1.2.0 Bug-fix Update

2014-05-05 – 12:57
Posted in Development, Programming, Weather
Tagged forecast.io, post, R, Rforecastio
Comments (3)

Not even going to put an `R` category on this since I don’t want to pollute R-bloggers with this tiny post, but I had to provide the option to let folks specify `ssl.verifypeer=FALSE` (so I made it a generic option to pass in any CURL parameters) and I had a couple gaping bugs that I missed due to not clearing out my environment before building & testing.

Rforecastio Package Update (1.1.0)

2014-05-04 – 08:04
Posted in Development, Open Source, Programming, R, Weather
Tagged post, Rforecastio
Comments (7)

I’ve bumped up the version number of `Rforecastio` ([github](https://github.com/hrbrmstr/Rforecastio)) to `1.1.0`. The new
features are:

– removing the SSL certificate bypass check (it doesn’t need it
anymore)
– using `plyr` for easier conversion of JSON-\>data frame
– adding in a new `daily` forecast data frame
– roxygen2 inline documentation

library(Rforecastio)
library(ggplot2)
library(plyr)
 
# NEVER put API keys in revision control systems or source code!
fio.api.key= readLines("~/.forecast.io")
 
my.latitude = "43.2673"
my.longitude = "-70.8618"
 
fio.list <- fio.forecast(fio.api.key, my.latitude, my.longitude)
 
fio.gg <- ggplot(data=fio.list$hourly.df, aes(x=time, y=temperature))
fio.gg <- fio.gg + labs(y="Readings", x="Time", title="Houry Readings")
fio.gg <- fio.gg + geom_line(aes(y=humidity*100), color="green")
fio.gg <- fio.gg + geom_line(aes(y=temperature), color="red")
fio.gg <- fio.gg + geom_line(aes(y=dewPoint), color="blue")
fio.gg <- fio.gg + theme_bw()
fio.gg

fio.gg <- ggplot(data=fio.list$daily.df, aes(x=time, y=temperature))
fio.gg <- fio.gg + labs(y="Readings", x="Time", title="Daily Readings")
fio.gg <- fio.gg + geom_line(aes(y=humidity*100), color="green")
fio.gg <- fio.gg + geom_line(aes(y=temperatureMax), color="red")
fio.gg <- fio.gg + geom_line(aes(y=temperatureMin), color="red", linetype=2)
fio.gg <- fio.gg + geom_line(aes(y=dewPoint), color="blue")
fio.gg <- fio.gg + theme_bw()
fio.gg

Points, Polygons and Power Outages

2013-12-27 – 10:30
Posted in Data Visualization, DataVis, DataViz, Development, maine, maps, Programming, R
Comments (1)

Most of my free coding time has been spent tweaking a D3-based live power outage tracker for Central Maine Power customers (there’s also a woefully less-featured Shiny app for it, too). There is some R associated with the D3 vis, but it’s limited to a cron job that’s makes the CSV files for the sparklines in D3 vis by

reading historical outage data from a database of scraped readings I’ve been keeping
filling out the time series
reducing it to an hourly time series, and
trimming the data set to the last 30 days of records:

#!/usr/bin/Rscript
# running in a cron job so no spurious text output pls
options(warn=-1)
options(show.error.messages=FALSE)
 
suppressMessages(library(methods))
suppressMessages(library(zoo))
library(chron)
library(xts)
library(reshape2)
library(DBI)
library(RMySQL)
 
m <- dbDriver("MySQL");
con <- dbConnect(m, user='DBUSER', password='DBPASSWORD', host='localhost', dbname='DBNAME');
res <- dbSendQuery(con, "SELECT * FROM outage") # cld just pull the 2 fields I need
outages <- fetch(res, n = -1)
outages$ts <- as.POSIXct(gsub("\\:[0-9]+\\..*$","", outages$ts), format="%Y-%m-%d %H:%M")
 
# for each county we have data for
for (county in unique(outages$county)) {
 
  # get the 2 fields I need (shld prbly filter that in the SELECT)
  outage.raw <- outages[outages$county == county,c(1,4)]
 
  # make it a zoo object
  outage.zoo <- zoo(outage.raw$withoutpower, outage.raw$ts)
 
  # fill out the 15-minute readings
  complete.zoo <- merge(outage.zoo, zoo(, seq(start(outage.zoo), max(outages$ts), by="15 min")), all=TRUE)
  complete.zoo[is.na(complete.zoo)] <- 0
 
  # shrink to hourly and trim at 30 days
  hourly.zoo <- last(to.hourly(complete.zoo), "30 days")
 
  # crank out a CSV  
  df <- data.frame(hourly.zoo)
  df <- data.frame(ts=rownames(df), withoutPower=df$complete.zoo.High)
 
  write.csv(df, sprintf("OUTPOUT_LOCATION/%s.csv",county), row.names=FALSE)
 
}

I knew there were other power companies in Maine, but CMP is the largest, so I focused my attention on getting data from it. After hearing an outage update on MPBN I decided to see if Bangor Hydro Electric had scrape-able content and it turns out there was a lovely (well, as lovely as XML can be) XML file delivered on the page with this “meh” Google push-pin map:

The XML file is used to build the markers for the map and has marker tags that look like this:

<marker name="Exeter" outages="18" 
        lat="44.96218" lng="-69.12253" 
        streets="BUTTERS  RD;CHAMPEON  RD;GREENBUSH  RD;" 
        reflabels="87757-1;84329-1;85032-1;"/>

I’m really only tracking county-level data and BHE does not provide that, even in the huge table of street-level outages that you’ll see on that outage page. I decided to use R to aggregate the BHE data to the county level via the “point-in-polygon” method.

Getting Right To the Point

To perform the aggregation in R, I needed county-level polygons for Maine. I already had that thanks to the previous work, but I wanted to optimize the search process, so I took the US counties shapefile and used OGR from the GDAL (Geospatial Data Abstraction Library) suite to extract just the Maine county polygons:

ogr2ogr -f "ESRI Shapefile" \
        -where "STATE_NAME = 'MAINE'" maine.shp counties.shp

You can see both a reduction in file size and complexity via ogrinfo:

$ll *shp
-rwxr-xr-x@ 1 bob  staff  1517624 Jan  8  2010 counties.shp
-rw-r--r--  1 bob  staff    12588 Dec 26 23:03 maine.shp

$ ogrinfo -sql "SELECT COUNT(*) FROM counties" counties.shp
INFO: Open of 'counties.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: counties
Geometry: Polygon
Feature Count: 1
Layer SRS WKT:
(unknown)
COUNT_*: Integer (0.0)
OGRFeature(counties):0
  COUNT_* (Integer) = 3141

$ ogrinfo -sql "SELECT COUNT(*) FROM maine" maine.shp
INFO: Open of 'maine.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: maine
Geometry: Polygon
Feature Count: 1
Layer SRS WKT:
(unknown)
COUNT_*: Integer (0.0)
OGRFeature(maine):0
  COUNT_* (Integer) = 16

The conversion reduces the file size from 1.5MB to ~12K and shrinks the number of polygons from 3,141 to 16. The counties.shp and maine.shp shapefiles were built from U.S. census data and have scads more information that you might want to use (i.e. perhaps, for correlation with the outage info, though storms are the prime causal entity for the outages :-):

$ ogrinfo -al -so counties.shp
INFO: Open of 'counties.shp'
      using driver 'ESRI Shapefile' successful.
 
Layer name: counties
Geometry: Polygon
Feature Count: 3141
Extent: (-176.806138, 18.921786) - (-66.969271, 71.406235)
Layer SRS WKT:
GEOGCS["GCS_WGS_1984",
    DATUM["WGS_1984",
        SPHEROID["WGS_84",6378137.0,298.257223563]],
    PRIMEM["Greenwich",0.0],
    UNIT["Degree",0.0174532925199433]]
NAME: String (32.0)
STATE_NAME: String (25.0)
STATE_FIPS: String (2.0)
CNTY_FIPS: String (3.0)
FIPS: String (5.0)
POP2000: Integer (9.0)
POP2007: Integer (9.0)
POP00_SQMI: Real (10.1)
POP07_SQMI: Real (7.1)
WHITE: Integer (9.0)
BLACK: Integer (9.0)
AMERI_ES: Integer (9.0)
ASIAN: Integer (9.0)
HAWN_PI: Integer (9.0)
OTHER: Integer (9.0)
MULT_RACE: Integer (9.0)
HISPANIC: Integer (9.0)
MALES: Integer (9.0)
FEMALES: Integer (9.0)
AGE_UNDER5: Integer (9.0)
AGE_5_17: Integer (9.0)
AGE_18_21: Integer (9.0)
AGE_22_29: Integer (9.0)
AGE_30_39: Integer (9.0)
AGE_40_49: Integer (9.0)
AGE_50_64: Integer (9.0)
AGE_65_UP: Integer (9.0)
MED_AGE: Real (9.1)
MED_AGE_M: Real (9.1)
MED_AGE_F: Real (9.1)
HOUSEHOLDS: Integer (9.0)
AVE_HH_SZ: Real (9.2)
HSEHLD_1_M: Integer (9.0)
HSEHLD_1_F: Integer (9.0)
MARHH_CHD: Integer (9.0)
MARHH_NO_C: Integer (9.0)
MHH_CHILD: Integer (9.0)
FHH_CHILD: Integer (9.0)
FAMILIES: Integer (9.0)
AVE_FAM_SZ: Real (9.2)
HSE_UNITS: Integer (9.0)
VACANT: Integer (9.0)
OWNER_OCC: Integer (9.0)
RENTER_OCC: Integer (9.0)
NO_FARMS97: Real (11.0)
AVG_SIZE97: Real (11.0)
CROP_ACR97: Real (11.0)
AVG_SALE97: Real (7.2)
SQMI: Real (8.1)
OID: Integer (9.0)

With the new shapefile in hand, the basic workflow to get BHE outages at the county level is:

Read and parse the BHE outages XML file to get the lat/long pairs
Build a SpatialPoints object out of those pairs
Make a SpatialPolygonsDataFrame out of the reduced Maine counties shapefile
Overlay the points in the polygons and get the feature metadata intersection (including county)
Aggregate the outage data

The R code (below) does all that and is liberally commented. One has to appreciate how succinct the XML parsing is and the beautiful simplicity of the over() function (which does all the really hard work).

library(XML)
library(maptools)
library(sp)
library(plyr)
 
# Small script to get county-level outage info from Bangor Hydro
# Electric's town(-ish) level info
#
# BHE's outage google push-pin map is at
#   http://apps.bhe.com/about/outages/outage_map.cfm
 
# read BHE outage XML file that was intended for the google map
# yep. One. Line. #takethatpython
 
doc <- xmlTreeParse("http://apps.bhe.com/about/outages/outage_map.xml", 
                    useInternalNodes=TRUE)
 
# xmlToDataFrame() coughed up blood on that simple file, so we have to
# resort to menial labor to bend the XML to our will
 
doc.ls <- xmlToList(doc)
doc.attrs <- doc.ls$.attrs
doc.ls$.attrs <- NULL
 
# this does the data frame conversion magic, tho it winces a bit
 
suppressWarnings(doc.df <- data.frame(do.call(rbind, doc.ls), 
                                      stringsAsFactors=FALSE))
 
# need numbers for some of the columns (vs strings)
 
doc.df$outages <- as.numeric(doc.df$outages)
doc.df$lat <- as.numeric(doc.df$lat)
doc.df$lng <- as.numeric(doc.df$lng)
 
# SpatialPoints likes matrices, note that it's in LON, LAT order
# that always messes me up for some reason
 
doc.m <- as.matrix(doc.df[,c(4,3)])
doc.pts <- SpatialPoints(doc.m)
 
# I trimmed down the country-wide counties file from
#   http://www.baruch.cuny.edu/geoportal/data/esri/usa/census/counties.zip
# with
#   ogr2ogr -f "ESRI Shapefile" -where "STATE_NAME = 'MAINE'" maine.shp counties.shp
# to both save load time and reduce the number of iterations for over() later
 
counties <- readShapePoly("maine.shp", repair=TRUE, IDvar="NAME")
 
# So, all the above was pretty much just for this next line which does  
# the "is this point 'a' in polygon 'b' automagically for us. 
 
found.pts <- over(doc.pts, counties)
 
# steal the column we need (county name) and squirrel it away with outage count
 
doc.df$county <- found.pts$NAME
doc.sub <- doc.df[,c(2,7)]
 
# aggregate the result to get outage count by county
 
count(doc.sub, c("county"), wt_var="outages")
 
##      county freq
##1    Hancock 4440
##2  Penobscot  869
##3      Waldo   28
##4 Washington  545
##5       <NA>  328

Astute readers will notice unresolved points (the NAs). I suspect they are right on coastal boundaries that were probably missed in these simplified county polygons. We can see what they are by looking at the NA entries in the merged data frame:

doc.df[is.na(doc.df$county),c(1:4)]
           name outages      lat       lng
35    Deer Isle       1 44.22451 -68.67778
38   Harborside     166 44.34900 -68.81555
39     Sorrento      43 44.47341 -68.17723
62    Bucksport      71 44.57369 -68.79562
70    Penobscot      40 44.44552 -68.81780
78      Bernard       1 44.24119 -68.35585
79   Stonington       5 44.15619 -68.66669
80 Roque Bluffs       1 44.61286 -67.47971

But a map would be more useful for those not familiar with Maine geography/extents:

library(ggplot2)
 
ff = fortify(counties, region = "NAME")
 
missing <- doc.df[is.na(doc.df$county),]
 
gg <- ggplot(ff, aes(x = long, y = lat))
gg <- gg + geom_path(aes(group = group), size=0.15, fill="black")
gg <- gg + geom_point(data=missing, aes(x=lng, y=lat), 
                      color="#feb24c", size=3)
gg <- gg + coord_map(xlim=extendrange(range(missing$lng)), ylim=extendrange(range(missing$lat)))
gg <- gg + theme_bw()
gg <- gg + labs(x="", y="")
gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA),
                 panel.border = element_blank(),
                 panel.background =element_rect(fill = "transparent",colour = NA),
                 panel.grid = element_blank(),
                 axis.text = element_blank(),
                 axis.ticks = element_blank(),
                 legend.position="right",
                 legend.title=element_blank())
gg

The “zoom in” is done by taking and slightly extending the range of the extracted points via range() and extendrange(), reproduced below:

range(missing$lng)
[1] -68.81780 -67.47971
range(missing$lat)
[1] 44.15619 44.61286
 
extendrange(range(missing$lng))
[1] -68.88470 -67.41281
extendrange(range(missing$lat))
[1] 44.13336 44.63569

It turns out my suspicion was right, so to use this in “production” I’ll need a more accurate shapefile for Maine counties (which I have, but Descent is calling me, so it will have to wait for another day).

I’ll leave you with a non-Google push-pin map of outages that you can build upon (it needs some tweaking):

Plot_Zoom-2

gg <- ggplot(ff, aes(x = long, y = lat))
gg <- gg + geom_polygon(aes(group = group), size=0.15, fill="black", color="#7f7f7f")
gg <- gg + geom_point(data=doc.df, aes(x=lng, y=lat, alpha=outages, size=outages), 
                      color="#feb24c")
gg <- gg + coord_map(xlim=c(-71.5,-66.75), ylim=c(43,47.5))
gg <- gg + theme_bw()
gg <- gg + labs(x="", y="")
gg <- gg + theme(plot.background = element_rect(fill = "transparent",colour = NA),
                 panel.border = element_blank(),
                 panel.background =element_rect(fill = "transparent",colour = NA),
                 panel.grid = element_blank(),
                 axis.text = element_blank(),
                 axis.ticks = element_blank(),
                 legend.position="right",
                 legend.title=element_blank())
gg

You can find all the R code in one, compact gist.

Slopegraph Workbench/Workshop in D3

2013-12-02 – 16:50
Posted in Charts & Graphs, d3, Data Analysis, Data Visualization, DataVis, DataViz, Development, HTML5, Javascript, Open Source, slopegraph
Comments (2)

I’ve been getting a huge uptick in views of my Slopegraphs in Python post and I think it’s due to @edwardtufte’s recent slopegraph contest announcement.

The original Python code is crufty and a mess mostly due to the intermittent attention to it, wanting to reduce dependencies and hacking vs programming. I’ve been wanting to do a D3 version for a while, so I went a bit overboard once I learned of Mr Tufte’s challenge and made more of a “workbench” for making slopegraphs:

It’s all in D3/HTML5/javascrpt/CSS and requires no server-side components at all.

You can play with a live, alpha-quality version and check out the rest of the components on github.

It needs work, but it should be a good starting point for folks.

As my track record for “winning” things is scant, if you do end up using the code, passing on word of my upcoming book with @jayjacobs would be
#spiffy :-)

Mapping Power Outages in Maine Dynamically with Shiny/R

2013-11-27 – 15:13
Posted in data driven security, Data Visualization, DataVis, DataViz, Development, HTML5, R, shiny
Comments (6)

I decided to forego the D3 map mentioned in the previous post in favor of a Shiny one since I had 90% of the mapping code written.

I binned the ranges into three groups, changed the color over to something more pleasant (with RColorBrewer), added an interactive table for the counties with outage and have the elements updating every minute.

You can see the Live Outage Map over at it’s live Shiny server. Source is below or over at github if you’ve got blockers enabled.

Scraping Content From Google Groups

2013-09-25 – 11:57
Posted in Development, hacks, Javascript
Comments (1)

I was helping a friend out who wanted to build a word cloud from the text in Google Groups posts. If you’ve made any efforts to try to get content out of Google Groups you know that the only way to do so is to ensure you subscribe to the group posts via e-mail, then extract all those messages. If you don’t e-mail subscribe to a group, there really is no way to create an archive of the content.

After hacking around a bit and failing, I pulled up the mobile version of the group. You can do that for any Google Group by using the following URL and filling in GROUPNAME for the group you’re interested in: https://groups.google.com/forum/m/#!topic/GROUPNAME.

Then, you’ll need to navigate to a thread, use the double-down arrow to expand all the items in the thread, open up the JavaScript inspector on one of the posts and look for <div dir="ltr">. If that surrounds the post, the following hack will work. Google only seems to add this left-to-right attribute on newer groups, so if you have an older group you need to work with, you’ll need to figure out a different selector (which is coming up in a bit).

With all of the posts expanded, paste the following code into the JavaScript console:

nl = document.querySelectorAll('[dir=ltr]');
s="" ; 
for (i=0; i<nl.length; i++) {
  s = s + nl[i].textContent + "<br/><br/>";
}; 
nw = window.open(); 
nd = nw.document; 
nd.write(s); 
nd.close()

and hit return (I have it spaced out in the code above just for clarity; it will all fit on one line which makes it easier to execute in the console).

Untitled_and_input_text_not_updated_-_Google_Groups

You should get a new browser window (so, you may need to temporarily enable popups on Google Groups for this to work) with the text of all the posts in it. I only put the double <br/> tags in there for the purposes of this example. I just needed the raw text, but you can mark the posts any way you’d like.

You can tweak this hack in many ways to pull as much post metadata as you need since it’s all wrapped in heavily marked <div>s and the base technique should work in a GreaseMonkey or TamperMonkey userscript for those of you with time to code one up.

This hack only lessens the tedium a small amount. You still need to go topic by topic in the group if you want all the content. There’s probably a way to get that navigation automation coded into the script as well. Thankfully, I didn’t need to do that this time around.

If you have other ways to free Google Groups content, drop a note in the comments.