
Don’t let the morons in Congress stop you from visiting our fair state during this beautiful, colorful autumn season. Below is a Google Maps view of known, open campgrounds that will let you experience the best our state has to offer this time of year.

Zoom, pan and click on the URLs for more campground info.

UPDATE: Added some extra visualization elements since this post went live. New select menu and hover text for individual job impact detail lines in the table.

I was reviewing RSS feeds when I came across this story about “ObamaCare Employer Mandate: A List Of Cuts To Work Hours, Jobs” over on Investors.com. Efficacy of the law notwithstanding, I thought it might be interesting to visualize the data since the folks over at Investors.com provided a handy spreadsheet that they seem to maintain pretty well (link is in the article).

The spreadsheet is organized by date and lists each state where the jobs were impacted along with the employer, employer type (public/private), reason and number of jobs impacted (if available). They also have links to news stories related to each entry.

My first thought was to compare impact across states by date, so I threw together a quick R script to build a faceted bar chart:

library(ggplot2)
library(plyr)
 
# Source for job impact data:
# http://news.investors.com/politics-obamacare/092513-669013-obamacare-employer-mandate-a-list-of-cuts-to-work-hours-jobs.htm
 
emp.f <- read.csv("~/employers.csv", stringsAsFactors=FALSE)
colnames(emp.f) <- c("State","Employer","Type","Action","Jobs.Cut","Action.Date")
 
# impute missing job-cut counts with the median
emp.f$Jobs.Cut[is.na(emp.f$Jobs.Cut)] <- median(emp.f$Jobs.Cut, na.rm=TRUE)
 
# fix data-entry wrinkles in the source data
emp.f$State[emp.f$State=="Virgina"] <- "Virginia"
emp.f$State[emp.f$State=="Washington DC"] <- "District of Columbia"

Yes, they really spelled “Virginia” wrong, at least in the article text where I initially scraped the data from before I saw there was a spreadsheet available. Along with fixing “Virginia”, I also changed the name of “Washington DC” to “District of Columbia” for reasons you’ll see later on in this post. I’m finding it very helpful to do as much of the data cleanup in-code (R or Python) whenever possible since it makes the process far more repeatable than performing the same tasks by hand in a text editor and is essential if you know the data is going to change/expand.

After reading in the data, it was trivial to get a ggplot of the job impacts by state (click image for larger version):

p <- ggplot(emp.f, aes(x=Action.Date, y=Jobs.Cut))
p <- p + geom_bar(aes(fill=State), stat="identity")
p <- p + facet_wrap(~State)
p <- p + theme_bw()
p <- p + theme(legend.position="none", axis.text.x = element_text(angle = 90))
p <- p + labs(x="Action Date", y="# Jobs Cut")
p

(Figure: faceted bar chart of job cuts by state and action date)

That visualization provided some details, but I decided to expand the scope a bit and make an interactive “bubble chart” (since folks seem to love bubbles), with circle size relative to the total job cuts per state and circle color reflecting the conservative/liberal leaning of each state (i.e. ‘red’ vs ‘blue’), to see if there was any visual correlation with that attribute. I found the political data over at Gallup and went to work prepping the data with some additional R code. (NOTE: The Gallup data was the reason for the “DC” name change, since Gallup uses “District of Columbia” in their data set.)

# aggregate state data
emp.state.sum.df <- count(emp.f,c("State"),c("Jobs.Cut"))
colnames(emp.state.sum.df) <- c("State","Total.Jobs.Cut")
 
# get total (estimated) jobs impacted
total.jobs <- sum(emp.state.sum.df$Total.Jobs.Cut)
 
# Source for the red v blue state data:
# http://www.gallup.com/poll/125066/State-States.aspx
# read political leanings
red.blue.df <- read.csv("~/red-blue.csv", stringsAsFactors=FALSE)
 
# join the jobs and leaning data together
s <- join(emp.state.sum.df, red.blue.df, by="State")
 
# cheat and get leaning range for manual input into the datavis
leaning.range <- range(s$Conservative.Advantage)
 
# build the JSON data file. store state summary data for the bubbles, but also include
# the detail level for extra data for the viz
# need to clean up this file post-write and definitely run it through http://jsonlint.com/
jsfile = file("states.tmp","w")
by(s, 1:nrow(s), function(row) {
  writeLines(sprintf('      {"name": "%s", "size":%d, "leaning":%2.1f, "detail":[',row$State,row$Total.Jobs.Cut,row$Conservative.Advantage),jsfile)
  employers = emp.f[emp.f$State == row$State,]
  by(employers, 1:nrow(employers), function(emp.row) {
    writeLines(sprintf('          { "employer":"%s", "emptype":"%s", "actiondetail":"%s", "jobsimpacted":%d, "when":"%s"},',
                       emp.row$Employer, emp.row$Type, gsub('"',"'",emp.row$Action), emp.row$Jobs.Cut, emp.row$Action.Date),jsfile)
 
  })
  writeLines("]},\n",jsfile)   
})
close(jsfile)

I know the comments point out the need to tweak the resulting JSON a bit (mostly to remove “errant” commas, which is one of the annoying bits about JSON), but I wanted to re-emphasize the huge utility of JSONlint as it can save you a great deal of time debugging large amounts of gnarly JSON data.
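If you’d rather skip the manual cleanup entirely, a JSON serializer will emit valid JSON for you, with no errant commas to strip. Here’s a minimal sketch using the jsonlite package, which is an assumption on my part; any serializer (e.g., RJSONIO) works similarly:

# build the same structure as nested lists and let jsonlite handle serialization
library(jsonlite)
 
states.list <- lapply(split(emp.f, emp.f$State), function(employers) {
  st <- employers$State[1]
  list(name    = st,
       size    = sum(employers$Jobs.Cut),
       leaning = red.blue.df$Conservative.Advantage[red.blue.df$State == st],
       detail  = data.frame(employer     = employers$Employer,
                            emptype      = employers$Type,
                            actiondetail = gsub('"', "'", employers$Action),
                            jobsimpacted = employers$Jobs.Cut,
                            when         = employers$Action.Date))
})
 
writeLines(toJSON(unname(states.list), auto_unbox=TRUE, pretty=TRUE), "states.json")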

With the data prepped, I threw together a D3 visualization that shows the bubbles on the left and details by date and employer on the right.

(Figure: the D3 bubble chart, with state bubbles on the left and per-employer detail on the right)

Since it’s D3, there’s no need to put the source code in the blog post. Just do a “view-source” on the resulting visualization or poke around the github repository. I will, however, point out a couple of useful/interesting bits from the code.

First, coloring circles by political leaning took exactly one line of code since D3 provides a means to map a range of values to colors:

var ramp = d3.scale.linear().domain([-21,36]).range(["#253494","#B30000"]);

I chose the colors with Color Brewer but cheated (as I indicated in the R code) by pre-computing the range of the values for the palette. You can see the tiny District of Columbia’s very blue circle in the lower-left of the field of circles. Hopefully Investors.com will maintain the data set and we can look at changes over a larger period of time.

Second, you get rudimentary “popups” for free via element “title” tags on the SVG circles, so no need for custom tooltip code:

node.append("title")
   .text(function(d) { return d.stateName + ": " + format(d.value) + " jobs impacted"; });

I could have tweaked the display a bit more, added links to the stories and provided a means to sort the “# Jobs” column by count or date, but I took enough time away from the book to scratch this visualization itch and it came out pretty much the way I wanted it to.

If you do hack at it and build something better (which should not be terribly difficult), drop a note in the comments or over at github.

I was helping a friend out who wanted to build a word cloud from the text in Google Groups posts. If you’ve ever tried to get content out of Google Groups, you know the only way to do so is to subscribe to the group posts via e-mail and then extract all those messages. If you don’t e-mail subscribe to a group, there really is no way to create an archive of the content.

After hacking around a bit and failing, I pulled up the mobile version of the group. You can do that for any Google Group by using the following URL and filling in GROUPNAME for the group you’re interested in: https://groups.google.com/forum/m/#!topic/GROUPNAME.

(Figure: the mobile Google Groups topic view)

Then, you’ll need to navigate to a thread, use the double-down arrow to expand all the items in the thread, open the JavaScript inspector on one of the posts and look for <div dir="ltr">. If that surrounds the post, the following hack will work. Google only seems to add this left-to-right attribute on newer groups, so if you have an older group to work with, you’ll need to figure out a different selector (more on that in a bit).

With all of the posts expanded, paste the following code into the JavaScript console:

var nl = document.querySelectorAll('[dir=ltr]');
var s = "";
for (var i = 0; i < nl.length; i++) {
  s += nl[i].textContent + "<br/><br/>";
}
var nw = window.open();
var nd = nw.document;
nd.write(s);
nd.close();

and hit return (I have it spaced out in the code above just for clarity; it will all fit on one line which makes it easier to execute in the console).

(Figure: the extracted post text in a new browser window)

You should get a new browser window (so, you may need to temporarily enable popups on Google Groups for this to work) with the text of all the posts in it. I only put the double <br/> tags in there for the purposes of this example. I just needed the raw text, but you can mark the posts any way you’d like.

You can tweak this hack in many ways to pull as much post metadata as you need since it’s all wrapped in heavily marked <div>s and the base technique should work in a GreaseMonkey or TamperMonkey userscript for those of you with time to code one up.

This hack only lessens the tedium a small amount. You still need to go topic by topic in the group if you want all the content. There’s probably a way to get that navigation automation coded into the script as well. Thankfully, I didn’t need to do that this time around.
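Since the whole point was a word cloud, here’s roughly what the downstream step could look like once the extracted text is saved to a file. This is a sketch rather than what I actually ran; posts.txt is a placeholder name and the tm/wordcloud packages are assumptions:

library(tm)
library(wordcloud)
 
# read the extracted posts and build a cleaned-up corpus
posts <- readLines("posts.txt")
corpus <- Corpus(VectorSource(posts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
 
# tally term frequencies and plot the cloud
tdm <- TermDocumentMatrix(corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
wordcloud(names(freqs), freqs, max.words=100, random.order=FALSE)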

If you have other ways to free Google Groups content, drop a note in the comments.

Avast me hearRties! (ok, enough of the pirate speak in a blog post)

It wouldn’t be TLAPD without some modest code & idea pilfering from Mark Bulling & Simon Raper. While those mateys did a fine job hoisting up some R code (you really didn’t think I’d stop with the pirate speak, did you?) for their example, I took it one step furrrrther to build an animation of cumulative, yearly IRL pirate attacks from 1978 to the present. I found it a bit interesting to see how the hotspots shifted over time. Click on the graphic for the largeR version or I’ll make ye walk the plank!

(Animated GIF: cumulative pirate attacks by year, 1978–2013. ARRRRRRR!)

library(maps)
library(hexbin)
library(maptools)
library(ggplot2)
library(sp)
library(mapproj)
 
# piRate the data from the militaRy
download.file("http://msi.nga.mil/MSISiteContent/StaticFiles/Files/ASAM_shp.zip", destfile="ASAM_shp.zip")
unzip("ASAM_shp.zip")
 
# extRact the data fRame we need fRom the shape file
pirates.df <- as.data.frame(readShapePoints("ASAM 19 SEP 13")) # you may need to use a diffeRent name depending on d/l date
 
# get the woRld map data
world <- map_data("world")
world <- subset(world, region != "Antarctica") # inteRcouRse AntaRctica
 
# yeaRs we want to loop thRough
ends <- 1979:2013
 
# loop thRough, extRact data, build plot, save plot: BOOM
for (end in ends) {
  png(filename=sprintf("arrr-%d.png",end),width=500,height=250,bg="white") # change to 1000x500 or laRgeR
  dec.df <- pirates.df[pirates.df$DateOfOcc > "1970-01-01" & pirates.df$DateOfOcc < as.Date(sprintf("%s-12-31",end)),] 
  rng <- range(dec.df$DateOfOcc)
  p <- ggplot() 
  p <- p + geom_polygon(data=world, aes(x=long, y=lat, group=group), fill="gray40", colour="white")
  p <- p + stat_summary_hex(fun="length", data=dec.df, aes(x=coords.x1, y=coords.x2, z=coords.x2), alpha=0.8)
  p <- p + scale_fill_gradient(low="white", high="red", "Pirate Attacks recorded")
  p <- p + theme_bw() + labs(x="",y="", title=sprintf("Pirate Attacks From %s to %s",rng[1],rng[2]))
  p <- p + theme(panel.background = element_rect(fill='#A6BDDB', colour='white'))
  print(p)
  dev.off()
}
 
# requires imagemagick
system("convert -delay 45 -loop 0 arrr*g arrr500.gif")

As I was about to “buffer” a reference to a CSM article on “cyber war” (below), I paused to look for a “print view” or some other icon that would let me show the whole article as a single page vs the annoying multi-page layout it would normally be presented in (which also doesn’t work in my “hardened” Chrome config #meh).

A quick bout of googling got me to the CSM “text edition”. I found the same article there and noticed that there was some straightforward magic in the URL that could be applied to any article.

For any regular article URL like the following (you’ll need to prepend the “http://” part as I had to remove it so WP would stop interpreting and auto-linking the URLs):

www.csmonitor.com/USA/Military/2013/0915/Cyber-security-The-new-arms-race-for-a-new-front-line

Just insert the extra path bits (“layout/set/text/”) right after the “www.csmonitor.com/” part:

www.csmonitor.com/layout/set/text/USA/Military/2013/0915/Cyber-security-The-new-arms-race-for-a-new-front-line

for an annoyance-free reading experience. You’ll lose any associated pictures, but you’ll be able to get all the real content in one, distraction free page. There’s probably a Chrome or Firefox extension that does this already, but my tolerance for add-ons is growing thin.
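If you read CSM regularly, the rewrite is trivial to script. Here’s a minimal sketch in R (csm_text_url is a made-up helper name):

# rewrite a regular CSM article URL to its single-page text edition
csm_text_url <- function(url) {
  sub("^(https?://www\\.csmonitor\\.com/)", "\\1layout/set/text/", url)
}
 
csm_text_url("http://www.csmonitor.com/USA/Military/2013/0915/Cyber-security-The-new-arms-race-for-a-new-front-line")
# [1] "http://www.csmonitor.com/layout/set/text/USA/Military/2013/0915/Cyber-security-The-new-arms-race-for-a-new-front-line"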

Having received a couple follow-ups to the OS X notifications on RStudio Desktop for the Mac post, I was determined to find a quick hack to get remote notifications to OS X working from (at least) RStudio Server instances running on the same network. The hack turned out to be pretty straightforward: a combination of Growl and gntp-send.

To preempt detractors: Yes, Growl isn’t free for newer versions of OS X; but $3.99USD is worth skipping a frappuccino for if you desire this functionality (IMO). I’ve had Growl running since before there was an app store and it’s far more hackable than the Notification Center is (as demonstrated by this post).

You’ll need to configure Growl to listen for incoming connections (with an optional password, which is a good idea if you’re fairly mobile).

(Figure: Growl network preferences, with incoming connections enabled)

Plus, you’ll also want to decide whether you want Notification Center integration or have Growl work independently. My preference is integrated, but YMMV.

The gntp-send app should build without issues. I did it via source download / configure / make / make install on a recent-ish Ubuntu box.

Then it’s just a matter of sourcing a version of this function. You’ll most likely wish to make more of the items defaults. Windows users will need to tweak this a bit to get it working, but I’m betting most RStudio Server instances are running on Linux variants. I have it automatically setting the title and including which RStudio Server host the notice came from.

notify.gntp <- function(message, server, port=23053) {
  system(sprintf("/usr/local/bin/gntp-send -a 'RStudio Server' -s %s:%s '[running on %s]' '%s'",
                 server, port, as.character(Sys.info()["nodename"]), message),
         ignore.stdout=TRUE, ignore.stderr=TRUE, wait=FALSE)
}
 
# test it out 
WORKSTATION_RUNNING_GROWL = "10.0.1.5"
notify.gntp("ddply() finished", WORKSTATION_RUNNING_GROWL)

(Figure: the Growl notification as received on OS X)

You are going to need to do some additional work/coding if your IP address(es) change. I was going to hack something together that parses netstat output to guess which node was the originating OS X system, but it should be quick enough to swap in your client IP address by hand, especially since this hack is intended for long-running jobs.
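In the meantime, stashing the workstation address in an R option keeps the inevitable change down to one line (growl.host is a made-up option name):

# keep the Growl host in an option so only one line changes when your IP does
options(growl.host="10.0.1.5")
 
notify.gntp <- function(message, server=getOption("growl.host"), port=23053) {
  system(sprintf("/usr/local/bin/gntp-send -a 'RStudio Server' -s %s:%s '[running on %s]' '%s'",
                 server, port, as.character(Sys.info()["nodename"]), message),
         ignore.stdout=TRUE, ignore.stderr=TRUE, wait=FALSE)
}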

It’d be #spiffy if RStudio Server supported the browser notifications API and/or access to http header info from within the R session to make hacks like this easier or not necessary.

Thanks to a comment, I tweaked the data retrieval to ignore SSL cert errors. You can change that tweak back if you go through the pain of updating the SSL libraries on your Windows boxes (it doesn’t seem to be an issue on OS X/Linux).

I also changed the date routines to use as.POSIXlt instead of ISOdatetime as the latter seemed to cause issues for some folks.
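For the curious, here’s roughly what the two tweaks look like. This is a sketch only, assuming RCurl-based retrieval; the URL and timestamp are placeholders and the repo has the actual code:

library(RCurl)
 
# skip SSL certificate verification (works around stale cert stores on Windows)
dat <- getURL("https://example.com/data", ssl.verifypeer=FALSE)
 
# parse timestamps with as.POSIXlt() instead of ISOdatetime()
when <- as.POSIXlt("2013-09-16 12:00:00")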

All changes have been pushed to the github repo.

2013-09-16 UPDATE: I took suggestions from a couple comments, expanded the function a bit and stuck it in a gist. See this comment for details.

The data retrieval and computation operations are taking longer and longer as we start cranking through more security data and I’ll often let tasks run in the background whilst performing more mundane tasks or wasting time on Twitter. For folks using RStudio Desktop on a Mac, you can use the #spiffy terminal-notifier from Julien Blanchard (@julienXX) wrapped in a cozy little R function to alert you when your long-running jobs are complete.

After a quick “gem install terminal-notifier” you just need to add this notify() function to your R repertoire:

notify <- function(message="Operation complete") {
  system(sprintf("/usr/bin/terminal-notifier -title 'RStudio' -message '%s' -sender org.rstudio.RStudio -activate org.rstudio.RStudio",
                 message),
         ignore.stdout=TRUE, ignore.stderr=TRUE, wait=FALSE)
}

and add a call to it right after a potentially long-running operation to get a clickable notification right in the Notification Center:

system("sleep 10")
notify("long computation complete")

(Figure: the clickable OS X Notification Center alert)

I’m working on a way to send notifications from RStudio Server when using one of the standalone clients mentioned in a previous post, so stay tuned if you need that functionality as well.