
Author Archives: hrbrmstr


Data Driven Security launches in February 2014. @jayjacobs & I have seen half of the book in PDF form so far, and it’s hard to believe this journey is almost over.

[Screenshot: the Data Driven Security Amazon sales-rank tracker]

We set up a live Amazon “sales rank” tracker over at the book’s web site and provided some Python and JavaScript code to show folks how to use the AWS API in conjunction with the dygraphs charting library to do the same for any ISBN. In the coming weeks, we’ll have a Google App Engine component you can clone to set up something similar without the need for your own server(s).
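For the morbidly curious, here’s a rough, untested R sketch of what the polling side amounts to against the Product Advertising API’s ItemLookup operation (the site’s actual tracker is the Python/JavaScript combo mentioned above; the function name and credentials below are illustrative placeholders):

library(RCurl)      # getURL()
library(XML)        # xmlParse(), xpathSApply()
library(digest)     # hmac()
library(base64enc)  # base64encode()

# illustrative sketch: fetch the current SalesRank for an ISBN
# (params are listed in the byte order the API signature requires;
#  key/secret/tag are placeholders for your own credentials)
sales.rank <- function(isbn, key, secret, tag) {
  params <- c(AWSAccessKeyId=key, AssociateTag=tag, IdType="ISBN",
              ItemId=isbn, Operation="ItemLookup",
              ResponseGroup="SalesRank", SearchIndex="Books",
              Service="AWSECommerceService",
              Timestamp=format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz="UTC"))
  qs <- paste(names(params), sapply(params, URLencode, reserved=TRUE),
              sep="=", collapse="&")
  # sign the canonical request string with the secret key
  to.sign <- paste("GET", "webservices.amazon.com", "/onca/xml", qs, sep="\n")
  sig <- base64encode(hmac(secret, to.sign, algo="sha256", raw=TRUE))
  url <- sprintf("http://webservices.amazon.com/onca/xml?%s&Signature=%s",
                 qs, URLencode(sig, reserved=TRUE))
  # the response XML carries a default namespace, hence local-name()
  doc <- xmlParse(getURL(url))
  as.numeric(xpathSApply(doc, "//*[local-name()='SalesRank']", xmlValue)[1])
}

Schedule that via cron, append each value to a CSV with a timestamp, and dygraphs will happily chart the result.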

Since @jayjacobs & I are down to the home stretch on Data Driven Security, I thought it would be interesting to do some post-writing pseudo-analyses of the book itself. I won’t have exact page or word counts for a bit, but I wanted to see how many R packages we ended up relying on for the examples in the chapters. It was fairly straightforward to run a grep for calls to library() or require() across all the source files, and I grouped the results into four categories: “analysis”, “core”, “munging” and “visualization”.
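If you want to run the same census on your own project, it boils down to something like this minimal sketch (“src” below is a stand-in for wherever your chapter sources live):

# tally library()/require() calls across a directory of R source files
src.files <- list.files("src", pattern="\\.R$", recursive=TRUE, full.names=TRUE)
pkgs <- unlist(lapply(src.files, function(f) {
  calls <- grep("(library|require)\\(", readLines(f, warn=FALSE), value=TRUE)
  gsub("['\"]", "", gsub(".*(library|require)\\(([^),]+).*", "\\2", calls))
}))
sort(table(pkgs), decreasing=TRUE)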

Since I <3 D3 circular dendrograms, I figured that would be a fun way to show the groupings. For those who dislike spinning their noggins around, a more traditional one is also presented. You'll need an SVG-capable browser to see the visualizations (the groupings they encode are listed below, followed by a sketch of the data shaping). Stay on the lookout for more "behind the scenes" posts.

[D3 circular & traditional dendrograms: the book’s R packages, grouped by category]

- visualization: aplpack, colorspace, ggdendro, ggplot2, ggthemes, gridExtra, igraph, maps, maptools, RColorBrewer, vcd
- analysis: binom, car, effects, portfolio, splines, scales, verisr, zoo, rgdal
- core: devtools, stats
- munging: bitops, gdata, reshape, plyr, rjson, RJSONIO
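In case you want to roll your own dendrogram, here’s a sketch (not the code behind the viz above) of shaping those groupings into the nested {name, children} JSON that D3’s hierarchy layouts expect:

library(RJSONIO)

# the four groupings from the package census
groups <- list(
  visualization=c("aplpack","colorspace","ggdendro","ggplot2","ggthemes",
                  "gridExtra","igraph","maps","maptools","RColorBrewer","vcd"),
  analysis=c("binom","car","effects","portfolio","splines","scales",
             "verisr","zoo","rgdal"),
  core=c("devtools","stats"),
  munging=c("bitops","gdata","reshape","plyr","rjson","RJSONIO")
)
# nest into {name, children} form and write it out for D3
tree <- list(name="packages",
             children=lapply(names(groups), function(g) {
               list(name=g,
                    children=lapply(groups[[g]], function(p) list(name=p)))
             }))
cat(toJSON(tree), file="packages.json")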

The #spiffy @dseverski gave me a posit via a tweet the other day, and I obliged shortly thereafter, but figured I’d toss a post up on the blog before heading to Strata.

To rephrase the tweet a bit, Mr. Severski asked me what alternate encoding I’d use for this grouped bar chart (larger version at the link in David’s tweet):

[The original grouped bar chart: NoSQL LinkedIn Skills Index]

I have almost as much disdain for grouped bar charts as I do for pie or donut charts, so appreciated the opportunity to try a makeover. However, I ran into an immediate problem: the usually #spiffy 451 Group folks did not include raw data. So, I reverse engineered the graph with WebPlotDigitizer, cleaned up the result and made a CSV from it. Then, I headed to RStudio with a plan in mind.

The old chart and data screamed faceted dot plot. The only trick necessary was to manually order the factor levels.

library(ggplot2)
 
# read in the CSV file
nosql.df <- read.csv("nosql.csv", header=TRUE)
# manually order facets
nosql.df$Database <- factor(nosql.df$Database,
                            levels=c("MongoDB","Cassandra","Redis","HBase","CouchDB",
                                     "Neo4j","Riak","MarkLogic","Couchbase","DynamoDB"))
 
# start the plot
gg <- ggplot(data=nosql.df, aes(x=Quarter, y=Index))
# use points, colored by Quarter
gg <- gg + geom_point(aes(color=Quarter), size=3)
# make strips by nosql db factor
gg <- gg + facet_grid(Database~.)
# rotate the plot
gg <- gg + coord_flip()
# get rid of most of the junk
gg <- gg + theme_bw()
# add a title
gg <- gg + labs(x="", title="NoSQL LinkedIn Skills Index\nSeptember 2013")
# get rid of the legend
gg <- gg + theme(legend.position = "none")
# blank out any horizontal strip text (the Database strips sit on the right)
gg <- gg + theme(strip.text.x = element_blank())
gg

The result is below in SVG form (install a proper browser if you can’t see it, or just run the R code :-)). I think it conveys the data in a much more informative way. How would you encode the data to make it more informative and accessible?

Full source & data are over at GitHub.




Don’t let the morons in Congress stop you from visiting our fair state during this beautiful, colorful autumn season. Below is a Google Maps view of known, open campgrounds that will let you experience the best our state has to offer this time of year.

Zoom, pan and click on the URLs for more campground info.

UPDATE: Added some extra visualization elements since this post went live. New select menu and hover text for individual job impact detail lines in the table.

I was reviewing RSS feeds when I came across this story about “ObamaCare Employer Mandate: A List Of Cuts To Work Hours, Jobs” over on Investors.com. Efficacy of the law notwithstanding, I thought it might be interesting to visualize the data since the folks over at Investors.com provided a handy spreadsheet that they seem to maintain pretty well (link is in the article).

The spreadsheet is organized by date and lists each state where the jobs were impacted along with the employer, employer type (public/private), reason and number of jobs impacted (if available). They also have links to news stories related to each entry.

My first thought was to compare impact across states by date, so I threw together a quick R script to build a faceted bar chart:

library(ggplot2)
library(plyr)
 
# Source for job impact data:
# http://news.investors.com/politics-obamacare/092513-669013-obamacare-employer-mandate-a-list-of-cuts-to-work-hours-jobs.htm
 
# read in the employer action data & give the columns friendlier names
emp.f <- read.csv("~/employers.csv", stringsAsFactors=FALSE)
colnames(emp.f) <- c("State","Employer","Type","Action","Jobs.Cut","Action.Date")
# impute missing job-cut counts with the median
emp.f$Jobs.Cut[is.na(emp.f$Jobs.Cut)] <- median(emp.f$Jobs.Cut, na.rm=TRUE)
# fix a source-data typo & use Gallup's naming for DC (explained below)
emp.f[emp.f$State=="Virgina", ]$State <- "Virginia"
emp.f[emp.f$State=="Washington DC", ]$State <- "District of Columbia"

Yes, they really spelled “Virginia” wrong, at least in the article text where I initially scraped the data from before I saw there was a spreadsheet available. Along with fixing “Virginia”, I also changed the name of “Washington DC” to “District of Columbia” for reasons you’ll see later on in this post. I’m finding it very helpful to do as much of the data cleanup in-code (R or Python) whenever possible since it makes the process far more repeatable than performing the same tasks by hand in a text editor and is essential if you know the data is going to change/expand.

After reading in the data, it was trivial to get a ggplot of the job impacts by state (click image for larger version):

# bar chart of job cuts by action date, faceted by state
p <- ggplot(emp.f, aes(x=Action.Date, y=Jobs.Cut))
p <- p + geom_bar(aes(fill=State), stat="identity")
p <- p + facet_wrap(~State)
p <- p + theme_bw()
# no legend (the facets already label the states); rotate the date labels
p <- p + theme(legend.position="none", axis.text.x = element_text(angle = 90))
p <- p + labs(x="Action Date", y="# Jobs Cut")
p

[Faceted bar chart: job cuts by state and action date]

That visualization provided some details, but I decided to expand the scope a bit and make an interactive “bubble chart” (since folks seem to love bubbles), with circle size relative to the total job cuts per state and circle color reflecting each state’s conservative/liberal leaning (i.e. ‘red’ vs ‘blue’), to see if there was any visual correlation by that attribute. I found the political data over at Gallup and went to work prepping the data with some additional R code. (NOTE: The Gallup data was the reason for the “DC” name change, since Gallup uses “District of Columbia” in their data set.)

# aggregate state data
emp.state.sum.df <- count(emp.f,c("State"),c("Jobs.Cut"))
colnames(emp.state.sum.df) <- c("State","Total.Jobs.Cut")
 
# get total (estimated) jobs impacted
total.jobs <- sum(emp.state.sum.df$Total.Jobs.Cut)
 
# Source for the red v blue state data:
# http://www.gallup.com/poll/125066/State-States.aspx
# read political leanings
red.blue.df <- read.csv("~/red-blue.csv", stringsAsFactors=FALSE)
 
# join the jobs and leaning data together
s <- join(emp.state.sum.df, red.blue.df, by="State")
 
# cheat and get leaning range for manual input into the datavis
leaning.range <- range(s$Conservative.Advantage)
 
# build the JSON data file. store state summary data for the bubbles, but also include
# the detail level for extra data for the viz
# need to clean up this file post-write and definitely run it through http://jsonlint.com/
jsfile = file("states.tmp","w")
by(s, 1:nrow(s), function(row) {
  writeLines(sprintf('      {"name": "%s", "size":%d, "leaning":%2.1f, "detail":[',row$State,row$Total.Jobs.Cut,row$Conservative.Advantage),jsfile)
  employers = emp.f[emp.f$State == row$State,]
  by(employers, 1:nrow(employers), function(emp.row) {
    writeLines(sprintf('          { "employer":"%s", "emptype":"%s", "actiondetail":"%s", "jobsimpacted":%d, "when":"%s"},',
                       emp.row$Employer, emp.row$Type, gsub('"',"'",emp.row$Action), emp.row$Jobs.Cut, emp.row$Action.Date),jsfile)
 
  })
  writeLines("]},\n",jsfile)   
})
close(jsfile)

I know the comments point out the need to tweak the resulting JSON a bit (mostly to remove “errant” commas, which is one of the annoying bits about JSON), but I wanted to re-emphasize the huge utility of JSONlint as it can save you a great deal of time debugging large amounts of gnarly JSON data.
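If you’d rather not hand-edit at all, the comma cleanup itself is scriptable. A quick, hedged sketch (the root “children” wrapper is my assumption — use whatever structure your D3 layout expects, and still give the output a pass through JSONlint):

# strip the trailing commas from the generated file & wrap it in a root object
raw <- paste(readLines("states.tmp"), collapse="\n")
raw <- gsub(",(\\s*)\\]", "\\1]", raw)  # commas before closing brackets
raw <- sub(",\\s*$", "", raw)           # the final trailing comma
writeLines(sprintf('{"name":"states","children":[\n%s\n]}', raw), "states.json")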

With the data prepped, I threw together a D3 visualization that shows the bubbles on the left and details by date and employer on the right.

[Screenshot: D3 bubble chart with per-state job impact details on the right]

Since it’s D3, there’s no need to put the source code in the blog post. Just do a “view-source” on the resulting visualization or poke around the GitHub repository. I will, however, point out a couple of useful/interesting bits from the code.

First, coloring circles by political leaning took exactly one line of code since D3 provides a means to map a range of values to colors:

var ramp = d3.scale.linear().domain([-21,36]).range(["#253494","#B30000"]);

I chose the colors with ColorBrewer but cheated (as I indicated in the R code) by pre-computing the range of the values for the palette. You can see the tiny District of Columbia’s very blue circle in the lower left of the field of circles. Hopefully Investors.com will maintain the data set and we can look at changes over a larger period of time.

Second, you get rudimentary “popups” for free via element “title” tags on the SVG circles, so no need for custom tooltip code:

node.append("title")
   .text(function(d) { return d.stateName + ": " + format(d.value) + " jobs impacted"; });

I could have tweaked the display a bit more, added links to the stories and provided a means to sort the “# Jobs” column by count or date, but I took enough time away from the book to scratch this visualization itch and it came out pretty much the way I wanted it to.

If you do hack at it and build something better (which should not be terribly difficult), drop a note in the comments or over at github.

I was helping a friend out who wanted to build a word cloud from the text in Google Groups posts. If you’ve ever tried to get content out of Google Groups, you know the only way to do so is to subscribe to the group posts via e-mail and then extract all those messages. If you don’t e-mail subscribe to a group, there really is no way to create an archive of the content.

After hacking around a bit and failing, I pulled up the mobile version of the group. You can do that for any Google Group by using the following URL and filling in GROUPNAME for the group you’re interested in: https://groups.google.com/forum/m/#!topic/GROUPNAME.

[Screenshot: a Google Groups thread in the mobile view]

Then, you’ll need to navigate to a thread, use the double-down arrow to expand all the items in the thread, open up the JavaScript inspector on one of the posts and look for <div dir="ltr">. If that surrounds the post, the following hack will work. Google only seems to add this left-to-right attribute on newer groups, so if you have an older group you need to work with, you’ll need to figure out a different selector (which is coming up in a bit).

With all of the posts expanded, paste the following code into the JavaScript console:

nl = document.querySelectorAll('[dir=ltr]');
s="" ; 
for (i=0; i<nl.length; i++) {
  s = s + nl[i].textContent + "<br/><br/>";
}; 
nw = window.open(); 
nd = nw.document; 
nd.write(s); 
nd.close()

and hit return (I have it spaced out in the code above just for clarity; it will all fit on one line which makes it easier to execute in the console).

[Screenshot: the extracted post text opened in a new browser window]

You should get a new browser window (you may need to temporarily enable popups on Google Groups for this to work) with the text of all the posts in it. I only put the double <br/> tags in there for the purposes of this example; I just needed the raw text, but you can mark up the posts any way you’d like.

You can tweak this hack in many ways to pull as much post metadata as you need since it’s all wrapped in heavily marked <div>s and the base technique should work in a GreaseMonkey or TamperMonkey userscript for those of you with time to code one up.

This hack only lessens the tedium a small amount. You still need to go topic by topic in the group if you want all the content. There’s probably a way to automate that navigation in the script as well. Thankfully, I didn’t need to do that this time around.

If you have other ways to free Google Groups content, drop a note in the comments.

Avast me hearRties! (ok, enough of the pirate speak in a blog post)

It wouldn’t be TLAPD (Talk Like A Pirate Day) without some modest code & idea pilfering from Mark Bulling & Simon Raper. While those mateys did a fine job hoisting up some R code (you really didn’t think I’d stop with the pirate speak, did you?) for their example, I took it one step furrrrther to build an animation of cumulative, yearly IRL pirate attacks from 1978 to the present. I found it a bit interesting to see how the hotspots shifted over time. Click on the graphic for the largeR version or I’ll make ye walk the plank!!

ARRRRRRR!

library(maps)
library(hexbin)
library(maptools)
library(ggplot2)
library(sp)
library(mapproj)
 
# piRate the data from the militaRy
download.file("http://msi.nga.mil/MSISiteContent/StaticFiles/Files/ASAM_shp.zip", destfile="ASAM_shp.zip")
unzip("ASAM_shp.zip")
 
# extRact the data fRame we need fRom the shape file
pirates.df <- as.data.frame(readShapePoints("ASAM 19 SEP 13")) # you may need to use a diffeRent name depending on d/l date
 
# get the woRld map data
world <- map_data("world")
world <- subset(world, region != "Antarctica") # inteRcouRse AntaRctica
 
# yeaRs we want to loop thRough
ends <- 1979:2013
 
# loop thRough, extRact data, build plot, save plot: BOOM
for (end in ends) {
  png(filename=sprintf("arrr-%d.png",end),width=500,height=250,bg="white") # change to 1000x500 or laRgeR
  dec.df <- pirates.df[pirates.df$DateOfOcc > "1970-01-01" & pirates.df$DateOfOcc < as.Date(sprintf("%s-12-31",end)),] 
  rng <- range(dec.df$DateOfOcc)
  p <- ggplot() 
  p <- p + geom_polygon(data=world, aes(x=long, y=lat, group=group), fill="gray40", colour="white")
  p <- p + stat_summary_hex(fun="length", data=dec.df, aes(x=coords.x1, y=coords.x2, z=coords.x2), alpha=0.8)
  p <- p + scale_fill_gradient(low="white", high="red", "Pirate Attacks recorded")
  p <- p + theme_bw() + labs(x="",y="", title=sprintf("Pirate Attacks From %s to %s",rng[1],rng[2]))
  p <- p + theme(panel.background = element_rect(fill='#A6BDDB', colour='white'))
  print(p)
  dev.off()
}
 
# requires imagemagick
system("convert -delay 45 -loop 0 arrr*g arrr500.gif")

As I was about to “buffer” a reference to a CSM article on “cyber war” (below), I paused to look for a “print view” or some other icon that would let me show the whole article as a single page vs the annoying multi-page layout view they normally are presented in (which also doesn’t work in my “hardened” Chrome config #meh).

A quick bout of googling got me to the CSM “text edition”. I found the same article there and noticed that there was some straightforward magic in the URL that could be applied to any article.

For any regular article URL like the following (you’ll need to prepend the “http://” part as I had to remove it so WP would stop interpreting and auto-linking the URLs):

www.csmonitor.com/USA/Military/2013/0915/Cyber-security-The-new-arms-race-for-a-new-front-line

Just insert the “layout/set/text/” bit right after the “www.csmonitor.com/” part:

www.csmonitor.com/layout/set/text/USA/Military/2013/0915/Cyber-security-The-new-arms-race-for-a-new-front-line

for an annoyance-free reading experience. You’ll lose any associated pictures, but you’ll be able to get all the real content in one distraction-free page. There’s probably a Chrome or Firefox extension that does this already, but my tolerance for add-ons is growing thin.
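If you find yourself doing this a lot, the rewrite is a one-liner; here’s a trivial R sketch (R being the usual hammer around here):

# rewrite a regular CSM article URL to its single-page text edition
csm.text.url <- function(url) {
  sub("^(https?://www\\.csmonitor\\.com/)", "\\1layout/set/text/", url)
}
csm.text.url("http://www.csmonitor.com/USA/Military/2013/0915/Cyber-security-The-new-arms-race-for-a-new-front-line")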