Category Archives: Data Analysis

UPDATE: Added some extra visualization elements since this post went live. New select menu and hover text for individual job impact detail lines in the table.

I was reviewing RSS feeds when I came across this story about “ObamaCare Employer Mandate: A List Of Cuts To Work Hours, Jobs” over on Investors.com. Efficacy of the law notwithstanding, I thought it might be interesting to visualize the data since the folks over at Investors.com provided a handy spreadsheet that they seem to maintain pretty well (link is in the article).

The spreadsheet is organized by date and lists each state where the jobs were impacted along with the employer, employer type (public/private), reason and number of jobs impacted (if available). They also have links to news stories related to each entry.

My first thought was to compare impact across states by date, so I threw together a quick R script to build a faceted bar chart:

library(ggplot2)
library(plyr)
 
# Source for job impact data:
# http://news.investors.com/politics-obamacare/092513-669013-obamacare-employer-mandate-a-list-of-cuts-to-work-hours-jobs.htm
 
emp.f <- read.csv("~/employers.csv", stringsAsFactors=FALSE)
colnames(emp.f) <- c("State","Employer","Type","Action","Jobs.Cut","Action.Date")
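# fill missing job counts with the median of the known values
emp.f$Jobs.Cut[is.na(emp.f$Jobs.Cut)] <- median(emp.f$Jobs.Cut, na.rm=TRUE)
# fix state names (indexing the column directly also works when nothing matches)
emp.f$State[emp.f$State=="Virgina"] <- "Virginia"
emp.f$State[emp.f$State=="Washington DC"] <- "District of Columbia"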

Yes, they really spelled “Virginia” wrong, at least in the article text where I initially scraped the data from before I saw there was a spreadsheet available. Along with fixing “Virginia”, I also changed the name of “Washington DC” to “District of Columbia” for reasons you’ll see later on in this post. I’m finding it very helpful to do as much of the data cleanup in-code (R or Python) whenever possible since it makes the process far more repeatable than performing the same tasks by hand in a text editor and is essential if you know the data is going to change/expand.
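If the list of corrections grows along with the spreadsheet, one way to keep the cleanup repeatable is to hold the fixes in a single lookup vector. This is only a minimal sketch: `state.fixes`, `fix.states()` and the sample values are hypothetical names made up for illustration and are not part of the original script.

```r
# hypothetical lookup of known-bad state names -> corrected names
state.fixes <- c("Virgina"       = "Virginia",
                 "Washington DC" = "District of Columbia")

# replace any state name found in the lookup; leave everything else untouched
fix.states <- function(states, fixes = state.fixes) {
  bad <- states %in% names(fixes)
  states[bad] <- unname(fixes[states[bad]])
  states
}

# made-up sample values just to exercise the function
sample.states <- c("Virgina", "Ohio", "Washington DC")
cleaned <- fix.states(sample.states)
```

With something like this in place, a new misspelling in the source data becomes a one-line addition to the lookup rather than another assignment statement.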

After reading in the data, it was trivial to get a ggplot of the job impacts by state (click image for larger version):

p <- ggplot(emp.f, aes(x=Action.Date, y=Jobs.Cut))
p <- p + geom_bar(aes(fill=State), stat="identity")
p <- p + facet_wrap(~State)
p <- p + theme_bw()
p <- p + theme(legend.position="none", axis.text.x = element_text(angle = 90))
p <- p + labs(x="Action Date", y="# Jobs Cut")
p

oc-facet

That visualization provided some details, but I decided to expand the scope a bit and wanted to make an interactive “bubble chart” (since folks seem to love bubbles) with circle size relative to the total job cuts per state and circle color reflecting the conservative/liberal leaning of each state (i.e. ‘red’ vs ‘blue’) to see if there was any visual correlation by that attribute. I found the political data over at Gallup and went to work prepping the data with some additional R code. (NOTE: The Gallup data was the reason for the “DC” name change since Gallup uses “District of Columbia” in their data set.)

# aggregate state data
emp.state.sum.df <- count(emp.f,c("State"),c("Jobs.Cut"))
colnames(emp.state.sum.df) <- c("State","Total.Jobs.Cut")
 
# get total (estimated) jobs impacted
total.jobs <- sum(emp.state.sum.df$Total.Jobs.Cut)
 
# Source for the red v blue state data:
# http://www.gallup.com/poll/125066/State-States.aspx
# read political leanings
red.blue.df <- read.csv("~/red-blue.csv", stringsAsFactors=FALSE)
 
# join the jobs and leaning data together
s <- join(emp.state.sum.df, red.blue.df, by="State")
 
# cheat and get leaning range for manual input into the datavis
leaning.range <- range(s$Conservative.Advantage)
 
# build the JSON data file. store state summary data for the bubbles, but also include
# the detail level for extra data for the viz
# need to clean up this file post-write and definitely run it through http://jsonlint.com/
jsfile = file("states.tmp","w")
by(s, 1:nrow(s), function(row) {
  writeLines(sprintf('      {"name": "%s", "size":%d, "leaning":%2.1f, "detail":[',row$State,row$Total.Jobs.Cut,row$Conservative.Advantage),jsfile)
  employers = emp.f[emp.f$State == row$State,]
  by(employers, 1:nrow(employers), function(emp.row) {
    writeLines(sprintf('          { "employer":"%s", "emptype":"%s", "actiondetail":"%s", "jobsimpacted":%d, "when":"%s"},',
                       emp.row$Employer, emp.row$Type, gsub('"',"'",emp.row$Action), emp.row$Jobs.Cut, emp.row$Action.Date),jsfile)
 
  })
  writeLines("]},\n",jsfile)   
})
close(jsfile)

I know the comments point out the need to tweak the resulting JSON a bit (mostly to remove “errant” commas, which is one of the annoying bits about JSON), but I wanted to re-emphasize the huge utility of JSONlint as it can save you a great deal of time debugging large amounts of gnarly JSON data.
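The errant commas come from writing a comma after every record, including the last one. A possible way around that, sketched here in base R under the assumption that each record has already been formatted as a JSON object string, is to let `paste(collapse=)` place commas only between items. The helper name `json.array()` and the sample employers are made up for illustration, and the sketch does not escape special characters in the values.

```r
# build a JSON array from a character vector of already-formatted JSON objects;
# paste(collapse=) only puts commas *between* items, so there is no trailing comma
json.array <- function(items) sprintf("[%s]", paste(items, collapse = ","))

# made-up rows standing in for the employer detail records
detail <- data.frame(employer     = c("Acme", "Globex"),
                     jobsimpacted = c(10L, 25L),
                     stringsAsFactors = FALSE)

# sprintf() is vectorized, so this yields one JSON object string per row
detail.json <- json.array(sprintf('{"employer":"%s","jobsimpacted":%d}',
                                  detail$employer, detail$jobsimpacted))

# nest the detail array inside a (hypothetical) state-level record
state.json <- sprintf('{"name":"%s","size":%d,"detail":%s}',
                      "Ohio", sum(detail$jobsimpacted), detail.json)
```

The same `json.array()` call could wrap the state-level strings too, which would remove the need for post-write cleanup, though running the result through JSONlint is still a cheap sanity check.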

With the data prepped, I threw together a D3 visualization that shows the bubbles on the left and details by date and employer on the right.

oc-snap.png

Since it’s D3, there’s no need to put the source code in the blog post. Just do a “view-source” on the resulting visualization or poke around the github repository. I will, however, point out a couple useful/interesting bits from the code.

First, coloring circles by political leaning took exactly one line of code since D3 provides a means to map a range of values to colors:

var ramp = d3.scale.linear().domain([-21,36]).range(["#253494","#B30000"]);

I chose the colors with Color Brewer but cheated (as I indicated in the R code) by pre-computing the range of the values for the palette. You can see the tiny District of Columbia’s very blue circle in the lower-left of the field of circles. Hopefully Investors.com will maintain the data set and we can look at changes over a larger period of time.
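For anyone who wants to preview the palette before touching the JavaScript, the same kind of value-to-color mapping can be approximated in R with `colorRamp()` from grDevices. This is a rough sketch rather than a port of the D3 scale: `leaning.colour()` is a hypothetical helper, and the domain is simply the pre-computed range hard-coded in the D3 line above.

```r
# linear colour ramp between the two Color Brewer endpoints used in the D3 code
ramp <- colorRamp(c("#253494", "#B30000"))

# map a leaning value onto the ramp, given the pre-computed domain
leaning.colour <- function(x, domain = c(-21, 36)) {
  t <- (x - domain[1]) / (domain[2] - domain[1])  # rescale the value to 0..1
  rgb(round(ramp(t)), maxColorValue = 255)        # round to whole 0..255 channels
}

# the two ends of the domain map back to the two endpoint colours
endpoints <- leaning.colour(c(-21, 36))
```

Values outside the domain would fall outside 0..1 and produce `NA` channels here, so clamp the input first if the data might exceed the pre-computed range.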

Second, you get rudimentary “popups” for free via element “title” tags on the SVG circles, so no need for custom tooltip code:

node.append("title")
   .text(function(d) { return d.stateName + ": " + format(d.value) + " jobs impacted"; });

I could have tweaked the display a bit more, added links to the stories and provided a means to sort the “# Jobs” column by count or date, but I took enough time away from the book to scratch this visualization itch and it came out pretty much the way I wanted it to.

If you do hack at it and build something better (which should not be terribly difficult), drop a note in the comments or over at github.

R lacks some of the more “utilitarian” features found in other scripting languages that were/are more geared—at least initially—towards systems administration. One of the most frustrating missing pieces for security data scientists is the lack of ability to perform basic IP address manipulations, including reverse DNS resolution (even though it has nsl() which is just glue to gethostbyname()!).

If you need to perform reverse resolution, the only two viable options available are to (a) pre-resolve a list of IP addresses or (b) whip up something in R that takes advantage of the ability to perform system calls. Yes, one could write a C/C++ API library that accesses native resolver routines, but that becomes a pain to maintain across platforms. System calls also create some cross-platform issues, but they are usually easier for the typical R user to overcome.

Assuming the dig command is available on your Linux, BSD or Mac OS system, it’s pretty trivial to pass in a list of IP addresses to a simple sapply() one-liner:

resolved = sapply(ips, function(x) system(sprintf("dig -x %s +short",x), intern=TRUE))

That works for fairly small lists of addresses, but doesn’t scale well to hundreds or thousands of addresses. (Also, @jayjacobs kinda hates my one-liners #true.)

A better way is to generate a batch query to dig, but the results will be synchronous, which could take A Very Long Time depending on the size of the list and types of results.
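A batch query of that sort might look like the sketch below, which relies on dig's `-f` batch mode, where each line of the file is written the way the arguments would appear on the command line. The function names `dig.batch.lines()` and `dig.batch()` are hypothetical, and the lookup itself is left commented out since it needs dig and a working resolver.

```r
# build one dig query per line, in the same form as the command-line arguments
dig.batch.lines <- function(ips) sprintf("-x %s +short", ips)

# write the queries to a temp file and hand the file to dig's batch mode (-f)
dig.batch <- function(ips) {
  batch.file <- tempfile(fileext = ".txt")
  on.exit(unlink(batch.file))   # keep the file system tidy
  writeLines(dig.batch.lines(ips), batch.file)
  system(sprintf("dig -f %s", shQuote(batch.file)), intern = TRUE)
}

# example of the lines that would be written for two addresses
query.lines <- dig.batch.lines(c("1.1.1.1", "2.3.4.99"))

# batch.result <- dig.batch(c("1.1.1.1", "2.3.4.99"))  # needs dig + network
```

One caveat with `+short` in batch mode: addresses that fail to resolve produce no output line, so the results cannot simply be matched back to the inputs by position.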

The best way (IMO) to tackle this problem is to perform an asynchronous batch query and post-process the results, which we can do with a little help from adns (which homebrew users can install with a quick “brew install adns”).

Once adns is installed, it’s just a matter of writing out a query list, performing the asynchronous batch lookup, parsing the results and re-integrating with the original IP list (which is necessary since errant or unresponsive reverse queries will not be returned by the adns system call).

#pretend this is A Very Long List of IPs
ip.list = c("1.1.1.1", "2.3.4.99", "1.1.1.2", "2.3.4.100", "70.196.7.32", 
  "146.160.21.171", "146.160.21.172", "146.160.21.186", "2.3.4.101", 
  "216.167.176.93", "1.1.1.3", "2.3.4.5", "2.3.4.88", "2.3.4.9", 
  "98.208.205.1", "24.34.218.80", "159.127.124.209", "70.196.198.151", 
  "70.192.72.48", "173.192.34.24", "65.214.243.208", "173.45.242.179", 
  "184.106.97.102", "198.61.171.18", "71.184.118.37", "70.215.200.159", 
  "184.107.87.105", "174.121.93.90", "172.17.96.139", "108.59.250.112", 
  "24.63.14.4")
 
# "ips" is a list of IP addresses
ip.to.host <- function(ips) {
  # save out a list of IP addresses in adnshost reverse query format
  # if you're going to be using this in "production", you *might*
  # want to consider using tempfile() #justsayin
  # sprintf() is vectorized, so no plyr (laply) dependency is needed here
  writeLines(sprintf("-i%s", ips), "/tmp/ips.in")
  # call adnshost with the file
  # requires adnshost :: http://www.chiark.greenend.org.uk/~ian/adns/
  system.output <- system("cat /tmp/ips.in | adnshost -f",intern=TRUE)
  # keep our file system tidy
  unlink("/tmp/ips.in")
  # clean up the result
  cleaned.result <- gsub("\\.in-addr\\.arpa","",system.output)
  # split the reply
  split.result <- strsplit(cleaned.result," PTR ")
  # make a data frame of the reply
  result.df <- data.frame(do.call(rbind, lapply(split.result, rbind)))
  colnames(result.df) <- c("IP","hostname")
  # reverse the octets in the IP address list
  result.df$IP <- sapply(as.character(result.df$IP), function(x) {
    y <- unlist(strsplit(x,"\\."))
    sprintf("%s.%s.%s.%s",y[4],y[3],y[2],y[1])
  })
  # fill errant lookups with "NA"
  final.result <- merge(ips,result.df,by.x="x",by.y="IP",all.x=TRUE)
  colnames(final.result) = c("IP","hostname")
  return(final.result)
}
 
resolved.df <- ip.to.host(ip.list)
head(resolved.df,n=10)
 
                IP                                   hostname
1          1.1.1.1                                       <NA>
2          1.1.1.2                                       <NA>
3          1.1.1.3                                       <NA>
4   108.59.250.112      vps-1068142-5314.manage.myhosting.com
5   146.160.21.171                                       <NA>
6   146.160.21.172                                       <NA>
7   146.160.21.186                                       <NA>
8  159.127.124.209                                       <NA>
9    172.17.96.139                                       <NA>
10   173.192.34.24 173.192.34.24-static.reverse.softlayer.com

If you wish to suppress adns error messages and any resultant R warnings, you can add an “ignore.stderr=TRUE” to the system() call and an “options(warn=-1)” to the function itself (remember to get/reset the current value). I kinda like leaving them in, though, as it shows progress is being made.
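The get/reset step can be sketched as follows. `quietly()` and `noisy()` are hypothetical names: `noisy()` is just a stand-in that warns the way a lookup function might, so the example runs without adns installed.

```r
# run a function with warnings silenced, then restore the previous setting
quietly <- function(f, ...) {
  old <- options(warn = -1)   # options() returns the previous values
  on.exit(options(old))       # reset even if f() stops with an error
  f(...)
}

# stand-in for ip.to.host(): emits a warning, then returns a value
noisy <- function(x) {
  warning("lookup failed")
  toupper(x)
}

before <- getOption("warn")
result <- quietly(noisy, "abc")
after  <- getOption("warn")   # same as 'before' once quietly() returns
```

Using `on.exit()` rather than resetting the option on the last line means the caller's warning level is restored even when the wrapped function fails partway through.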

Whether you end up using a one-liner or the asynchronous function, it would be a spiffy idea to set up a local caching server, such as Unbound, to speed up subsequent queries (because you will undoubtedly have subsequent queries unless your R scripts are perfect on the first go-round).

If you’ve solved the “efficient reverse DNS query problem” a different way in R, drop a note in the comments! I know quite a few folks who’d love to buy you a tasty beverage!

You can find similar, handy IP address and other security-oriented R code in our (me & @jayjacobs’) upcoming book on security data analysis and visualization.

Many thanks to all who attended the talk @jayjacobs & I gave at @Secure360 on Wednesday, May 15, 2013. As promised, here are the [slides](https://dl.dropboxusercontent.com/u/43553/Secure360-2013.pdf).

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can vi[sz] along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up and may have some duplicates to the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices.
– [Nathan Yau’s Flowing Data blog](http://flowingdata.com/) : Making visualization accessible, practical and repeatable.
– [Data Stories Podcast](http://datastori.es/) : Yes, you can learn much about data visualization from an audio podcast (@datastories)
– [storytelling with data](http://www.storytellingwithdata.com/) (@storywithdata) : Extremely practical blog by Cole Nussbaumer that will especially help folks “stuck” in Excel
– [Jay’s blog](http://beechplane.wordpress.com/)
– [My {this} blog](http://rud.is/b)

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [SecViz](http://secviz.org/) : Security-centric Visualization Site & Tools by @raffaelmarty
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

Many thanks to all who attended the talk @jayjacobs & I gave at @SOURCEconf on Thursday, April 18, 2013. As promised, here are the [slides](https://dl.dropboxusercontent.com/u/43553/SOURCE-Boston-2013.pdf) which should be much less washed out than the projector version :-)

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up and may have some duplicates to the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices.
– [Nathan Yau’s Flowing Data blog](http://flowingdata.com/) : Making visualization accessible, practical and repeatable.
– [Jay’s blog](http://beechplane.wordpress.com/)
– [My {this} blog](http://rud.is/b)

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [SecViz](http://secviz.org/) : Security-centric Visualization Site & Tools by @raffaelmarty
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

Earlier this evening, I somewhat half-heartedly challenged @jayjacobs that he & I should be generating one data visualization per day. I didn’t specify anything else (well, at least that I can disclose publicly, for now) but I think I’m going to try to formalize a bit of the ‘rules’ before I get some shut-eye:

– The datavis _must_ be posted to either one of our blogs (i.e. it and the data behind it must be shareable). Alternative: we set up a blog just for this.
– The data behind the datavis _must_ also be public data and either referenced or published with the datavis.
– The datavis _must_ answer a question. No random generation of numbers for a lazy bar chart, etc. Said question must be posed with the datavis and (hopefully) a bit of a short story/explanation with it and the datavis in the blog post.
– The datavis cannot be a blatant repeat of a previous datavis.
– The datavis does not have to break new ground (i.e. bar charts are #spiffy).
– The datavis _must_ be open for comments.
– There are no restrictions on what tools/languages can be used (i.e. Jay can cheat and make Tableau robocharts).
– There are no restrictions on the type of data being analyzed & visualized. Ideally, it will be from infosec or IT, but restricting it to those areas might make the challenge more difficult (the ‘public’ bit).

I’ll sleep on that and, perhaps, reduce the requirement to one per week after talking to Jay again this week.

Your thoughts & input on this challenge are most welcome in the comments, especially if you want to suggest things we can visualize. Also, feel free to volunteer to join us in this, once we start it.

Now that I’m back in the US and relaxing, I can take time for one final blather on the [PC Maker Slopegraph](http://rud.is/b/2013/04/11/ugly-tables-vs-slopegraphs-pc-maker-shipments-marketshare/) post from earlier in the week.

Slopegraphs can be quite long depending on the increment between discrete entries (as I’ve [pointed out before](http://rud.is/b/2012/06/07/slopegraphs-in-python-exploring-binningrounding/)). You either need to do binning/rounding, change the scale or add some annotations to the chart to make up for the length. Binning/rounding seems to make the most sense since you can add a table for precision but give the reader a good view of what you’re trying to communicate in as compact a fashion as possible.
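The rounding step itself is tiny in R. The sketch below uses made-up maker names and shipment figures purely for illustration (they are not the numbers from the PC maker post) and shows both roundings used for the charts: shipments expressed in thousands, and market share rounded to whole percentages.

```r
# made-up shipment figures (in units), purely for illustration
shipments <- c(MakerA = 15712400, MakerB = 11700300, MakerC = 9010900)

# shipments in thousands, rounded to the nearest thousand
shipments.k <- round(shipments / 1000)

# market share rounded to the nearest whole percent
share.pct <- round(100 * shipments / sum(shipments))
```

Rounded shares will not always add up to exactly 100, which is one more reason to keep a precise table alongside the compact chart.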

I’ll, again, ask the reader, what tells you which PC maker is on top: this table:

Screen-Shot-2013-04-10-at-6.14.56-PM

or these slopegraphs:

PC Maker Shipments (in thousands, rounded to nearest thousand)
pcs

PC Maker Market Share (rounded to nearest %)
pcs-share

Labeled properly, the rounding makes for a much more compact chart and doesn’t detract from the message, especially when I also include a much prettier, quick precision reference via Google Fusion Tables:

(though the column sort feature seems a bit wonky for some reason…).

Given that the focus was on the top individual maker, the “Other” category is just noise, so excluding it is also not an issue. If we wanted to tell the story of how well individual makers are performing against that bucket of contenders or point-players, then we would include that data and use other visualizations to communicate whatever conclusions we want to lead the reader to.

Remember, data tables and visualizations should be there to help tell your story, not detract from it or require real work/effort to grok (unless you’re breaking new visualization ground, which is most definitely not happening in the Ars story).

The basic technique of cybercrime statistics—measuring the incidence of a given phenomenon (DDoS, trojan, APT) as a percentage of overall population size—had entered the mainstream of cybersecurity thought only in the previous decade. Cybersecurity as a science was still in its infancy, as many of its basic principles had yet to be established.

At the same time, the scientific method rarely intersected with the development and testing of new detection & prevention regimens. When you read through that endless stream of quack cybercures published daily on the Internet and at conferences like RSA, what strikes you most is not that they are all, almost without exception, based on anecdotal or woefully inadequately small evidence. What’s striking is that they never apologize for the shortcoming. They never pause to say, “Of course, this is all based on anecdotal evidence, but hear me out.” There’s no shame in these claims, no awareness of the imperfection of the methods, precisely because it seems so eminently reasonable that the local observation of a handful of minuscule cases might serve up the silver bullet for cybercrime, if you look hard enough.


But, cybercrime couldn’t be studied in isolation. It was as much a product of the internet expansion as news and social media, where it was so uselessly anatomized. To understand the beast, you needed to think on the scale of the enterprise, from the hacker’s-eye view. You needed to look at the problem from the perspective of Henry Mayhew’s balloon. And you needed a way to persuade others to join you there.

Sadly, that’s not a modern story. It’s an adapted quote from chapter 4 (pp. 97-98, paperback) of The Ghost Map, by Steven Johnson, a book on the cholera epidemic of 1854.

I won’t ruin the book nor continue my attempt at analogy any further. Suffice it to say, you should read the book—if you haven’t already—and join me in calling out for the need for the John Snow of our cyber-time to arrive.

Given my [obsession](http://rud.is/b/?s=slopegraphs) with slopegraphs, I’m not sure how I missed this [post](http://neoformix.com/2013/ObesitySlopegraph.html) back in late February by @JeffClark that includes a very nicely executed [interactive slopegraph](http://neoformix.com/Projects/ObesitySlope/) on the global obesity problem. He used [Processing](http://processing.org/) & [Processing JS](http://processingjs.org/) to build the visualization and I think it illustrates how well animation/interaction and slopegraphs work together. It would be even spiffier if demographic & obesity details (perhaps even a dynamic map) were displayed as you select a country/region.

You can try your hand at an alternate implementation by [grabbing the data](https://www.google.com/fusiontables/DataSource?snapid=S887706wZVv) and playing along at home.