Category Archives: Data Visualization

UPDATE: Added some extra visualization elements since this post went live. New select menu and hover text for individual job impact detail lines in the table.

I was reviewing RSS feeds when I came across this story about “ObamaCare Employer Mandate: A List Of Cuts To Work Hours, Jobs” over on Investors.com. Efficacy of the law notwithstanding, I thought it might be interesting to visualize the data since the folks over at Investors.com provided a handy spreadsheet that they seem to maintain pretty well (link is in the article).

The spreadsheet is organized by date and lists each state where the jobs were impacted along with the employer, employer type (public/private), reason and number of jobs impacted (if available). They also have links to news stories related to each entry.

My first thought was to compare impact across states by date, so I threw together a quick R script to build a faceted bar chart:

library(ggplot2)
library(plyr)
 
# Source for job impact data:
# http://news.investors.com/politics-obamacare/092513-669013-obamacare-employer-mandate-a-list-of-cuts-to-work-hours-jobs.htm
 
emp.f <- read.csv("~/employers.csv", stringsAsFactors=FALSE)
colnames(emp.f) <- c("State","Employer","Type","Action","Jobs.Cut","Action.Date")
# index the column directly so each assignment is a safe no-op when nothing matches
emp.f$Jobs.Cut[is.na(emp.f$Jobs.Cut)] <- median(emp.f$Jobs.Cut, na.rm=TRUE)
emp.f$State[emp.f$State=="Virgina"] <- "Virginia"
emp.f$State[emp.f$State=="Washington DC"] <- "District of Columbia"

Yes, they really spelled “Virginia” wrong, at least in the article text I initially scraped the data from before I saw there was a spreadsheet available. Along with fixing “Virginia”, I also changed “Washington DC” to “District of Columbia” for reasons you’ll see later on in this post. I’m finding it very helpful to do as much of the data cleanup as possible in code (R or Python): it makes the process far more repeatable than performing the same tasks by hand in a text editor, and it is essential if you know the data is going to change or expand.
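For readers following along outside R, the same cleanup steps (fix misspelled or inconsistent state names, fill missing job counts with the median) can be sketched in JavaScript. This is only an illustration, not the code behind the post: the row shape, the `cleanRows` and `median` helpers and the sample values are all assumptions.

```javascript
// Known bad spellings/names mapped to the names used by the other data set.
const STATE_FIXES = new Map([
  ["Virgina", "Virginia"],
  ["Washington DC", "District of Columbia"],
]);

// Median of a non-empty numeric array (mean of the two middle values if even).
function median(values) {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Return cleaned copies of the rows; the input rows are left untouched.
// Assumes at least one row has a known jobsCut value.
function cleanRows(rows) {
  const known = rows.map(r => r.jobsCut).filter(v => v !== null);
  const fill = median(known);
  return rows.map(r => ({
    ...r,
    state: STATE_FIXES.has(r.state) ? STATE_FIXES.get(r.state) : r.state,
    jobsCut: r.jobsCut === null ? fill : r.jobsCut,
  }));
}

// Made-up sample rows, only to exercise the helpers.
const raw = [
  { state: "Virgina", jobsCut: 40 },
  { state: "Washington DC", jobsCut: null },
  { state: "Ohio", jobsCut: 10 },
  { state: "Ohio", jobsCut: 100 },
];
const cleaned = cleanRows(raw);
```

Keeping the fixes in a lookup table means a newly discovered misspelling is a one-line change the next time the spreadsheet is refreshed.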

After reading in the data, it was trivial to get a ggplot of the job impacts by state (click image for larger version):

p <- ggplot(emp.f, aes(x=Action.Date, y=Jobs.Cut))
p <- p + geom_bar(aes(fill=State), stat="identity")
p <- p + facet_wrap(~State)
p <- p + theme_bw()
p <- p + theme(legend.position="none", axis.text.x = element_text(angle = 90))
p <- p + labs(x="Action Date", y="# Jobs Cut")
p

oc-facet

That visualization provided some details, but I decided to expand the scope a bit and wanted to make an interactive “bubble chart” (since folks seem to love bubbles) with circle size relative to the total job cuts per state and circle color reflecting the conservative/liberal leaning of each state (i.e. ‘red’ vs ‘blue’) to see if there was any visual correlation by that attribute. I found the political data over at Gallup and went to work prepping the data with some additional R code. (NOTE: The Gallup data was the reason for the “DC” name change since Gallup uses “District of Columbia” in their data set.)

# aggregate state data
emp.state.sum.df <- count(emp.f,c("State"),c("Jobs.Cut"))
colnames(emp.state.sum.df) <- c("State","Total.Jobs.Cut")
 
# get total (estimated) jobs impacted
total.jobs <- sum(emp.state.sum.df$Total.Jobs.Cut)
 
# Source for the red v blue state data:
# http://www.gallup.com/poll/125066/State-States.aspx
# read political leanings
red.blue.df <- read.csv("~/red-blue.csv", stringsAsFactors=FALSE)
 
# join the jobs and leaning data together
s <- join(emp.state.sum.df, red.blue.df, by="State")
 
# cheat and get leaning range for manual input into the datavis
leaning.range <- range(s$Conservative.Advantage)
 
# build the JSON data file. store state summary data for the bubbles, but also include
# the detail level for extra data for the viz
# need to clean up this file post-write and definitely run it through http://jsonlint.com/
jsfile = file("states.tmp","w")
by(s, 1:nrow(s), function(row) {
  writeLines(sprintf('      {"name": "%s", "size":%d, "leaning":%2.1f, "detail":[',row$State,row$Total.Jobs.Cut,row$Conservative.Advantage),jsfile)
  employers = emp.f[emp.f$State == row$State,]
  by(employers, 1:nrow(employers), function(emp.row) {
    writeLines(sprintf('          { "employer":"%s", "emptype":"%s", "actiondetail":"%s", "jobsimpacted":%d, "when":"%s"},',
                       emp.row$Employer, emp.row$Type, gsub('"',"'",emp.row$Action), emp.row$Jobs.Cut, emp.row$Action.Date),jsfile)
 
  })
  writeLines("]},\n",jsfile)   
})
close(jsfile)

I know the comments point out the need to tweak the resulting JSON a bit (mostly to remove “errant” commas, which is one of the annoying bits about JSON), but I wanted to re-emphasize the huge utility of JSONlint as it can save you a great deal of time debugging large amounts of gnarly JSON data.
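One way to sidestep the stray-comma cleanup entirely is to build the nested structure as plain objects and let a serializer emit the JSON. The sketch below does that in JavaScript with `JSON.stringify` rather than in the R script. Field names follow the R output, while `buildStates`, the sample rows and the leaning numbers are made up for illustration.

```javascript
// Group detail rows under their state and total the jobs per state.
// `leanings` is a plain object keyed by state name (hypothetical values below).
function buildStates(rows, leanings) {
  const byState = new Map();
  for (const r of rows) {
    if (!byState.has(r.state)) {
      byState.set(r.state, {
        name: r.state,
        size: 0,
        leaning: leanings[r.state],
        detail: [],
      });
    }
    const s = byState.get(r.state);
    s.size += r.jobs;
    s.detail.push({
      employer: r.employer,
      emptype: r.type,
      actiondetail: r.action, // embedded quotes are escaped by the serializer
      jobsimpacted: r.jobs,
      when: r.when,
    });
  }
  return Array.from(byState.values());
}

// Made-up rows shaped like the cleaned spreadsheet columns.
const rows = [
  { state: "Virginia", employer: "Example College", type: "public",
    action: 'Cut "part-time" hours', jobs: 100, when: "2013-08-01" },
  { state: "Virginia", employer: "Example Diner", type: "private",
    action: "Reduced shifts", jobs: 50, when: "2013-08-15" },
  { state: "District of Columbia", employer: "Example Shop", type: "private",
    action: "Reduced shifts", jobs: 25, when: "2013-09-01" },
];

const states = buildStates(rows, { "Virginia": 10.5, "District of Columbia": -21 });
const json = JSON.stringify(states, null, 2); // valid JSON, no trailing commas
```

Because the serializer handles commas, brackets and quote escaping, the output can be parsed straight away, with a linter kept as a sanity check rather than a repair step.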

With the data prepped, I threw together a D3 visualization that shows the bubbles on the left and details by date and employer on the right.

oc-snap.png

Since it’s D3, there’s no need to put the source code in the blog post. Just do a “view-source” on the resulting visualization or poke around the github repository. I will, however, point out a couple useful/interesting bits from the code.

First, coloring circles by political leaning took exactly one line of code since D3 provides a means to map a range of values to colors:

var ramp = d3.scale.linear().domain([-21,36]).range(["#253494","#B30000"]);

I chose the colors with Color Brewer but cheated (as I indicated in the R code) by pre-computing the range of the values for the palette. You can see the tiny District of Columbia’s very blue circle in the lower-left of the field of circles. Hopefully Investors.com will maintain the data set and we can look at changes over a larger period of time.
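To make the mapping less of a black box, here is a rough plain-JavaScript sketch of what a linear value-to-color ramp does: normalize the value within the domain, then interpolate each RGB channel. It is an approximation for illustration, not D3's implementation. `makeRamp`, `hexToRgb`, `rgbToHex` and the sample leaning values are assumptions, and the clamping is a choice made for this sketch. It also derives the domain from the data instead of hard-coding it.

```javascript
// "#RRGGBB" -> [r, g, b]
function hexToRgb(hex) {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// [r, g, b] -> lowercase "#rrggbb"
function rgbToHex(rgb) {
  return "#" + rgb.map(c => c.toString(16).padStart(2, "0")).join("");
}

// Build a function mapping a number in [domain[0], domain[1]] to a color
// between range[0] and range[1] by per-channel linear interpolation.
function makeRamp(domain, range) {
  const [d0, d1] = domain;
  const from = hexToRgb(range[0]);
  const to = hexToRgb(range[1]);
  return function (value) {
    // Clamp to [0, 1] so out-of-domain inputs cannot yield invalid channels.
    const t = Math.min(1, Math.max(0, (value - d0) / (d1 - d0)));
    return rgbToHex(from.map((c, i) => Math.round(c + (to[i] - c) * t)));
  };
}

// Hypothetical leaning values; only the endpoints match the domain in the post.
const leanings = [-21, 4.5, 12, 36];
const domain = [Math.min(...leanings), Math.max(...leanings)];
const sketchRamp = makeRamp(domain, ["#253494", "#B30000"]);
```

Computing the domain from the loaded data keeps the colors correct if the leaning numbers are ever refreshed, at the cost of a few extra lines.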

Second, you get rudimentary “popups” for free via element “title” tags on the SVG circles, so no need for custom tooltip code:

node.append("title")
   .text(function(d) { return d.stateName + ": " + format(d.value) + " jobs impacted"; });

I could have tweaked the display a bit more, added links to the stories and provided a means to sort the “# Jobs” column by count or date, but I took enough time away from the book to scratch this visualization itch and it came out pretty much the way I wanted it to.

If you do hack at it and build something better (which should not be terribly difficult), drop a note in the comments or over at github.

I’ve been doing a bit of graphing (with real, non-honeypot network data) as part of the research for the book I’m writing with @jayjacobs and thought one of the images was worth sharing (especially since it may not make it into the book :-).

Threat_View

This is a static screen capture of a D3 force-directed graph made with R, igraph & Vega of four ZeroAccess infected nodes desperately (each node tried ~200K times over a couple days) trying to break free of a firewall over the course of 11 days. The red nodes are unique destination IPs and purple ones are in the AlienVault IP Reputation database. Jay & I have read and blogged a great deal about ZeroAccess over the past year and finally had the chance to see a live slice of how pervasive (and, noisy) the network is even with just a view from a few infected nodes.

While the above graphic is the composite view of all 11 days, the following one is from just a single day with only two infected nodes trying to communicate out (this is a pure, hastily-crafted R/igraph image):

Two ZeroAccess Infected Nodes

There are some common destinations between the two, but each has a large list of unique ones. Even the best open IP reputation database on the planet only included a handful of the malicious endpoints, which means you really need to be looking at holistic behavior modeling vs port/destination alone if you’re trying to find egressing badness (but you hopefully already knew that). I filtered out legit destination traffic for these views.

We infosec folk eat up industry reports and most of us have no doubt already gobbled up @panda_security’s recently released [Q1 2013 Report](http://press.pandasecurity.com/wp-content/uploads/2010/05/PandaLabs-Quaterly-Report.pdf) [PDF]. It’s a good read (so go ahead and read it, we’ll still be here!) and I was really happy to see a nicely stylized chart in the early pages:

Screenshot_5_24_13_8_14_AM

However, I quickly became a #sadpanda when I happened across some explosive 3D pie charts later on. Rather than deride, I thought a re-imagining would be a better use of time and let you decide which visualizations both communicate better and are more appealing.

I chose to use @Datawrapper to showcase how easy it is to build and publish pleasing and informative visualizations without even leaving your browser.

Figure 4, Original:

Panda Labs Q1 2013 Report Fig 5 (Orig)

Figure 4, Alternative:

Figure 5, Original

Fig 4: New malware strains In Q1 2013, by Type (orig)

Figure 5, Alternative (horizontal vs vertical, just to mix it up a bit):

If the charts had been closer together in the report, I would have opted for vertical design for both and probably kept malware-type ordering vs sort by highest percentage.

How would you re-imagine the pie charts? Post a link to your creations in the comments and I’ll make sure they show up embedded with the post.

Many thanks to all who attended the talk @jayjacobs & I gave at @Secure360 on Wednesday, May 15, 2013. As promised, here are the [slides](https://dl.dropboxusercontent.com/u/43553/Secure360-2013.pdf).

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can vi[sz] along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up and may have some duplicates to the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices.
– [Nathan Yau’s Flowing Data blog](http://flowingdata.com/) : Making visualization accessible, practical and repeatable.
– [Data Stories Podcast](http://datastori.es/) : Yes, you can learn much about data visualization from an audio podcast (@datastories)
– [storytelling with data](http://www.storytellingwithdata.com/) (@storywithdata) : Extremely practical blog by Cole Nussbaumer that will especially help folks “stuck” in Excel
– [Jay’s blog](http://beechplane.wordpress.com/)
– [My {this} blog](http://rud.is/b)

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [SecViz](http://secviz.org/) : Security-centric Visualization Site & Tools by @raffaelmarty
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help artists and designers create powerful, interactive visuals and data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

@adammontville [posited](http://www.tripwire.com/state-of-security/it-security-data-protection/quick-thoughts-on-verizons-dbir-and-20-critical-security-control-mappings/) that Figure 15 from this year’s [DBIR](http://www.verizonenterprise.com/DBIR/2013/) could use some slopegraph love. As I am not one to back down from a reasonable challenge, I obliged.

Here’s the original chart (produced by @jayjacobs):

figure15-orig

and, here’s a _very_ _quick_ slopegraph version of it:

figure15-slope

You can click on both/either for a larger version. If I had more time, I could have made the slopegraph version nicer, but it conveys a story fairly well the way it is, especially with the highlight on the two biggest changes between 2008 & 2012.

Two problems with the modified visualization are (a) multi-column slopegraphs blend into a [parallel coordinate](http://www.juiceanalytics.com/writing/parallel-coordinates/) or plain old line graph pretty quickly (thus, reducing their slopegraph-y goodness); and, (b) the diversity of the year-over-year DBIR data set makes the comparison between years almost pointless (as the DBIR itself points out).

I also generated a proper/traditional slopegraph, comparing 2008 to 2012:

figure15-true-slope

The visualization is far more compact and, if the goal was to show the change between 2008 and 2012, it provides a much clearer view of what has and has not changed.

Many thanks to all who attended the talk @jayjacobs & I gave at @SOURCEconf on Thursday, April 18, 2013. As promised, here are the [slides](https://dl.dropboxusercontent.com/u/43553/SOURCE-Boston-2013.pdf) which should be much less washed out than the projector version :-)

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up and may have some duplicates to the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices.
– [Nathan Yau’s Flowing Data blog](http://flowingdata.com/) : Making visualization accessible, practical and repeatable.
– [Jay’s blog](http://beechplane.wordpress.com/)
– [My {this} blog](http://rud.is/b)

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [SecViz](http://secviz.org/) : Security-centric Visualization Site & Tools by @raffaelmarty
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help artists and designers create powerful, interactive visuals and data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

Earlier this evening, I somewhat half-heartedly challenged @jayjacobs that he & I should be generating one data visualization per day. I didn’t specify anything else (well, at least that I can disclose publicly, for now) but I think I’m going to try to formalize a bit of the ‘rules’ before I get some shut-eye:

– The datavis _must_ be posted to either one of our blogs (i.e. it and the data behind it must be shareable). Alternative: we set up a blog just for this.
– The data behind the datavis _must_ also be public data and either referenced or published with the datavis.
– The datavis _must_ answer a question. No random generation of numbers for a lazy bar chart, etc. Said question must be posed with the datavis and (hopefully) a bit of a short story/explanation with it and the datavis in the blog post.
– The datavis cannot be a blatant repeat of a previous datavis.
– The datavis does not have to break new ground (i.e. bar charts are #spiffy).
– The datavis _must_ be open for comments.
– There are no restrictions on what tools/languages can be used (i.e. Jay can cheat and make Tableau robocharts).
– There are no restrictions on the type of data being analyzed & visualized. Ideally, it will be from infosec or IT, but restricting it to those areas might make the challenge more difficult (the ‘public’ bit).

I’ll sleep on that and, perhaps, reduce the requirement to one per week after talking to Jay again this week.

Your thoughts & input on this challenge are most welcome in the comments, especially if you want to suggest things we can visualize. Also, feel free to volunteer to join us in this, once we start it.

Now that I’m back in the US and relaxing, I can take time for one final blather on the [PC Maker Slopegraph](http://rud.is/b/2013/04/11/ugly-tables-vs-slopegraphs-pc-maker-shipments-marketshare/) post from earlier in the week.

Slopegraphs can be quite long depending on the increment between discrete entries (as I’ve [pointed out before](http://rud.is/b/2012/06/07/slopegraphs-in-python-exploring-binningrounding/)). You either need to do binning/rounding, change the scale or add some annotations to the chart to make up for the length. Binning/rounding seems to make the most sense since you can add a table for precision but give the reader a good view of what you’re trying to communicate in as compact a fashion as possible.
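As a rough illustration of the binning/rounding idea, the JavaScript sketch below rounds each value to a chosen step and groups labels that land on the same rounded value, so they can share one row of a slopegraph column. The `binByRounding` helper and the maker names and numbers are hypothetical, not the figures from the chart.

```javascript
// Round each entry's value to the nearest `step` and collect the labels
// that share a rounded value. Bins are returned largest value first.
function binByRounding(entries, step) {
  const bins = new Map();
  for (const { label, value } of entries) {
    const rounded = Math.round(value / step) * step;
    if (!bins.has(rounded)) bins.set(rounded, []);
    bins.get(rounded).push(label);
  }
  return Array.from(bins.entries())
    .sort((a, b) => b[0] - a[0])
    .map(([value, labels]) => ({ value, labels }));
}

// Made-up shipment counts, only to show the grouping.
const shipments = [
  { label: "Maker A", value: 11712 },
  { label: "Maker B", value: 11620 },
  { label: "Maker C", value: 9010 },
  { label: "Maker D", value: 6150 },
];

// Makers A and B both round to 12000, so they end up in the same bin.
const binned = binByRounding(shipments, 1000);
```

A larger step gives a shorter chart with more shared rows; the precise values can still go in an accompanying table.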

I’ll, again, ask the reader, what tells you which PC maker is on top: this table:

Screen-Shot-2013-04-10-at-6.14.56-PM

or these slopegraphs:

PC Maker Shipments (in thousands, rounded to nearest thousand)
pcs

PC Maker Market Share (rounded to nearest %)
pcs-share

Labeled properly, the rounding makes for a much more compact chart and doesn’t detract from the message, especially when I also include a much prettier, quick precision reference via Google Fusion Tables:

(though the column sort feature seems a bit wonky for some reason…).

Given that the focus was on the top individual maker, the “Other” category is just noise, so excluding it is also not an issue. If we wanted to tell the story of how well individual makers are performing against that bucket of contenders or point-players, then we would include that data and use other visualizations to communicate whatever conclusions we want to lead the reader to.

Remember, data tables and visualizations should be there to help tell your story, not detract from it or require real work/effort to grok (unless you’re breaking new visualization ground, which is most definitely not happening in the Ars story).