Skip navigation

Category Archives: Data Analysis

Jay Jacobs (@jayjacobs)—my co-author of the soon-to-be-released book [Data-Driven Security](http://amzn.to/ddsec)—& I have been hard at work over at the book’s [sister-blog](http://dds.ec/blog) cranking out code to help security domain experts delve into the dark art of data science.

We’ve covered quite a bit of ground since January 1st, but I’m using this post to focus more on what we’ve produced using R, since that’s our go-to language.

Jay used the blog to do a [long-form answer](http://datadrivensecurity.info/blog/posts/2014/Jan/severski/) to a question asked by @dseverski on the [SIRA](http://societyinforisk.org) mailing list and I piled on by adding a [Shiny app](http://datadrivensecurity.info/blog/posts/2014/Jan/solvo-mediocris/) into the mix (both posts make for a pretty `#spiffy` introduction to expert-opinion risk analyses in R).

Jay continued by [releasing a new honeypot data set](http://datadrivensecurity.info/blog/data/2014/01/marx.gz) and corresponding two-part[[1](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part1/),[2](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part2/)] post series to jump start analyses on that data. (There’s a D3 geo-visualization stuck in-between those posts if you’re into that sort of thing).

I got it into my head to start a project to build a [password dump analytics tool](http://datadrivensecurity.info/blog/posts/2014/Feb/ripal/) in R (with **much** more coming soon on that, including a full-on R package + Shiny app combo) and also continue the discussion we started in the book on the need for the infusion of reproducible research principles and practices in the information security domain by building off of @sucuri_security’s [Darkleech botnet](http://datadrivensecurity.info/blog/posts/2014/Feb/reproducible-research-sucuri-darkleech-data/) research.

You can follow along at home with the blog via it’s [RSS feed](http://datadrivensecurity.info/blog/feeds/all.atom.xml) or via the @ddsecblog Twitter account. You can also **play** along at home if you feel you have something to contribute. It’s as simple as a github pull request and some really straightforward markdown. Take a look the blog’s [github repo](https://github.com/ddsbook/blog) and hit me up (@hrbrmstr) for details if you’ve got something to share.

Data-Driven-SecurityIf I made a Venn diagram of the cross-section of readers of this blog and the [Data Driven Security](http://dds.ec/) web sites it might be indistinguishable from a pure circle. However, just in case there are a few stragglers out there, I figured one more post on the fact that the new book by @jayjacobs & me is available _now_ in electronic form (not pre-order) wouldn’t hurt. The print book is still making it’s way from dead trees to store shelves and should be ready for the expected February 17th debut.

Here’s the list of links to e-tailers (man, I hate that term) who have it available for the various e-readers out there.

– [Amazon/Kindle](http://www.amazon.com/gp/product/B00I1Y7THY/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00I1Y7THY&linkCode=as2&tag=rudisdotnet-20)
– [B&N/Nook](http://www.barnesandnoble.com/w/data-driven-security-jay-jacobs/1117239036?ean=9781118793725)
– [Google Books](https://play.google.com/store/books/details?id=LQigAgAAQBAJ)
– [Kobo](http://store.kobobooks.com/en-US/ebook/data-driven-security)

If you happen to catch it out in the wild and not on this list, drop me (@hrbrmstr) a note `#pls`.

And, a huge thank you! to everyone for their kind accolades yesterday (esp to those who’ve purchased the book :-)

This tweet by @moorehn (who usually is a superb economic journalist) really bugged me:

I grabbed the raw data from EPI: (http://www.epi.org/files/2012/data-swa/jobs-data/Employment%20to%20population%20ratio%20(EPOPs).xls) and properly started the graph at 0 for the y-axis and also broke out men & women (since the Excel spreadsheet had the data). It’s a really different picture:

empToPop

I’m not saying employment is great right now, but it’s nowhere near a “ski jump”. So much for the state of data journalism at the start of 2014.

Here’s the hastily crafted R-code:

library(ggplot2)
library(ggthemes)
library(reshape2)
 
a <- read.csv("empvyear.csv")
b <- melt(a, id.vars="Year")
 
gg <- ggplot(data=b, aes(x=Year, y=value, group=variable))
gg <- gg + geom_line(aes(color=variable))
gg <- gg + ylim(0, 100)
gg <- gg + theme_economist()
gg <- gg + labs(x="Year", y="Employment as share of population (%)", 
                title="Employment-to-population ratio, age 25–54, 1975–2011")
gg <- gg + theme(legend.title = element_blank())
gg

And, here’s the data extracted from the Excel file:

Year,Men,Women
1975,89.0,51.0
1976,89.5,52.9
1977,90.1,54.8
1978,91.0,57.3
1979,91.1,59.0
1980,89.4,60.1
1981,89.0,61.2
1982,86.5,61.2
1983,86.1,62.0
1984,88.4,63.9
1985,88.7,65.3
1986,88.5,66.6
1987,89.0,68.2
1988,89.5,69.3
1989,89.9,70.4
1990,89.1,70.6
1991,87.5,70.1
1992,86.8,70.1
1993,87.0,70.4
1994,87.2,71.5
1995,87.6,72.2
1996,87.9,72.8
1997,88.4,73.5
1998,88.8,73.6
1999,89.0,74.1
2000,89.0,74.2
2001,87.9,73.4
2002,86.6,72.3
2003,85.9,72.0
2004,86.3,71.8
2005,86.9,72.0
2006,87.3,72.5
2007,87.5,72.5
2008,86.0,72.3
2009,81.5,70.2
2010,81,69.3
2011,81.4,69
dds-header-image

While you’re waiting for the [book](http://amzn.to/ddsec) by @jayjacobs & @hrbrmstr to hit the shelves, why not head on over to the inaugural post of the [Data Driven Security Blog](http://datadrivensecurity.info/blog) & give a listen to the first episode of the [Data Driven Security Podcast](http://datadrivensecurity.info/podcast).

The Data Driven Security Blog aspires to be your go-to “how to” and “what’s new?” resource for all facets of security data science and the Data Driven Security Podcast aims to expand on the blog and showcase all the practitioners leading the way in, well, data-driven security.

We start the blog with some D3 visualizations of our book’s [github logs](http://datadrivensecurity.info/blog/posts/2014/Jan/dds-github/), and the [inaugural podcast episode](http://datadrivensecurity.info/podcast/data-driven-security-episode-0.html) gives a bit of background on Jay, Bob, the book and the goal of the podcast.

Join in the discussion on @ddsecblog, @ddsecpodcast and on [Google+](https://plus.google.com/+DatadrivensecurityInfo1/posts). If you prefer e-mail updates over Twitter and/or RSS/Atom feeds, you can also sign up for our non-spammy [newsletter](http://datadrivensecurity.info/blog/pages/subscribe.html).

We welcome all feedback, suggestions and ideas, so don’t be shy!

I’ve been getting a huge uptick in views of my Slopegraphs in Python post and I think it’s due to @edwardtufte’s recent slopegraph contest announcement.

The original Python code is crufty and a mess mostly due to the intermittent attention to it, wanting to reduce dependencies and hacking vs programming. I’ve been wanting to do a D3 version for a while, so I went a bit overboard once I learned of Mr Tufte’s challenge and made more of a “workbench” for making slopegraphs:

D3_Slopegraph_Workshop

It’s all in D3/HTML5/javascrpt/CSS and requires no server-side components at all.

You can play with a live, alpha-quality version and check out the rest of the components on github.

It needs work, but it should be a good starting point for folks.

As my track record for “winning” things is scant, if you do end up using the code, passing on word of my upcoming book with @jayjacobs would be
#spiffy :-)

Data Driven Security launches in February 2014. @jayjacobs & I have seen half of the book in PDF form so far and it’s almost unbelievable that this journey is almost over.

Data_Driven_Security___Amazon_Sales_Rank_Tracker

We setup a live Amazon “sales rank” tracker over at the book’s web site and provided some Python and JavaScript code to show folks how use the AWS API in conjunction with the dygraphs charting library to do the same for any ISBN. In the coming weeks, we’ll have a Google App Engine component you can clone to setup something similar without the need for your own server(s).

Since @jayjacobs & I are down to the home stretch on Data Driven Security, I thought it would be interesting to do some post-writing pseudo-analyses of the book itself. I won’t have exact page or word counts for a bit, but I wanted to see how many R packages we ended up relying on for the examples in the chapters. It was fairly straightforward to run a grep for calls to library() or require() across all the source files, and I grouped the results into four categories: “analysis”, “core”, “munging” and “visualization”.

Since I <3 D3 circular dendrograms, I figured that would be a fun way to show the groupings. For those who dislike spinning your noggin around, a more traditional one is also presented. You'll need an SVG-capable browser to see the visualizations (below). Stay on the lookout for more "behind the scenes" posts.

visualizationaplpackcolorspaceggdendroggplot2ggthemesgridExtraigraphmapsmaptoolsRColorBrewervcdanalysisbinomcareffectsportfoliosplinesscalesverisrzoorgdalcoredevtoolsstatsmungingbitopsgdatareshapeplyrrjsonRJSONIO

visualizationaplpackcolorspaceggdendroggplot2ggthemesgridExtraigraphmapsmaptoolsRColorBrewervcdanalysisbinomcareffectsportfoliosplinesscalesverisrzoorgdalcoredevtoolsstatsmungingbitopsgdatareshapeplyrrjsonRJSONIO

The #spiffy @dseverski gave me this posit the other day:

and, I obliged shortly thereafter, but figured I’d toss a post up on the blog before heading to Strata.

To rephrase the tweet a bit, Mr. Severski asked me what alternate encoding I’d use for this grouped bar chart (larger version at the link in David’s tweet):

linkedinq31

I have almost as much disdain for grouped bar charts as I do for pie or donut charts, so appreciated the opportunity to try a makeover. However, I ran into an immediate problem: the usually #spiffy 451 Group folks did not include raw data. So, I reverse engineered the graph with WebPlotDigitizer, cleaned up the result and made a CSV from it. Then, I headed to RStudio with a plan in mind.

The old chart and data screamed faceted dot plot. The only trick necessary was to manually order the factor levels.

library(ggplot)
 
# read in the CSV file
nosql.df <- read.csv("nosql.csv", header=TRUE)
# manually order facets
nosql.df$Database <- factor(nosql.df$Database,
                            levels=c("MongoDB","Cassandra","Redis","HBase","CouchDB",
                                     "Neo4j","Riak","MarkLogic","Couchbase","DynamoDB"))
 
# start the plot
gg <- ggplot(data=nosql.df, aes(x=Quarter, y=Index))
# use points, colored by Quarter
gg <- gg + geom_point(aes(color=Quarter), size=3)
# make strips by nosql db factor
gg <- gg + facet_grid(Database~.)
# rotate the plot
gg <- gg + coord_flip()
# get rid of most of the junk
gg <- gg + theme_bw()
# add a title
gg <- gg + labs(x="", title="NoSQL LinkedIn Skills Index\nSeptember 2013")
# get rid of the legend
gg <- gg + theme(legend.position = "none")
# ensure the strip is gone
gg <- gg + theme(strip.text.x = element_blank())
gg

The result is below in SVG form (install a proper browser if you can’t see it, or run the R code :-) I think it conveys the data in a much more informative way. How would you encode the data to make it more informative and accessible?

Full source & data over at github.