
Author Archives: hrbrmstr


I had no intention of blogging this, but @jayjacobs convinced me otherwise. I was curious about the recent (end of March, 2014) [California earthquake](http://www.latimes.com/local/lanow/la-me-ln-an-estimated-17-million-people-felt-51-earthquake-in-california-20140331,0,2465821.story#axzz2xfGBteq0) “storm” and did a quick plot for “fun” and personal use with `ggmap`/`ggplot`.

I used data from the [Southern California Earthquake Center](http://www.data.scec.org/recent/recenteqs/Maps/Los_Angeles.html) (that I cleaned up a bit and that you can find [here](/dl/quakes.dat)) but would have used the USGS quake data if the site hadn’t been down when I tried to get it from there.

The code/process isn’t exactly rocket-science, but if you’re looking for a simple way to layer some data on a “real” map (vs handling shapefiles on your own) then this is a really compact/self-contained tutorial/example.

You can find the code & data over at [github](https://gist.github.com/hrbrmstr/9921419) as well.

There’s lots of ’splainin in the comments (which are probably easier to read on the github site), but drop a note in the comments here or on Twitter if anything needs further explanation. The graphic is SVG, so use a proper browser :-) or run the code in R if you can’t see it here.


(Figure: “Recent Earthquakes in CA & NV” quake points over a Los Angeles base map, with a stacked bar chart of daily quake counts by magnitude below; click for larger version)

library(ggplot2)
library(ggmap)
library(plyr)
library(grid)
library(gridExtra)
 
# read in cleaned up data
dat <- read.table("quakes.dat", header=TRUE, stringsAsFactors=FALSE)
 
# map decimal magnitudes into an integer range
dat$m <- cut(dat$MAG, c(0:10))
 
# convert to dates
dat$DATE <- as.Date(dat$DATE)
 
# so we can re-order the data frame
dat <- dat[order(dat$DATE),]
 
# not 100% necessary, but get just the numeric portion of the cut factor
dat$Magnitude <- factor(as.numeric(dat$m))
 
# sum up by date for the barplot
dat.sum <- count(dat, .(DATE, Magnitude))
 
# start the ggmap bit
# It's super-handy that it understands things like "Los Angeles" #spoffy
# I like the 'toner' version. Would also use a stamen map but I can't get 
# to it consistently from behind a proxy server
la <- get_map(location="Los Angeles", zoom=10, color="bw", maptype="toner")
 
# get base map layer
gg <- ggmap(la) 
 
# add points. Note that the plot will produce warnings for all points not in the
# lat/lon range of the base map layer. Also note that i'm encoding magnitude by
# size and color and using alpha for depth. because of the way the data is sorted
# the most recent quakes in the set should be on top
gg <- gg + geom_point(data=dat,
                      mapping=aes(x=LON, y=LAT, 
                                  size=MAG, fill=m, alpha=DEPTH), shape=21, color="black")
 
# this takes the magnitude domain and maps it to a better range of values (IMO)
gg <- gg + scale_size_continuous(range=c(1,15))
 
# this bit makes the right size color ramp. i like the reversed view better for this map
gg <- gg + scale_fill_manual(values=rev(terrain.colors(length(levels(dat$Magnitude)))))
gg <- gg + ggtitle("Recent Earthquakes in CA & NV")
 
# no need for a legend as the bars are pretty much the legend
gg <- gg + theme(legend.position="none")
 
 
# now for the bars. we work with the summarized data frame
gg.1 <- ggplot(dat.sum, aes(x=DATE, y=freq, group=Magnitude))
 
# normally, i dislike stacked bar charts, but this is one time i think they work well
gg.1 <- gg.1 + geom_bar(aes(fill=Magnitude), position="stack", stat="identity")
 
# fancy, schmanzy color mapping again
gg.1 <- gg.1 + scale_fill_manual(values=rev(terrain.colors(length(levels(dat$Magnitude)))))
 
# show the data source!
gg.1 <- gg.1 + labs(x="Data from: http://www.data.scec.org/recent/recenteqs/Maps/Los_Angeles.html", y="Quake Count")
gg.1 <- gg.1 + theme_bw() #stopthegray
 
# use grid.arrange to make the sizes work well
grid.arrange(gg, gg.1, nrow=2, ncol=1, heights=c(3,1))

Andy Kirk (@visualisingdata) & Lynn Cherny (@arnicas) tweeted about the Guardian Word Count service/archive site, lamenting the lack of visualizations.

This gave me a chance to bust out another [Shiny](http://www.rstudio.com/shiny/) app over on our [Data Driven Security](http://datadrivensecurity.info) [shiny server](http://shiny.dds.ec/guardian-words/).

I used my trusty “`Google-Drive-spreadsheet-IMPORTHTML-to-CSV`” workflow (you can access the automagically updated data [here](https://docs.google.com/spreadsheets/d/10CZhMhpFxTPWcLauam-ydKeFrdNgHEIehKznVMHFRM0/pubhtml)) to make the CSV that updates daily on the site and is referenced by the Shiny/R code.
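
In case you’d rather skip the Shiny app and poke at the numbers directly, the same published sheet can be pulled straight into R. Here’s a minimal sketch that assumes the sheet is also reachable via the `export?format=csv&gid=0` substitution described in the Sochi medals post below (the `gid` value is an assumption):

library(RCurl)

# CSV export of the auto-updating published sheet (gid=0 is an assumption)
guardian.csv.URL <- "https://docs.google.com/spreadsheets/d/10CZhMhpFxTPWcLauam-ydKeFrdNgHEIehKznVMHFRM0/export?format=csv&gid=0"

# read the daily Guardian word counts into a data frame
words <- read.csv(textConnection(getURL(guardian.csv.URL)),
                  stringsAsFactors=FALSE)

str(words)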

The code has been [gist-ified](https://gist.github.com/hrbrmstr/9570488), and I’ll be re-visiting it to refactor the `data.frame` creation bits and add some more charts as the data set gets larger.


(Don’t forget to take a peek at our new book, [Data-Driven Security](http://bit.ly/ddsec)!)

I shot a quick post over at the [Data Driven Security blog](http://bit.ly/1hyqJiT) explaining how to separate Twitter data gathering from R code via the Ruby `t` ([github repo](https://github.com/sferik/t)) command. Using `t` frees R code from having to be a Twitter processor and lets the analyst focus on analysis and visualization, plus you can use `t` as a substitute for Twitter GUIs if you’d rather play at the command-line:

$ t timeline ddsecblog
   @DDSecBlog
   Monitoring Credential Dumps Plus Using Twitter As a Data Source http://t.co/ThYbjRI9Za
 
   @DDSecBlog
   Nice intro to R + stats // Data Analysis and Statistical Inference free @datacamp_com course
   http://t.co/FC44FF9DSp
 
   @DDSecBlog
   Very accessible paper & cool approach to detection // Nazca: Detecting Malware Distribution in
   Large-Scale Networks http://t.co/fqrSaFvUK2
 
   @DDSecBlog
   Start of a new series by new contributing blogger @spttnnh! // @AlienVault rep db Longitudinal
   Study Part 1 : http://t.co/XM7m4zP0tr
 
   ...

The DDSec post shows how to mine the well-formatted output from the @dumpmon Twitter bot to visualize dump trends over time, and has the code in-line and over at the [DDSec github repo](https://github.com/ddsbook/blog/blob/master/extra/src/R/dumpmon.R) [R].
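
If you’d rather sketch something similar without pulling the repo, the rough shape of the idea looks like this (a hedged sketch, not the DDSec post’s actual code): dump @dumpmon’s timeline to CSV with `t` and tally tweets per day in R. The column names below assume `t`’s standard CSV header (`ID`, `Posted at`, `Screen name`, `Text`); check your output and adjust if yours differs.

# at the shell (flags assume current `t` gem behavior):
#   t timeline dumpmon --csv --number 1000 > dumpmon.csv

library(plyr)
library(ggplot2)

# "Posted at" becomes "Posted.at" once read.csv() fixes up the names
dumps <- read.csv("dumpmon.csv", stringsAsFactors=FALSE)

# grab just the date portion of the timestamp
dumps$day <- as.Date(dumps$Posted.at)

# tweets (i.e. dump reports) per day
daily <- count(dumps, "day")

gg <- ggplot(daily, aes(x=day, y=freq))
gg <- gg + geom_line()
gg <- gg + labs(x=NULL, y="@dumpmon tweets/day")
gg <- gg + theme_bw()
gg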

I’m posting this mostly to show how to:

– use the Google spreadsheet data-munging “hack” from the [previous post](http://rud.is/b/2014/02/11/live-google-spreadsheet-for-keeping-track-of-sochi-medals/) in a Shiny context
– include it seamlessly into a web page, and
– run it locally without a great deal of wrangling

The code for the app is [in this gist](https://gist.github.com/hrbrmstr/8949172). It is unsurprisingly just like [some spiffy other code](http://www.r-bloggers.com/winter-olympic-medal-standings-presented-by-r/) you’ve seen, apart from my aesthetic choices (Sochi blue! lines+dots! and current rankings next to country names).

I won’t regurgitate the code here since it’s just as easy to view on [github](https://gist.github.com/hrbrmstr/8949172). You’re seeing the live results of the app below (unless you’ve been more conservative than most folks with your browser security settings),

but the app is actually hosted over at [Data Driven Security](http://shiny.dds.ec/sochi2014/), a blog and (woefully underpowered so reload if it coughs up blood, pls) Shiny server that I run with @jayjacobs. It appears in this WordPress post with the help of an `IFRAME`. It’s essentially the same technique the RStudio/Shiny folks use in many of their own examples.

The app uses [bootstrapPage()](http://www.rdocumentation.org/packages/shiny/functions/bootstrapPage) to help make a more responsive layout which will react nicely in an `IFRAME` setting (since you won’t know the width of the browser area you’re trying to fit the Shiny output into).

In the `ui.R` file, I have the [plotOutput()](http://www.rdocumentation.org/packages/shiny/functions/plotOutput) configured to scale to 100% of container width:

plotOutput("medalsPlot", width="100%")

and then create a seamless `IFRAME` that also sizes to max-width:

<iframe src="http://shiny.dds.ec/sochi2014/" 
        style="max-width:100%" 
        width="100%"
        height="500px"
        scrolling="no" 
        frameborder="no" 
        seamless="seamless">
</iframe>
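
For reference, a stripped-down `bootstrapPage()`-style `ui.R` that pairs with that `IFRAME` might look something like this sketch (only `medalsPlot` comes from the snippet above; the heading and height are illustrative):

library(shiny)

shinyUI(bootstrapPage(

  # small heading for the embedded view
  h4("Sochi 2014 Medal Standings"),

  # let the plot track the width of whatever container it ends up in
  plotOutput("medalsPlot", width="100%", height="500px")

))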

The *really cool* part (IMO) about many Shiny apps is that you don’t need to rely on the external server to work with the visualization/output. Provided that:

– the authors have coded their app to support local execution…
– and presented the necessary `ui.R`, `server.R`, `global.R`, HTML/CSS & data files either as a github gist or a zip/gz/tar.gz file…
– and **you** have the necessary libraries installed

then, you can start the app with a simple [Rscript](http://www.rdocumentation.org/packages/utils/functions/Rscript) one-liner:

Rscript -e "shiny::runGist(8949172, launch.browser=TRUE)"

or

Rscript -e "shiny::runUrl('http://dds.ec/apps/sochi2014.tar.gz', launch.browser=TRUE)"

There is *some* danger doing this if you haven’t read through the R code prior, since it’s possible to stick some fairly malicious operations in an R script (hey, I’m an infosec professional, so we’re always paranoid :-). But, if you stick with using a gist and do examine the code, you should be fine.

The “medals” R post by [TRInker](http://trinkerrstuff.wordpress.com/2014/02/09/sochi-olympic-medals-2/) (re-blogged by [Revolutions](http://blog.revolutionanalytics.com/2014/02/winter-olympic-medal-standings-presented-by-r.html)) was spiffy and a live example of why there’s no point in not publishing raw data.

You don’t need to have R (or any other language) do the scraping, though. The “`IMPORTHTML`” function (yes, function names seem to be ALL CAPS now over at Google Drive) in Google Drive Spreadsheets can easily do the scraping with just a simple:

=IMPORTHTML("http://www.sochi2014.com/en/medal-standings","table",0)

that will refresh on demand and every hour.

Here’s a [live URL](https://docs.google.com/spreadsheets/d/1Al7I7nS0BP50IfThs55OKv5UPI9u-ctZgZRyDQma_G8/export?format=csv&gid=0) that will give back a CSV of the results which can easily be used in R thusly:

library(RCurl)
 
sochi.medals.URL = "https://docs.google.com/spreadsheets/d/1Al7I7nS0BP50IfThs55OKv5UPI9u-ctZgZRyDQma_G8/export?format=csv&gid=0"
medals <- read.csv(textConnection(getURL(sochi.medals.URL)), 
                   stringsAsFactors = FALSE)
 
str(medals)
 
'data.frame':  89 obs. of  6 variables:
$ Rank   : chr  "1" "2" "3" "4" ...
$ Country: chr  "Norway" "Canada" "Netherlands" "United States" ...
$ Gold   : int  4 4 3 2 2 1 1 1 1 1 ...
$ Silver : int  3 3 2 1 0 3 2 0 0 0 ...
$ Bronze : int  4 2 3 3 0 3 0 1 0 0 ...
$ Total  : int  11 9 8 6 2 7 3 2 1 1 ...
 
print(medals)
 
   Rank                                   Country Gold Silver Bronze Total
1     1                                    Norway    4      3      4    11
2     2                                    Canada    4      3      2     9
3     3                               Netherlands    3      2      3     8
4     4                             United States    2      1      3     6
5     5                                   Germany    2      0      0     2
6     6                              Russian Fed.    1      3      3     7
7     7                                   Austria    1      2      0     3
8     8                                    France    1      0      1     2
9    =9                                   Belarus    1      0      0     1
10   =9                                     Korea    1      0      0     1
11   =9                                    Poland    1      0      0     1
12   =9                                  Slovakia    1      0      0     1
13   =9                               Switzerland    1      0      0     1
14   14                                    Sweden    0      3      1     4
15   15                            Czech Republic    0      2      1     3
16   16                                  Slovenia    0      1      2     3
17   17                                     Italy    0      1      1     2
18  =18                                     China    0      1      0     1
19  =18                                   Finland    0      1      0     1
20  =20                             Great Britain    0      0      1     1
21  =20                                   Ukraine    0      0      1     1
22    -                                   Albania    0      0      0     0
23    -                                   Andorra    0      0      0     0
24    -                                 Argentina    0      0      0     0
25    -                                   Armenia    0      0      0     0
...
87    -                             Virgin Isl, B    0      0      0     0
88    -                            Virgin Isl, US    0      0      0     0
89    -                                  Zimbabwe    0      0      0     0

This frees you up from dealing with the scraping and lets you focus solely on the data.
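
As a quick, hedged example of that (not from the original post): drop the medal-less countries and chart the totals, picking up right where the `read.csv()` above leaves off.

library(ggplot2)

# keep only countries that have actually medaled
winners <- medals[medals$Total > 0, ]

# order countries by total medal count so the chart reads top-down
winners$Country <- reorder(winners$Country, winners$Total)

gg <- ggplot(winners, aes(x=Country, y=Total))
gg <- gg + geom_bar(stat="identity", fill="#0099cc")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y="Total medals (so far)")
gg <- gg + theme_bw()
gg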

You can set this up in your own Google Docs as well; just make sure to publish the spreadsheet to the web (with ‘everyone’ read permissions), strip off the `pubhtml` at the end of the published URL and add `export?format=csv&gid=0` in its place.
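
In R terms, that URL surgery is just a string substitution; here’s a tiny sketch with a placeholder spreadsheet key:

# the "pubhtml" URL you get from the Google Docs publish dialog
# (YOUR_SHEET_KEY is a placeholder, not a real spreadsheet)
pub.url <- "https://docs.google.com/spreadsheets/d/YOUR_SHEET_KEY/pubhtml"

# swap the trailing "pubhtml" for the CSV export endpoint
csv.url <- sub("pubhtml$", "export?format=csv&gid=0", pub.url)

csv.url
# [1] "https://docs.google.com/spreadsheets/d/YOUR_SHEET_KEY/export?format=csv&gid=0"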

Jay Jacobs (@jayjacobs)—my co-author of the soon-to-be-released book [Data-Driven Security](http://amzn.to/ddsec)—& I have been hard at work over at the book’s [sister-blog](http://dds.ec/blog) cranking out code to help security domain experts delve into the dark art of data science.

We’ve covered quite a bit of ground since January 1st, but I’m using this post to focus more on what we’ve produced using R, since that’s our go-to language.

Jay used the blog to do a [long-form answer](http://datadrivensecurity.info/blog/posts/2014/Jan/severski/) to a question asked by @dseverski on the [SIRA](http://societyinforisk.org) mailing list and I piled on by adding a [Shiny app](http://datadrivensecurity.info/blog/posts/2014/Jan/solvo-mediocris/) into the mix (both posts make for a pretty `#spiffy` introduction to expert-opinion risk analyses in R).

Jay continued by [releasing a new honeypot data set](http://datadrivensecurity.info/blog/data/2014/01/marx.gz) and corresponding two-part[[1](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part1/),[2](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part2/)] post series to jump start analyses on that data. (There’s a D3 geo-visualization stuck in-between those posts if you’re into that sort of thing).

I got it into my head to start a project to build a [password dump analytics tool](http://datadrivensecurity.info/blog/posts/2014/Feb/ripal/) in R (with **much** more coming soon on that, including a full-on R package + Shiny app combo) and also continue the discussion we started in the book on the need for the infusion of reproducible research principles and practices in the information security domain by building off of @sucuri_security’s [Darkleech botnet](http://datadrivensecurity.info/blog/posts/2014/Feb/reproducible-research-sucuri-darkleech-data/) research.

You can follow along at home with the blog via its [RSS feed](http://datadrivensecurity.info/blog/feeds/all.atom.xml) or via the @ddsecblog Twitter account. You can also **play** along at home if you feel you have something to contribute. It’s as simple as a github pull request and some really straightforward markdown. Take a look at the blog’s [github repo](https://github.com/ddsbook/blog) and hit me up (@hrbrmstr) for details if you’ve got something to share.

If I made a Venn diagram of the cross-section of readers of this blog and the [Data Driven Security](http://dds.ec/) web sites it might be indistinguishable from a pure circle. However, just in case there are a few stragglers out there, I figured one more post on the fact that the new book by @jayjacobs & me is available _now_ in electronic form (not pre-order) wouldn’t hurt. The print book is still making its way from dead trees to store shelves and should be ready for the expected February 17th debut.

Here’s the list of links to e-tailers (man, I hate that term) who have it available for the various e-readers out there.

– [Amazon/Kindle](http://www.amazon.com/gp/product/B00I1Y7THY/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00I1Y7THY&linkCode=as2&tag=rudisdotnet-20)
– [B&N/Nook](http://www.barnesandnoble.com/w/data-driven-security-jay-jacobs/1117239036?ean=9781118793725)
– [Google Books](https://play.google.com/store/books/details?id=LQigAgAAQBAJ)
– [Kobo](http://store.kobobooks.com/en-US/ebook/data-driven-security)

If you happen to catch it out in the wild and not on this list, drop me (@hrbrmstr) a note `#pls`.

And, a huge thank you! to everyone for their kind accolades yesterday (esp to those who’ve purchased the book :-)

This tweet by @moorehn (who is usually a superb economic journalist) really bugged me.

I grabbed the raw data from EPI (http://www.epi.org/files/2012/data-swa/jobs-data/Employment%20to%20population%20ratio%20(EPOPs).xls), properly started the graph at 0 on the y-axis, and also broke out men & women (since the Excel spreadsheet had the data). It’s a really different picture:

(Figure: employment-to-population ratio, ages 25–54, men vs. women, 1975–2011)

I’m not saying employment is great right now, but it’s nowhere near a “ski jump”. So much for the state of data journalism at the start of 2014.

Here’s the hastily crafted R-code:

library(ggplot2)
library(ggthemes)
library(reshape2)
 
a <- read.csv("empvyear.csv")
b <- melt(a, id.vars="Year")
 
gg <- ggplot(data=b, aes(x=Year, y=value, group=variable))
gg <- gg + geom_line(aes(color=variable))
gg <- gg + ylim(0, 100)
gg <- gg + theme_economist()
gg <- gg + labs(x="Year", y="Employment as share of population (%)", 
                title="Employment-to-population ratio, age 25–54, 1975–2011")
gg <- gg + theme(legend.title = element_blank())
gg

And, here’s the data extracted from the Excel file:

Year,Men,Women
1975,89.0,51.0
1976,89.5,52.9
1977,90.1,54.8
1978,91.0,57.3
1979,91.1,59.0
1980,89.4,60.1
1981,89.0,61.2
1982,86.5,61.2
1983,86.1,62.0
1984,88.4,63.9
1985,88.7,65.3
1986,88.5,66.6
1987,89.0,68.2
1988,89.5,69.3
1989,89.9,70.4
1990,89.1,70.6
1991,87.5,70.1
1992,86.8,70.1
1993,87.0,70.4
1994,87.2,71.5
1995,87.6,72.2
1996,87.9,72.8
1997,88.4,73.5
1998,88.8,73.6
1999,89.0,74.1
2000,89.0,74.2
2001,87.9,73.4
2002,86.6,72.3
2003,85.9,72.0
2004,86.3,71.8
2005,86.9,72.0
2006,87.3,72.5
2007,87.5,72.5
2008,86.0,72.3
2009,81.5,70.2
2010,81.0,69.3
2011,81.4,69.0