
Author Archives: hrbrmstr

Don't look at me…I do what he does, just slower. #rstats avuncular • Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Andy Kirk (@visualisingdata) & Lynn Cherny (@arnicas) tweeted about the Guardian Word Count service/archive site, lamenting the lack of visualizations:

This gave me a chance to bust out another [Shiny](http://www.rstudio.com/shiny/) app over on our [Data Driven Security](http://datadrivensecurity.info) [shiny server](http://shiny.dds.ec/guardian-words/):

I used my trusty “`Google-Drive-spreadsheet-IMPORTHTML-to-CSV`” workflow (you can access the automagically updated data [here](https://docs.google.com/spreadsheets/d/10CZhMhpFxTPWcLauam-ydKeFrdNgHEIehKznVMHFRM0/pubhtml)) to make the CSV that updates daily on the site and is referenced by the Shiny/R code.

The code has been [gist-ified](https://gist.github.com/hrbrmstr/9570488.js), and I’ll be re-visiting it to refactor the `data.frame` creation bits and add some more charts as the data set gets larger.


(Don’t forget to take a peek at our new book, [Data-Driven Security](http://bit.ly/ddsec)!)

I shot a quick post over at the [Data Driven Security blog](http://bit.ly/1hyqJiT) explaining how to separate Twitter data gathering from R code via the Ruby `t` ([github repo](https://github.com/sferik/t)) command. Using `t` frees R code from having to be a Twitter processor and lets the analyst focus on analysis and visualization, plus you can use `t` as a substitute for Twitter GUIs if you’d rather play at the command-line:

$ t timeline ddsecblog
   @DDSecBlog
   Monitoring Credential Dumps Plus Using Twitter As a Data Source http://t.co/ThYbjRI9Za
 
   @DDSecBlog
   Nice intro to R + stats // Data Analysis and Statistical Inference free @datacamp_com course
   http://t.co/FC44FF9DSp
 
   @DDSecBlog
   Very accessible paper & cool approach to detection // Nazca: Detecting Malware Distribution in
   Large-Scale Networks http://t.co/fqrSaFvUK2
 
   @DDSecBlog
   Start of a new series by new contributing blogger @spttnnh! // @AlienVault rep db Longitudinal
   Study Part 1 : http://t.co/XM7m4zP0tr
 
   ...

The DDSec post shows how to mine the well-formatted output from the @dumpmon Twitter bot to visualize dump trends over time:

and has the code in-line and over at the [DDSec github repo](https://github.com/ddsbook/blog/blob/master/extra/src/R/dumpmon.R) [R].

I’m posting this mostly to show how to:

– use the Google spreadsheet data-munging “hack” from the [previous post](http://rud.is/b/2014/02/11/live-google-spreadsheet-for-keeping-track-of-sochi-medals/) in a Shiny context
– include it seamlessly into a web page, and
– run it locally without a great deal of wrangling

The code for the app is [in this gist](https://gist.github.com/hrbrmstr/8949172). It is unsurprisingly just like [some spiffy other code](http://www.r-bloggers.com/winter-olympic-medal-standings-presented-by-r/) you’ve seen apart from my aesthetic choices (Sochi blue! lines+dots! and, current rankings next to country names).

I won’t regurgitate the code here since it’s just as easy to view on [github](https://gist.github.com/hrbrmstr/8949172). You’re seeing the live results of the app below (unless you’ve been more conservative than most folks with your browser security settings),

but the app is actually hosted over at [Data Driven Security](http://shiny.dds.ec/sochi2014/), a blog and (woefully underpowered so reload if it coughs up blood, pls) Shiny server that I run with @jayjacobs. It appears in this WordPress post with the help of an `IFRAME`. It’s essentially the same technique the RStudio/Shiny folks use in many of their own examples.

The app uses [bootstrapPage()](http://www.rdocumentation.org/packages/shiny/functions/bootstrapPage) to help make a more responsive layout which will react nicely in an `IFRAME` setting (since you won’t know the width of the browser area you’re trying to fit the Shiny output into).

In the `ui.R` file, I have the [plotOutput()](http://www.rdocumentation.org/packages/shiny/functions/plotOutput) configured to scale to 100% of container width:

plotOutput("medalsPlot", width="100%")

and then create a seamless `IFRAME` that also sizes to max-width:

<iframe src="http://shiny.dds.ec/sochi2014/" 
        style="max-width:100%" 
        width="100%"
        height="500px"
        scrolling="no" 
        frameborder="no" 
        seamless="seamless">
</iframe>

The *really cool* part (IMO) about many Shiny apps is that you don’t need to rely on the external server to work with the visualization/output. Provided that:

– the authors have coded their app to support local execution…
– and presented the necessary `ui.R`, `server.R`, `global.R`, HTML/CSS & data files either as a github gist or a zip/gz/tar.gz file…
– and **you** have the necessary libraries installed

then, you can start the app with a simple [Rscript](http://www.rdocumentation.org/packages/utils/functions/Rscript) one-liner:

Rscript -e "shiny::runGist(8949172, launch.browser=TRUE)"

or

Rscript -e "shiny::runUrl('http://dds.ec/apps/sochi2014.tar.gz', launch.browser=TRUE)"

There is *some* danger doing this if you haven’t read through the R code prior, since it’s possible to stick some fairly malicious operations in an R script (hey, I’m an infosec professional, so we’re always paranoid :-). But, if you stick with using a gist and do examine the code, you should be fine.
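For the tarball route, one way to act on that advice is to fetch and unpack the archive without running anything, read the files, and only then launch the app. What follows is a minimal sketch, not something from the app itself: `fetch_and_list_app()` is a hypothetical helper name (it is not part of shiny), and it assumes the archive unpacks into ordinary app files such as `ui.R` and `server.R`.

```r
# Hypothetical helper (not part of shiny): download an app archive and
# unpack it so the code can be read before anything is executed.
fetch_and_list_app <- function(url, dest_dir = tempfile("shinyapp")) {
  dir.create(dest_dir, recursive = TRUE)
  archive <- file.path(dest_dir, basename(url))
  utils::download.file(url, archive, mode = "wb", quiet = TRUE)
  utils::untar(archive, exdir = dest_dir)
  # Return every unpacked file so each one can be opened and reviewed
  list.files(dest_dir, recursive = TRUE, full.names = TRUE)
}

# Example usage (commented out so nothing is downloaded or run by accident):
# files <- fetch_and_list_app("http://dds.ec/apps/sochi2014.tar.gz")
# file.show(files[grepl("\\.R$", files)])
# shiny::runApp(dirname(files[basename(files) == "server.R"][1]))
```

Nothing here executes the downloaded code; `shiny::runApp()` only comes into play once you've decided the files look sane.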

The “medals” R post by [TRInker](http://trinkerrstuff.wordpress.com/2014/02/09/sochi-olympic-medals-2/) and the re-blog by [Revolutions](http://blog.revolutionanalytics.com/2014/02/winter-olympic-medal-standings-presented-by-r.html) were both spiffy and a live example of why there’s no point in not publishing raw data.

You don’t need to have R (or any other language) do the scraping, though. The “`IMPORTHTML`” function (yes, function names seem to be ALL CAPS now over at Google Drive) in Google Drive Spreadsheets can easily do the scraping with just a simple:

=IMPORTHTML("http://www.sochi2014.com/en/medal-standings","table",0)

that will refresh on demand and every hour.

Here’s a [live URL](https://docs.google.com/spreadsheets/d/1Al7I7nS0BP50IfThs55OKv5UPI9u-ctZgZRyDQma_G8/export?format=csv&gid=0) that will give back a CSV of the results which can easily be used in R thusly:

library(RCurl)
 
sochi.medals.URL = "https://docs.google.com/spreadsheets/d/1Al7I7nS0BP50IfThs55OKv5UPI9u-ctZgZRyDQma_G8/export?format=csv&gid=0"
medals <- read.csv(textConnection(getURL(sochi.medals.URL)), 
                   stringsAsFactors = FALSE)
 
str(medals)
 
'data.frame':  89 obs. of  6 variables:
$ Rank   : chr  "1" "2" "3" "4" ...
$ Country: chr  "Norway" "Canada" "Netherlands" "United States" ...
$ Gold   : int  4 4 3 2 2 1 1 1 1 1 ...
$ Silver : int  3 3 2 1 0 3 2 0 0 0 ...
$ Bronze : int  4 2 3 3 0 3 0 1 0 0 ...
$ Total  : int  11 9 8 6 2 7 3 2 1 1 ...
 
print(medals)
 
   Rank                                   Country Gold Silver Bronze Total
1     1                                    Norway    4      3      4    11
2     2                                    Canada    4      3      2     9
3     3                               Netherlands    3      2      3     8
4     4                             United States    2      1      3     6
5     5                                   Germany    2      0      0     2
6     6                              Russian Fed.    1      3      3     7
7     7                                   Austria    1      2      0     3
8     8                                    France    1      0      1     2
9    =9                                   Belarus    1      0      0     1
10   =9                                     Korea    1      0      0     1
11   =9                                    Poland    1      0      0     1
12   =9                                  Slovakia    1      0      0     1
13   =9                               Switzerland    1      0      0     1
14   14                                    Sweden    0      3      1     4
15   15                            Czech Republic    0      2      1     3
16   16                                  Slovenia    0      1      2     3
17   17                                     Italy    0      1      1     2
18  =18                                     China    0      1      0     1
19  =18                                   Finland    0      1      0     1
20  =20                             Great Britain    0      0      1     1
21  =20                                   Ukraine    0      0      1     1
22    -                                   Albania    0      0      0     0
23    -                                   Andorra    0      0      0     0
24    -                                 Argentina    0      0      0     0
25    -                                   Armenia    0      0      0     0
...
87    -                             Virgin Isl, B    0      0      0     0
88    -                            Virgin Isl, US    0      0      0     0
89    -                                  Zimbabwe    0      0      0     0

Which frees you up from dealing with the scraping and lets you focus solely on the data.
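One wrinkle visible in the `str()` output: `Rank` comes back as `chr`, because ties are written as `=9` and countries without medals get a `-`. If you need numeric ranks, something along these lines should work. It's only a sketch on a handful of rows copied from the output above, and the `Tied` and `RankNum` column names are my own invention, not anything from the spreadsheet.

```r
# A few rows mirroring the Rank/Country columns from the output above
medals <- data.frame(
  Rank    = c("1", "8", "=9", "=9", "14", "-"),
  Country = c("Norway", "France", "Belarus", "Korea", "Sweden", "Albania"),
  stringsAsFactors = FALSE
)

# Flag ties, then strip the "=" and turn "-" (no medals yet) into NA
medals$Tied <- grepl("^=", medals$Rank)
rank_clean  <- sub("^=", "", medals$Rank)
rank_clean[rank_clean == "-"] <- NA
medals$RankNum <- as.integer(rank_clean)

str(medals)
```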

You can set it up in your own Google Docs as well. Just make sure to publish the spreadsheet to the web (with ‘everyone’ read permissions), strip off the `pubhtml` at the end of the published URL and add `export?format=csv&gid=0` in its place.
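That URL surgery is easy to script if you'd rather not do it by hand. A minimal sketch, assuming the published URL really does end in `pubhtml` as described; `pubhtml_to_csv()` is a hypothetical helper name, and the `pubhtml` form of the Sochi sheet URL below is reconstructed from the export URL used earlier:

```r
# Hypothetical helper: turn a "publish to the web" URL into a CSV export URL
pubhtml_to_csv <- function(pub_url, gid = 0) {
  sub("pubhtml$", paste0("export?format=csv&gid=", gid), pub_url)
}

# Assumed published form of the Sochi medals sheet
pub_url <- "https://docs.google.com/spreadsheets/d/1Al7I7nS0BP50IfThs55OKv5UPI9u-ctZgZRyDQma_G8/pubhtml"
csv_url <- pubhtml_to_csv(pub_url)
csv_url
```

The resulting `csv_url` can be passed to `getURL()` in place of the hard-coded string in the earlier snippet.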

Jay Jacobs (@jayjacobs)—my co-author of the soon-to-be-released book [Data-Driven Security](http://amzn.to/ddsec)—& I have been hard at work over at the book’s [sister-blog](http://dds.ec/blog) cranking out code to help security domain experts delve into the dark art of data science.

We’ve covered quite a bit of ground since January 1st, but I’m using this post to focus more on what we’ve produced using R, since that’s our go-to language.

Jay used the blog to do a [long-form answer](http://datadrivensecurity.info/blog/posts/2014/Jan/severski/) to a question asked by @dseverski on the [SIRA](http://societyinforisk.org) mailing list and I piled on by adding a [Shiny app](http://datadrivensecurity.info/blog/posts/2014/Jan/solvo-mediocris/) into the mix (both posts make for a pretty `#spiffy` introduction to expert-opinion risk analyses in R).

Jay continued by [releasing a new honeypot data set](http://datadrivensecurity.info/blog/data/2014/01/marx.gz) and corresponding two-part[[1](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part1/),[2](http://datadrivensecurity.info/blog/posts/2014/Jan/blander-part2/)] post series to jump start analyses on that data. (There’s a D3 geo-visualization stuck in-between those posts if you’re into that sort of thing).

I got it into my head to start a project to build a [password dump analytics tool](http://datadrivensecurity.info/blog/posts/2014/Feb/ripal/) in R (with **much** more coming soon on that, including a full-on R package + Shiny app combo) and also continue the discussion we started in the book on the need for the infusion of reproducible research principles and practices in the information security domain by building off of @sucuri_security’s [Darkleech botnet](http://datadrivensecurity.info/blog/posts/2014/Feb/reproducible-research-sucuri-darkleech-data/) research.

You can follow along at home with the blog via its [RSS feed](http://datadrivensecurity.info/blog/feeds/all.atom.xml) or via the @ddsecblog Twitter account. You can also **play** along at home if you feel you have something to contribute. It’s as simple as a github pull request and some really straightforward markdown. Take a look at the blog’s [github repo](https://github.com/ddsbook/blog) and hit me up (@hrbrmstr) for details if you’ve got something to share.

If I made a Venn diagram of the cross-section of readers of this blog and the [Data Driven Security](http://dds.ec/) web sites it might be indistinguishable from a pure circle. However, just in case there are a few stragglers out there, I figured one more post on the fact that the new book by @jayjacobs & me is available _now_ in electronic form (not pre-order) wouldn’t hurt. The print book is still making its way from dead trees to store shelves and should be ready for the expected February 17th debut.

Here’s the list of links to e-tailers (man, I hate that term) who have it available for the various e-readers out there.

– [Amazon/Kindle](http://www.amazon.com/gp/product/B00I1Y7THY/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00I1Y7THY&linkCode=as2&tag=rudisdotnet-20)
– [B&N/Nook](http://www.barnesandnoble.com/w/data-driven-security-jay-jacobs/1117239036?ean=9781118793725)
– [Google Books](https://play.google.com/store/books/details?id=LQigAgAAQBAJ)
– [Kobo](http://store.kobobooks.com/en-US/ebook/data-driven-security)

If you happen to catch it out in the wild and not on this list, drop me (@hrbrmstr) a note `#pls`.

And, a huge thank you! to everyone for their kind accolades yesterday (esp to those who’ve purchased the book :-)

This tweet by @moorehn (who usually is a superb economic journalist) really bugged me:

I grabbed the raw data from EPI: (http://www.epi.org/files/2012/data-swa/jobs-data/Employment%20to%20population%20ratio%20(EPOPs).xls) and properly started the graph at 0 for the y-axis and also broke out men & women (since the Excel spreadsheet had the data). It’s a really different picture:


I’m not saying employment is great right now, but it’s nowhere near a “ski jump”. So much for the state of data journalism at the start of 2014.

Here’s the hastily crafted R-code:

library(ggplot2)
library(ggthemes)
library(reshape2)
 
a <- read.csv("empvyear.csv")
b <- melt(a, id.vars="Year")
 
gg <- ggplot(data=b, aes(x=Year, y=value, group=variable))
gg <- gg + geom_line(aes(color=variable))
gg <- gg + ylim(0, 100)
gg <- gg + theme_economist()
gg <- gg + labs(x="Year", y="Employment as share of population (%)", 
                title="Employment-to-population ratio, age 25–54, 1975–2011")
gg <- gg + theme(legend.title = element_blank())
gg

And, here’s the data extracted from the Excel file:

Year,Men,Women
1975,89.0,51.0
1976,89.5,52.9
1977,90.1,54.8
1978,91.0,57.3
1979,91.1,59.0
1980,89.4,60.1
1981,89.0,61.2
1982,86.5,61.2
1983,86.1,62.0
1984,88.4,63.9
1985,88.7,65.3
1986,88.5,66.6
1987,89.0,68.2
1988,89.5,69.3
1989,89.9,70.4
1990,89.1,70.6
1991,87.5,70.1
1992,86.8,70.1
1993,87.0,70.4
1994,87.2,71.5
1995,87.6,72.2
1996,87.9,72.8
1997,88.4,73.5
1998,88.8,73.6
1999,89.0,74.1
2000,89.0,74.2
2001,87.9,73.4
2002,86.6,72.3
2003,85.9,72.0
2004,86.3,71.8
2005,86.9,72.0
2006,87.3,72.5
2007,87.5,72.5
2008,86.0,72.3
2009,81.5,70.2
2010,81,69.3
2011,81.4,69
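If the `melt()` step in the code is unfamiliar: it turns the wide `Year,Men,Women` layout into one row per year/series pair, which is the shape `ggplot2` wants for `group=variable`. Here's a rough base-R equivalent on three rows copied from the data above. It is a sketch of the reshaping only, not a replacement for the `reshape2` call:

```r
# Three rows copied from the data above
a <- read.csv(text = "Year,Men,Women
2007,87.5,72.5
2008,86.0,72.3
2009,81.5,70.2")

# Base-R equivalent of melt(a, id.vars = "Year"): one row per Year/series pair
b <- data.frame(
  Year     = rep(a$Year, times = 2),
  variable = factor(rep(c("Men", "Women"), each = nrow(a))),
  value    = c(a$Men, a$Women)
)

b
```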

RStudio is my R development environment of choice and I work primarily on/in Mac OS X. While it’s great that Apple provides a built-in Terminal application, I prefer to use [iTerm 2](http://www.iterm2.com/#/section/home) when I need to do work at a shell. The fine folks at RStudio provide a handy `Shell`… menu item off the `Tools` menu, but it (rightly) defaults to using Apple’s Terminal.app for functionality since they can’t assume what terminal application you are using.

To change it to iTerm (or whatever your favorite terminal application is) you need to fire up a text editor and change (make a backup first!) `/Applications/RStudio.app/Contents/MacOS/mac-terminal` to contain the following modified AppleScript:

#!/usr/bin/osascript
on run argv
  set dir to quoted form of (first item of argv)
  tell app "iTerm"
    activate
    tell the first terminal
      launch session "Default Session"
      tell the last session
        set name to "RStudio Session"
        write text "cd " & dir
      end tell
    end tell
  end tell
end run

That will open a new tab in iTerm, set the session (tab) name to “RStudio Session” and change the directory to the current working directory in RStudio.

More often than not, I’m just using [Alfred](http://www.alfredapp.com/) to kick up iTerm and doing the `cd` myself, but this added tweak (which you have to do **every** time you upgrade RStudio) reduces the churn when I do end up using the feature within RStudio.