hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Data Crunching Competitions

2013-01-31 – 13:49
Posted in Big Data, Charts & Graphs, Data Analysis, Data Visualization, DataVis, DataViz
Leave a Comment

I came across a list of data crunching and/or visualization competition sites today. I had heard of Kaggle & NITRD before but not the other ones. If you’re looking to get into data analytics/visualization or get better at it, the only way to do so is to practice. Using other data sets (outside your expertise field) may also help you do in-field analysis a little better (or differently).

* Kaggle — “Kaggle is an arena where you can match your data science skills against a global cadre of experts in statistics, mathematics, and machine learning. Whether you’re a world-class algorithm wizard competing for prize money or a novice looking to learn from the best, here’s your chance to jump in and geek out, for fame, fortune, or fun.”

* CrowdAnalytix — Crowdsourced predictive analytics platform

* Innocentive — “he Challenge Center is where InnoCentive connects Seekers and Solvers with a myriad of the world’s toughest Challenges. Seeker organizations post their Challenges in the Challenge Center and offer registered Solvers significant financial awards for the best solutions. Solvers can search for Challenges based on their interests and expertise.”

* TunedIT — “a platform for hosting data competitions for educational, scientific and business purposes”

* KDD Cup — “KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miner”

* NITRD Big Data Challenge Series — “The Big Data Challenge is an effort by the U.S. government to conceptualize new and novel approaches to extracting value from “Big Data” information sets residing in various agency silos and delivering impactful value while remaining consistent with individual agency missions This data comes from the fields of health, energy and Earth science. Competitors will be tasked with imagining analytical techniques, and describe how they may be shared as universal, cross-agency solutions that transcend the limitations of individual agencies.”

Once More Into The [PRC Aggregated] Breaches

2013-01-27 – 18:29
Posted in Breach, Charts & Graphs, DataVis, DataViz, Information Security
Leave a Comment

If you’re not on the SecurityMetrics.org mailing list you missed an interaction about the Privacy Rights Clearinghouse Chronology of Data Breaches data source started by Lance Spitzner (@lspitzner). You’ll need to subscribe to the list see the thread, but one innocent question put me down the path to taking a look at the aggregated data with the intent of helping folks understand the overall utility/efficacy of it when trying to craft messages from it.

Before delving into the data, please note that PRC does an excellent job detailing source material for the data. They fully acknowledge some of the challenges with it, but a picture (or two) is worth a thousand caveats. (NOTE: Charts & numbers have been produced from January 20th, 2013 data).

The first thing I did was try to get a feel for overall volume:

Total breach record entries across all years (2005-present): 3573
Number of entries with ‘Total Records Lost’ filled in: 751
% of entries with ‘Total Records Lost’ filled in: 21.0%

Takeaway #1: Be very wary of using any “Total Records Breached” data from this data set.

It may help to see that computation broken down by reporting source over the years that the data file spans:

This view also gives us:

Takeaway #2: Not all data sources span all years and some have very little data.

However, Lance’s original goal was to compare “human error” vs “technical hack”. To do this, he combined DISC, PHYS, PORT & STAT into one category (accidental/human :: ACC-HUM) and HACK, CARD & INSD into another (malicious/attack :: MAL-ATT). Here’s what that looks like when broken down across reporting sources across time:

(click to enlarge)

This view provides another indicator that one might not want to place a great deal of faith on the PRC’s aggregation efforts. Why? It’s highly unlikely that DatalossDB had virtually no breach recordings in 2011 (in fact, it’s more than unlikely, it’s not true). Further views will show some other potential misses from DatalossDB.

Takeaway #3: Do not assume the components of this aggregated data set are complete.

We can get a further feel for data quality and also which reporting sources are weighted more heavily (i.e. which ones have more records, thus implicitly placing a greater reliance on them for any calculations) by looking at how many records they each contributed to the aggregated population each year:

(click to enlarge)

I’m not sure why 2008 & 2009 have such small bars for Databreaches.net and PHIPrivacy.net, and you can see the 2011 gap in the DatalossDB graph.

At this point, I’d (maybe) trust some aggregate analysis of the HHS (via PHI), CA Attorney General & Media data, but would need to caveat any conclusions with the obvious biases introduced by each.

Even with these issues, I really wanted a “big picture” view of the entire set and ended up creating the following two charts:

(click to enlarge)

(You’ll probably want to view the PDF documents of each : [1] [2] given how big they are.)

Those charts show the number of breaches-by-type by month across the 2005-2013 span by reporting source. The only difference between the two is that the latter one is grouped by Lance’s “meta type” definition. These views enable us to see gaps in reporting by month (note the additional aggregation issue at the tail end of 2010 for DatalossDB) and also to get a feel for the trends of each band (note the significant increase in “unknown” in 2012 for DatalossDB).

Takeaway #4: Do not ignore the “unknown” classification when performing analysis with this data set.

We can see other data issues if we look at it from other angles, such as the state the breach was recorded in:

(click to enlarge)

We can see at least three issues (missing value and occurrences recorded not in the US) from this view, but it seems the number of breaches mostly aligns with population (discrepancies make sense given the lack of uniform breach reporting requirements).

It’s also very difficult to do any organizational analysis (I’m a big fan of looking at “repeat offenders” in general) with this data set without some significant data cleansing/normalization. For example, all of these are “Bank of America“:

[1] "Bank of America"                                                             
[2] "Wachovia, Bank of America, PNC Financial Services Group and Commerce Bancorp"
[3] "Bank of America Corp."                                                       
[4] "Citigroup, Inc., Bank of America, Corp."

Without any cleansing, here are the orgs with two or more reported breaches since 2005:

(apologies for the IFRAME but Google’s Fusion Tables are far too easy to use when embedding data tables)

Takeaway #5: Do not assume that just because a data set has been aggregated by someone and published that it’s been scrubbed well.

Even if the above sets of issues were resolved, the real details are in the “breach details” field, which is a free-form text field providing more information on who/what/when/where/why/how (with varying degrees of consistency & completeness). This is actually the information you really need. The HACK attribute is all well-and-good, but what kind of hack was it? This is one area VERIS shines. What advice are you going to give financial services (BSF) orgs from this extract:

(click to enlarge)

HACKs are up in 2012 from 2010 & 2011, but what type of HACKs against what size organizations? Should smaller orgs be worried about desktop security and/or larger orgs start focusing more on web app security? You can’t make that determination without mining that free form text field. (NOTE: as I have time, I’m trying to craft a repeatable text analysis function I can perform on that field to see what can be automatically extracted)

Takeaway #6: This data set is pretty much what the PRC says it is: a chronology of data breaches. More in-depth analysis is not advised without a massive clean-up effort.

Finally, hypothesizing that the PRC’s aggregation could have resulted in duplicate records, I created a subset of the records based solely on breach “Date Made Public” + “Organization Name” and then sifted manually through the breach text details, 6 duplicate entries were found. Interestingly enough, only one instance of duplicate records was found across reporting databases (my hunch was that DatalossDB or DataBreaches.NET would have had records other, smaller databases included; however, this particular duplicate detection mechanism does not rule this out given the quality of the data).

Conclusion/Acknowledgements

Despite the criticisms above, the efforts by the PRC and their sources for aggregation are to be commended. Without their work to document breaches we would only have the mega-media-frenzy stories and labor-intensive artifacts like the DBIR to work with. Just because the data isn’t perfect right now doesn’t mean we won’t get to a point where we have the ability to record and share this breach information like the CDC does diseases (which also ins’t perfect, btw).

I leave you with another column of numbers that shows—if broken down by organization type and breach type—there is an average of 2 breaches per-breach/org-type-per-year (according to this data):

(The complete table includes the mean, median and standard deviation for each type.)

Lance’s final question to me (on the list) was “Bob, what do recommended as the next step to answer the question – What percentage of publicly known data breaches are deliberate cyber attacks, and what percentage are human based accidental loss/disclosure?”

I’d first start with a look at the DBIR (especially this year’s) and then see if I could get a set of grad students to convert a complete set of DatalossDB records (and, perhaps, the other sources) into VERIS format for proper analysis. If any security vendors are reading this, I guarantee you’ll gain significant capital/accolades within/from the security practitioner community if you were to sponsor such an effort.

Comments, corrections & constructive criticisms are heartily welcomed. Data crunching & graphing scripts available both on request and perhaps uploaded to my github repository once I clean them up a bit.

River Post Dissected

2013-01-20 – 14:50
Posted in Charts & Graphs, DataVis, DataViz
Leave a Comment

Good Stats, Bad Stats has a really good critique of my Mississippi River post that you should read (TL;DR: my graphs & analysis need work).

SHODAN API in R (With Examples)

2013-01-17 – 15:21
Posted in Charts & Graphs, DataVis, DataViz, Information Security, Programming, R, Vulnerabilities
Leave a Comment

Folks may debate the merits of the SHODAN tool, but in my opinion it’s a valuable resource, especially if used for “good”. What is SHODAN? I think ThreatPost summed it up nicely:

“Shodan is a Web based search engine that discovers Internet facing computers, including desktops, servers and routers. The engine, created by programmer John Matherly, allows users to filter searches for systems running a specific type of application (say, Apache Web servers or FTP) and filter results by geographic region. The search engine indexes host ’banners,’ which include meta-data sent between a server and client and includes information such as the type of software run, what services are available and so on.”

I’m in R quite a bit these days and thought it would be useful to have access to the SHODAN API in R. I have a very rudimentary version of the API (search only) up on github which can be integrated into your R environment thus:

library(devtools)
install_github("Rshodan","hrbrmstr")
library(shodan)
help(shodan) # you don't really need to do this cmd

It’ll eventually be in CRAN, but I have some cleanup work to do before the maintainers will accept the submission. If you are new to R, there are a slew of dependencies you’ll need to add to the base R installation. Here’s a good description of how to do that on pretty much every platform.

After I tweeted the above reference, @shawnmer asked the following:

https://twitter.com/shawnmer/status/290904140782137344

That is not an unreasonable request, especially if one is new to R (or SHODAN). I had been working on this post and a more extended example and finally able to get enough code done to warrant publishing it. You can do far more in R than these simple charts & graphs. Imagine taking data from multiple searches–either across time or across ports–and doing a statistical comparison. Or, use some the image processing & recognition libraries within R as well as a package such as RCurl to fetch images from open webcams and attempt to identify people or objects. The following should be enough for most folks to get started.

You can cut/paste the source code here or download the whole source file.

The fundamental shortcut this library provides over just trying to code it yourself is taking the JSON response from SHODAN and turning it into an R data frame. That is not as overtly trivial as you might think and you may want to look at the source code for the library to see where I grabbed some of that code from. I’m also not 100% convinced it’s going to work under all circumstances (hence part of the 0.1 status).

library(shodan)
library(ggplot2)
library(xtable)
library(maps)
library(rworldmap)
library(ggthemes)
 
 
# if you're behind a proxy, setting this will help
# but it's strongly suggested/encouraged that you stick the values in a file and 
# read them in vs paste them in a script
# options(RCurlOptions = list(proxy="proxyhost", proxyuserpwd="user:pass"))
 
setSHODANKey("~/.shodankey")
 
# query example taken from Michael “theprez98” Schearer's DEFCON 18 presentation
# https://www.defcon.org/images/defcon-18/dc-18-presentations/Schearer/DEFCON-18-Schearer-SHODAN.pdf
 
# find all Cisco IOS devies that may have an unauthenticated admin login
# setting trace to be TRUE to see the progress of the query
result = SHODANQuery(query="cisco last-modified www-authenticate",trace=TRUE)
 
#find the first 100 found memcached instances
#result = SHODANQuery(query='port:11211',limit=100,trace=TRUE)
 
df = result$matches
 
# aggregate result by operating system
# you can use this one if you want to filter out NA's completely
#df.summary.by.os = ddply(df, .(os), summarise, N=sum(as.numeric(factor(os))))
#this one provides count of NA's (i.e. unidentified systems)
df.summary.by.os = ddply(df, .(os), summarise, N=length(os))
 
# sort & see the results in a text table
df.summary.by.os = transform(df.summary.by.os, os = reorder(os, -N))
df.summary.by.os

That will yield:

FALSE                 os   N
FALSE 1      Linux 2.4.x  60
FALSE 2      Linux 2.6.x   6
FALSE 3 Linux recent 2.4   2
FALSE 4     Windows 2000   2
FALSE 5   Windows 7 or 8  10
FALSE 6       Windows XP   8
FALSE 7             <NA> 112

You can plot it with:

# plot a bar chart of them
(ggplot(df.summary.by.os,aes(x=os,y=N,fill=os)) + 
   geom_bar(stat="identity") + 
   theme_few() +
   labs(y="Count",title="SHODAN Search Results by OS"))

to yield:

and:

world = map_data("world")
(ggplot() +
   geom_polygon(data=world, aes(x=long, y=lat, group=group)) +
   geom_point(data=df, aes(x=longitude, y=latitude), colour="#EE760033",size=1.75) +
   labs(x="",y="") +
   theme_few())

You can easily do the same by country:

# sort & view the results by country
# see above if you don't want to filter out NA's
df.summary.by.country_code = ddply(df, .(country_code, country_name), summarise, N=sum(!is.na(country_code)))
df.summary.by.country_code = transform(df.summary.by.country_code, country_code = reorder(country_code, -N))
 
df.summary.by.country_code

##    country_code              country_name  N
## 1            AR                 Argentina  2
## 2            AT                   Austria  2
## 3            AU                 Australia  2
## 4            BE                   Belgium  2
## 5            BN         Brunei Darussalam  2
## 6            BR                    Brazil 14
## 7            CA                    Canada 16
## 8            CN                     China  6
## 9            CO                  Colombia  4
## 10           CZ            Czech Republic  2
## 11           DE                   Germany 12
## 12           EE                   Estonia  4
## 13           ES                     Spain  4
## 14           FR                    France 10
## 15           HK                 Hong Kong  2
## 16           HU                   Hungary  2
## 17           IN                     India 10
## 18           IR Iran, Islamic Republic of  4
## 19           IT                     Italy  4
## 20           LV                    Latvia  4
## 21           MX                    Mexico  2
## 22           PK                  Pakistan  4
## 23           PL                    Poland 16
## 24           RU        Russian Federation 14
## 25           SG                 Singapore  2
## 26           SK                  Slovakia  2
## 27           TW                    Taiwan  6
## 28           UA                   Ukraine  2
## 29           US             United States 28
## 30           VE                 Venezuela  2
## 31         <NA>                      <NA>  0

(ggplot(df.summary.by.country_code,aes(x=country_code,y=N)) + 
  geom_bar(stat="identity") +
  theme_few() +
  labs(y="Count",x="Country",title="SHODAN Search Results by Country"))

And, easily generate the must-have choropleth:

# except make a choropleth
# using the very simple rworldmap process
shodanChoropleth = joinCountryData2Map( df.summary.by.country_code, joinCode = "ISO2", nameJoinColumn = "country_code")
par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")
mapCountryData(shodanChoropleth, nameColumnToPlot="N",colourPalette="terrain",catMethod="fixedWidth")

Again, producing pretty pictures is all well-and-good, but it’s best to start with some good questions you need answering to make any visualization worthwhile. In the coming weeks, I’ll do some posts that show what types of questions you may want to ask/answer with R & SHODAN.

I encourage folks that have issues, concerns or requests to use github vs post in the comments, but I’ll try to respond to either as quickly as possible.

How Low Can It [The Mississsippi River] Go?

2013-01-12 – 22:35
Posted in Charts & Graphs, DataVis, DataViz
Tagged drought, mississippi
Comments (1)

Good Stats, Bad Stats has a really good critique of this post that you should read after this (so you know how to avoid the mistakes I made :-)

I’ve heard quite a bit about the current problems with the Mississippi River and wanted to see for myself (with data) just how bad it is.

St Louis seems to be quite indicative of the severity of the situation, so I pulled the USGS “stream” records for it and also the historic low water level records for it and put them both into R for some analysis & graphing:

click for larger version

click for larger version

They are both in PDF format as well [1] [2]

As you can see, there have only been four other (recorded) times when the river was this low and it has just come off of multi-year severely high points with a fairly rapid trend downwards. I’m sure the residents along the Mississippi do not need this data to tell them just how bad things are, but it has helped me understand just how bad the situation is.

For those interested, the R code uses ggplot2 for time-series charting along with a custom theme and various annotation aids that might be useful to others learning their way around the grammar of graphics in R (so the code is below).

#
# stream.R - graph the recent history of the Mississippi River at St Louis
#
 
library(ggplot2)
require(scales)
library(ggthemes)
 
# read in st louis, mo USGS stream data
 
df.raw = read.csv("~/Desktop/stream.txt")
 
# need date/time as an R Date for ggplot2 time series plot
 
df = data.frame(as.POSIXct(df.raw$datetime,format = "%Y-%m-%d %H:%M"),df.raw$gauge)
df = df[!is.na(df.raw$gauge),]
 
# pretty up the column names
 
colnames(df) = c("datetime","gauge")
 
# we uses these a few times
 
maxdate = max(df$datetime)
mindate = min(df$datetime)
mingauge = min(df$gauge)
 
# do the plot
 
st1 = ggplot(df, aes(datetime, gauge)) +
 
  theme_economist() + # pretty theme
 
  # background bands for various water level stages
 
  geom_rect(data=df,aes(xmin=mindate, xmax=maxdate, ymin=28, ymax=30), alpha=1, fill="khaki1") +
  geom_rect(data=df,aes(xmin=mindate, xmax=maxdate, ymin=30, ymax=35), alpha=1, fill="gold1") +
  geom_rect(data=df,aes(xmin=mindate, xmax=maxdate, ymin=35, ymax=40), alpha=1, fill="firebrick") +
 
  geom_rect(data=df,aes(xmin=mindate, xmax=maxdate, ymin=mingauge, ymax=0), alpha=1, fill="white") +
 
  # labels for the bands
 
  geom_text(data=data.frame(x=maxdate,y=29), aes(x=x,y=y,label="Action Stage "), size=3, hjust=1) +
  geom_text(data=data.frame(x=maxdate,y=32), aes(x=x,y=y,label="Flood Stage "), size=3, hjust=1) +
  geom_text(data=data.frame(x=maxdate,y=37), aes(x=x,y=y,label="Moderate Flood Stage "), size=3, hjust=1, colour="white") +
 
  geom_text(data=data.frame(x=mindate,y=mingauge/2), aes(x=x,y=y,label=" Below gauge"), size=3, hjust=0, colour="black") +
 
  # the raw stream data
 
  geom_line(size=0.15) +
 
  # change the x label to just years
 
  scale_x_datetime(breaks=date_breaks("years"), labels=date_format("%Y")) + 
 
  # labels
 
  labs(title="Mississipi River Depth at St Louis, MO", x="", y="Gauge Height (in.)") +
 
  # add a smoothed trend line
 
  geom_smooth() +
 
  # remove the legend
 
  theme(legend.position = "none")
 
# make a PDF
 
ggsave("~/Desktop/mississippi.pdf",st1,w=8,h=5,dpi=150)

#
# low.R - graph the historic low records for the Mississippi River at St Louis
#
 
library(ggplot2)
require(scales)
library(ggthemes)
 
# read in historic low records
 
df.raw = read.csv("~/Desktop/low.csv")
 
# need date/time as an R Date for ggplot2 time series plot
 
df = data.frame(as.POSIXct(df.raw$date,format = "%m/%d/%Y"),df.raw$gauge)
colnames(df) = c("date","gauge")
 
# pretty up the column names
 
maxdate = max(df$date)
mindate = min(df$date)
 
# do the plot
 
low1 = ggplot(data=df,aes(date,gauge)) + 
  geom_rect(data=df,aes(xmin=mindate, xmax=maxdate, ymin=-4.55, ymax=-4.55), alpha=1, color="firebrick",fill="firebrick") +
  geom_text(data=data.frame(x=mindate,y=-4.75), aes(x=x,y=y,label="January 2013 :: -4.55in"), size=3, hjust=0, colour="firebrick") +
  geom_line(size=0.15) + 
  labs(title="Historic Low Water Depths at St Louis, MO", x="", y="Gauge Height (in.)") +
  theme_economist()
 
ggsave("~/Desktop/low.pdf",low1,w=8,h=5,dpi=150)

Slopegraphs in R

2013-01-11 – 20:59
Posted in Charts & Graphs, DataVis, DataViz, Python, R
Leave a Comment

I updated the code to use ggsave and tweaked some of the font & line size values for more consistent (and pretty) output. This also means that I really need to get this up on github.

If you even remotely follow this blog, you’ll see that I’m kinda obsessed with slopegraphs. While I’m pretty happy with my Python implementation, I do quite a bit of data processing & data visualization in R these days and had a few free hours on a recent trip to Seattle, so I whipped up some R code to do traditional and multi-column rank-order slopegraphs in R, mostly due to a post over at Microsoft’s security blog.

#
# multicolumn-rankorder-slopegraph.R
#
# 2013-01-12 - formatting tweaks
# 2013-01-10 - Initial version - boB Rudis - @hrbrmstr
#
# Pretty much explained by the script title. This is an R script which is designed to produce
# 2+ column rank-order slopegraphs with the ability to highlight meaningful patterns
#
 
library(ggplot2)
library(reshape2)
 
# transcription of table from:
# http://blogs.technet.com/b/security/archive/2013/01/07/operating-system-infection-rates-the-most-common-malware-families-on-each-platform.aspx
#
# You can download it from: 
# https://docs.google.com/spreadsheet/ccc?key=0AlCY1qfmPPZVdHpwYk0xYkh3d2xLN0lwTFJrWXppZ2c
 
df = read.csv("~/Desktop/malware.csv")
 
# For this slopegraph, we care that #1 is at the top and that higher value #'s are at the bottom, so we 
# negate the rank values in the table we just read in
 
df$Rank.Win7.SP1 = -df$Rank.Win7.SP1
df$Rank.Win7.RTM = -df$Rank.Win7.RTM
df$Rank.Vista = -df$Rank.Vista
df$Rank.XP = -df$Rank.XP
 
# Also, we are really comparing the end state (ultimately) so sort the list by the end state.
# In this case, it's the Windows 7 malware data.
 
df$Family = with(df, reorder(Family, Rank.Win7.SP1))
 
# We need to take the multi-columns and make it into 3 for line-graph processing 
 
dfm = melt(df)
 
# We need to take the multi-columns and make it into 3 for line-graph processing 
 
dfm = melt(df)
 
# We define our color palette manually so we can highlight the lines that "matter".
# This means you'll need to generate the slopegraph at least one time prior to determine
# which lines need coloring. This should be something we pick up algorithmically, eventually
 
sgPalette = c("#990000", "#990000",  "#CCCCCC", "#CCCCCC", "#CCCCCC","#CCCCCC", "#990000", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC")
#sgPalette = c("#CCCCCC", "#CCCCCC",  "#CCCCCC", "#CCCCCC", "#CCCCCC","#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC")
#sgPalette = c("#000000", "#000000",  "#000000", "#000000", "#000000","#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000")
 
 
# start the plot
#
# we do a ton of customisations to the plain ggplot canvas, but it's not rocket science
 
sg = ggplot(dfm, aes(factor(variable), value, 
                     group = Family, 
                     colour = Family, 
                     label = Family)) +
  scale_colour_manual(values=sgPalette) +
  theme(legend.position = "none", 
        axis.text.x = element_text(size=5),
        axis.text.y=element_blank(), 
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        axis.ticks=element_blank(),
        axis.line=element_blank(),
        panel.grid.major = element_line("black", size = 0.1),
        panel.grid.major = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.background = element_blank())
 
# plot the right-most labels
 
sg1 = sg + geom_line(size=0.15) + 
  geom_text(data = subset(dfm, variable == "Rank.Win7.SP1"), 
            aes(x = factor(variable), label=sprintf(" %-2d %s",-(value),Family)), size = 1.75, hjust = 0) 
 
# plot the left-most labels
 
sg1 = sg1 + geom_text(data = subset(dfm, variable == "Rank.XP"), 
                     aes(x = factor(variable), label=sprintf("%s %2d ",Family,-(value))), size = 1.75, hjust = 1)
 
# this ratio seems to work well for png output
# you'll need to tweak font size for PDF output, but PDF will make post-processing in 
# Illustrator or Inkscape much easier.
 
ggsave("~/Desktop/malware.pdf",sg1,w=8,h=5,dpi=150)

Click for larger version

I really didn’t think the table told a story well and I truly believe slopegraphs are pretty good at telling stories.

This bit of R code is far from generic and requires the data investigator to do some work to make it an effective visualization, but (I think) it’s one of the better starts at a slopegraph library in R. It suffers from the same issues I’ve pointed out before, but it’s far fewer lines of code than my Python version and it handles multi-column slopegraphs quite nicely.

To be truly effective, you’ll need to plot the slopegraph first and then figure out which nodes to highlight and change the sgPalette accordingly to help the reader/viewer focus on what’s important.

I’ll post everything on github once I recover from cross-country travel and—as always–welcome feedback and fixes.

CVE Queries Right From Your Browser’s Address Bar

2012-12-28 – 21:09
Posted in Browsers, Chrome, Vulnerabilities
Leave a Comment

I’m not sure why I never did this earlier, but a post on LifeHacker gave me an idea to add location bar quick search of CVEs (Common Vulnerabilities and Exposures), no doubt due to their example on searching LifeHacker for “security”.

My two favorite sites for searching CVE specifics are, at present, Risk IO’s and CVE Details.

I’m fairly certain anyone in security reading this can figure out the rest, but as I’m ever a slave to minutiae, here are the two shortcuts I’ve setup in Chrome:

Title: CVE Details
Search URL: http://cvedetails.com/cve-details.php?cve_id=%s
Shortcut: cved

Title: Risk I/O Vulnerability Search
Search URL: https://db.risk.io/?q=%s
Shortcut: cvedb

Here’s what the location bar changes to when I use cvedb to search for 2012‑4774

In reality, this is only saving a scroll and a click since entering CVE‑2012‑4774 into an unoptimized location bar would have just searched Google and given me most of the usual suspects in the first few links. Still, it saves some time and immediately gets me the vulnerability data from the sites I prefer.

I may start poking to see what other security-related searches I can setup in the location bar.

Easier HTML Table-scraping For Scripts With Google Drive

2012-12-17 – 15:21
Posted in Google Docs, HTML5, Programming, Python
Leave a Comment

We had our first, real, snowfall of the season in Maine today and that usually means school delays/closings. Our “local” station – @WCHS6 – has a Storm Center Closings page as well as an SMS notification service. I decided this morning that I needed a command line version (and, eventually, a version that sends me a Twitter DM), but I also was tight for time (a lunchtime meeting ending early is responsible for this blog post).

While I’ve consumed my share of Beautiful Soup and can throw down some mechanize with the best of them, it came to me that there may be an even easier way, and one that may also help with the eventual blocking of such a scraping service.

I setup a Google Drive spreadsheet to use the importHTML formula to read in the closings table on the page:

=importHTML("http://www.wcsh6.com/weather/severe_weather/cancellations_closings/default.aspx","table",0)

Then did a File→Publish to the web and setup up Sheet 1 to “Automatically republish when changes are made” and also to have the link be to the CSV version of the data:

The raw output looks a bit like:

Name,Status,Last Updated
,,
Westbook Seniors,Luncheon PPD to January 7th,12/17/2012 5:22:51
,,
Allied Wheelchair Van Services,Closed,12/17/2012 6:49:47
,,
American Legion - Dixfield,Bingo cancelled,12/17/2012 11:44:12
,,
American Legion Post 155 - Naples,Closed,12/17/2012 12:49:00

The conversion has some “blank” lines but that’s easy enough to filter out with some quick bash:

curl --silent "https://docs.google.com/spreadsheet/pub?key=0AlCY1qfmPPZVdFBsX3kzLUVHZl9Mdmw3bS1POWNsWnc&single=true&gid=0&outpu
t=csv" | grep -v "^,,"

And, looking for the specific school(s) of our kids is an easy grep as well.

The reason this is interesting is that the importHTML is dynamic and will re-convert the HTML table each time the code retrieves the CSV URL. Couple that with the fact that it’s far less likely that Google will be blocked than it is my IP address(es) and this seems to be a pretty nice alternative to traditional parsing.

If I get some time over the break, I’ll do a quick benchmark of using this method over some python and perl scraping/parsing methods.