Skip navigation

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7
dat<- data.frame(t=seq(0, 2*pi, by=0.1) )
xhrt <- function(t) 16*sin(t)^3
yhrt <- function(t) 13*cos(t)-5*cos(2*t)-2*cos(3*t)-cos(4*t)
dat$y=yhrt(dat$t)
dat$x=xhrt(dat$t)
with(dat, polygon(x,y, col="hotpink"))

i heaRt you!

hearRt

(R code inspired by/lifted from: DWin on StackOverflow)

So, I’ve had some quick, consecutive blog posts around this R package I’m working on, and this one is more of an answer to my own, self-identified question of “so what?”. As I was working on an importer for AlienValut’s IP reputation database, I thought it might be interesting to visualize aspects of that data using some of the meta-information gained from the other “netintel” (my working name for the package) functions.

Acting on that impulse, I extracted all IPs that were uniquely identified as “Malicious Host“s (it’s a category in their database), did ASN & peer lookups for them and made two DrL graphs from them (I did a test singular graph but it would require a Times Square monitor to view).

h1

h2

You’ll need to select both images to make them bigger to view them more easily. Red nodes are hosts, blue ones are the ASNs they belong to.

While some of the visualized data was pretty obvious from the data table (nigh consecutive IP addresses in some cases), seeing the malicious clusters (per ASN) was (to me) pretty interesting. I don’t perform malicious host/network analysis as part of the day job, so the apparent clustering (and, also the “disconnected” ones) may not be interesting to anyone but me, but it gave me a practical example to test for the library I’m working on and may be interesting to others. It also shows you can make pretty graphs with R.

I’ve got the crufty R code up on github now and will keep poking at it as I have time. I’ll add the code that made the above image to the repository over the weekend.

The small igraph visualization in the previous post shows the basics of what you can do with the BulkOrigin & BulkPeer functions, and I thought a larger example with some basic D3 tossed in might be even more useful.

Assuming you have the previous functions in your environment, the following builds a larger graph structure (the IPs came from an overnight sample of pcap captured communication between my MacBook Pro & cloud services) and plots a similar circular graph:

library(igraph)
 
ips = c("100.43.81.11","100.43.81.7","107.20.39.216","108.166.87.63","109.152.4.217","109.73.79.58","119.235.237.17","128.12.248.13","128.221.197.57","128.221.197.60","128.221.224.57","129.241.249.6","134.226.56.7","137.157.8.253","137.69.117.58","142.56.86.35","146.255.96.169","150.203.4.24","152.62.109.57","152.62.109.62","160.83.30.185","160.83.30.202","160.83.72.205","161.69.220.1","168.159.192.57","168.244.164.254","173.165.182.190","173.57.120.151","175.41.236.5","176.34.78.244","178.85.44.139","184.172.0.214","184.72.187.192","193.164.138.35","194.203.96.184","198.22.122.158","199.181.136.59","204.191.88.251","204.4.182.15","205.185.121.149","206.112.95.181","206.47.249.246","207.189.121.46","207.54.134.4","209.221.90.250","212.36.53.166","216.119.144.209","216.43.0.10","23.20.117.241","23.20.204.157","23.20.9.81","23.22.63.190","24.207.64.10","24.64.233.203","37.59.16.223","49.212.154.200","50.16.130.169","50.16.179.34","50.16.29.33","50.17.13.221","50.17.43.219","50.18.234.67","63.71.9.108","64.102.249.7","64.31.190.1","65.210.5.50","65.52.1.12","65.60.80.199","66.152.247.114","66.193.16.162","66.249.71.143","66.249.71.47","66.249.72.76","66.41.34.181","69.164.221.186","69.171.229.245","69.28.149.29","70.164.152.31","71.127.49.50","71.41.139.254","71.87.20.2","74.112.131.127","74.114.47.11","74.121.22.10","74.125.178.81","74.125.178.82","74.125.178.88","74.125.178.94","74.176.163.56","76.118.2.138","76.126.174.105","76.14.60.62","76.168.198.238","76.22.130.45","77.79.6.37","81.137.59.193","82.132.239.186","82.132.239.97","8.28.16.254","83.111.54.154","83.251.15.145","84.61.15.10","85.90.76.149","88.211.53.36","89.204.182.67","93.186.30.114","96.27.136.169","97.107.138.192","98.158.20.231","98.158.20.237")
origin = BulkOrigin(ips)
peers = BulkPeer(ips)
 
g = graph.empty() + vertices(ips,size=10,color="red",group=1)
g = g + vertices(unique(c(peers$Peer.AS, origin$AS)),size=10,color="lightblue",group=2)
V(g)$label = c(ips, unique(c(peers$Peer.AS, origin$AS)))
ip.edges = lapply(ips,function(x) {
  c(x,origin[origin$IP==x,]$AS)
})
bgp.edges = lapply(unique(origin$BGP.Prefix),function(x) {
  startAS = unique(origin[origin$BGP.Prefix==x,]$AS)
  pAS = peers[peers$BGP.Prefix==x,]$Peer.AS
  lapply(pAS,function(y) {
    c(startAS,y)
  })
})
g = g + edges(unlist(ip.edges))
g = g + edges(unlist(bgp.edges))
E(g)$weight = 1
g = simplify(g, edge.attr.comb=list(weight="sum"))
E(g)$arrow.size = 0
g$layout = layout.circle
plot(g)

I’ll let you run that to see how horrid a large, style-/layout-unmodified circular layout graph looks.

Thanks to a snippet on StackOverflow, it’s really easy to get this into D3:

library(RJSONIO) 
temp<-cbind(V(g)$name,V(g)$group)
colnames(temp)<-c("name","group")
js1<-toJSON(temp)
write.graph(g,"/tmp/edgelist.csv",format="edgelist")
edges<-read.csv("/tmp/edgelist.csv",sep=" ",header=F)
colnames(edges)<-c("source","target")
edges<-as.matrix(edges)
js2<-toJSON(edges)
asn<-paste('{"nodes":',js1,',"links":',js2,'}',sep="")
write(asn,file="/tmp/asn.json")

We can take the resulting asn.json file and use it as a drop-in replacement for one of the example D3 force-directed layout building blocks and produce this:

Click for larger

Click for larger

Rather than view a static image, you can view the resulting D3 visualization (warning: it’s fairly big).

Both the conversion snippet and the D3 code can be easily tweaked to add more detail and be a tad more interactive/informative, but I’m hoping this larger example provides further inspiration for folks looking to do computer network analysis & visualization with R and may also help some others build more linkages between R & D3.

This is part of a larger project I’m working on, but it’s useful enough to share (github version coming soon).

The fine folks at @TeamCymru have a great service to map IP addresses to ASN/BGP information en masse.

There are libraries for Python, Perl and other languages but none for R (that I could find). So, I threw together a quick set of functions to interface to @TeamCymru’s service. Unlike many other modern services, this one isn’t XML or JSON over a RESTful interface, so the code uses a socketConnection() over the standard WHOIS TCP port to post and retrieve simple text lists.

#
# bulkorigin.R - perform bulk IP to ASN mapping via Team Cymru whois service
#
# Author: @hrbrmstr
# Version: 0.1
# Date: 2013-02-07
#
# Copyright 2013 Bob Rudis
# 
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#   
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.#
 
library(plyr)
 
# short function to trim leading/trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
 
BulkOrigin <- function(ip.list,host="v4.whois.cymru.com",port=43) {
 
  # Retrieves BGP Origin ASN info for a list of IP addresses
  #
  # NOTE: IPv4 version
  #
  # NOTE: The Team Cymru's service is NOT a GeoIP service!
  # Do not use this function for that as your results will not
  # be accurate.
  #
  # Args:
  #   ip.list : character vector of IP addresses
  #   host: which server to hit for lookup (defaults to Team Cymru's server)
  #   post: TCP port to use (defaults to 43)
  #
  # Returns:
  #   data frame of BGP Origin ASN lookup results
 
 
  # setup query
  cmd = "begin\nverbose\n" 
  ips = paste(unlist(ip.list), collapse="\n")
  cmd = sprintf("%s%s\nend\n",cmd,ips)
 
  # setup connection and post query
  con = socketConnection(host=host,port=port,blocking=TRUE,open="r+")  
  cat(cmd,file=con)
  response = readLines(con)
  close(con)
 
  # trim header, split fields and convert results
  response = response[2:length(response)]
  response = lapply(response,function(n) {
    sapply(strsplit(n,"|",fixed=TRUE),trim)
  })  
  response = adply(response,c(1))
  response = response[,2:length(response)]
  names(response) = c("AS","IP","BGP.Prefix","CC","Registry","Allocated","AS.Name")
 
  return(response)
 
}
 
BulkPeer <- function(ip.list,host="v4-peer.whois.cymru.com",port=43) {
 
  # Retrieves BGP Peer ASN info for a list of IP addresses
  #
  # NOTE: IPv4 version
  #
  # NOTE: The Team Cymru's service is NOT a GeoIP service!
  # Do not use this function for that as your results will not
  # be accurate.
  #
  # Args:
  #   ip.list : character vector of IP addresses
  #   host: which server to hit for lookup (defaults to Team Cymru's server)
  #   post: TCP port to use (defaults to 43)
  #
  # Returns:
  #   data frame of BGP Peer ASN lookup results
 
 
  # setup query
  cmd = "begin\nverbose\n" 
  ips = paste(unlist(ip.list), collapse="\n")
  cmd = sprintf("%s%s\nend\n",cmd,ips)
 
  # setup connection and post query
  con = socketConnection(host=host,port=port,blocking=TRUE,open="r+")  
  cat(cmd,file=con)
  response = readLines(con)
  close(con)
 
  # trim header, split fields and convert results
  response = response[2:length(response)]
  response = lapply(response,function(n) {
    sapply(strsplit(n,"|",fixed=TRUE),trim)
  })  
  response = adply(response,c(1))
  response = response[,2:length(response)]
  names(response) = c("Peer.AS","IP","BGP.Prefix","CC","Registry","Allocated","Peer.AS.Name")
  return(response)
 
}

Take a list of IPs, make an IP connection, formulate a bulk query and convert the results. Here’s a small script to test it:

ips = c("100.43.81.11","100.43.81.7")
origin = BulkOrigin(ips)
str(origin)
peers = BulkPeer(ips)
str(peers)

That code outputs:

'data.frame':	2 obs. of  7 variables:
 $ AS        : chr  "13238" "13238"
 $ IP        : chr  "100.43.81.11" "100.43.81.7"
 $ BGP.Prefix: chr  "100.43.64.0/19" "100.43.64.0/19"
 $ CC        : chr  "US" "US"
 $ Registry  : chr  "arin" "arin"
 $ Allocated : chr  "2011-12-06" "2011-12-06"
 $ AS.Name   : chr  "YANDEX Yandex LLC" "YANDEX Yandex LLC"

and

'data.frame':	8 obs. of  7 variables:
 $ Peer.AS     : chr  "174" "3257" "9002" "10310" ...
 $ IP          : chr  "100.43.81.11" "100.43.81.11" "100.43.81.11" "100.43.81.11" ...
 $ BGP.Prefix  : chr  "100.43.64.0/19" "100.43.64.0/19" "100.43.64.0/19" "100.43.64.0/19" ...
 $ CC          : chr  "US" "US" "US" "US" ...
 $ Registry    : chr  "arin" "arin" "arin" "arin" ...
 $ Allocated   : chr  "2011-12-06" "2011-12-06" "2011-12-06" "2011-12-06" ...
 $ Peer.AS.Name: chr  "COGENT Cogent/PSI" "TINET-BACKBONE Tinet SpA" "RETN-AS ReTN.net Autonomous System" "YAHOO-1 - Yahoo!" ...

respectively for each str().

Nothing super-sexy, but it’s part of a mission I’m on to make IP addresses “first class citizens” in R. I’m starting with building some smaller functions that accumulate IP metadata and will ultimately collect them all into a compact R library.

In the interim, I thought these two routines might be useful to some folks.

With just these two functions, you can use various graphing libraries to get a picture of the network connectivity. Here’s a small sample to get you started:

library(igraph)
 
ips = c("100.43.81.11")
origin = BulkOrigin(ips)
peers = BulkPeer(ips)
 
g = graph.empty() + vertices(c(ips, peers$Peer.AS, origin$AS),size=30)
V(g)$label = c(ips, peers$Peer.AS, origin$AS)
e = lapply(peers$Peer.AS,function(x) {
  c(origin$AS,x)
})
g = g + edges(unlist(e))
g = g + edge(ips, origin$AS)
g$layout = layout.circle
plot(g)

asn

If you know of any other R libraries or code that provide functions that operate on IP addresses or interface to services that provide IP address metadata, please drop a note in the comments or ping me on Twitter.

Yesterday, I took a very short video capture from [wind map](http://hint.fm/wind/) of the massive wind flows buffeting the northeast. I did the same this morning and stitched them together to see what a difference a day makes.

Nothing earth-shattering here, but it is amazing to see how quickly the patterns can change (they changed intra-day yesterday as well, but I didn’t have time to do a series of videos this morning. the major change was that the very clear northerly pattern seen earlier in the day was replaced with a very clear easterly flow, with much more power than seen in today’s capture).

Winter Wind Flows

I came across a list of data crunching and/or visualization competition sites today. I had heard of Kaggle & NITRD before but not the other ones. If you’re looking to get into data analytics/visualization or get better at it, the only way to do so is to practice. Using other data sets (outside your expertise field) may also help you do in-field analysis a little better (or differently).

* Kaggle — “Kaggle is an arena where you can match your data science skills against a global cadre of experts in statistics, mathematics, and machine learning. Whether you’re a world-class algorithm wizard competing for prize money or a novice looking to learn from the best, here’s your chance to jump in and geek out, for fame, fortune, or fun.”

* CrowdAnalytix — Crowdsourced predictive analytics platform

* Innocentive — “he Challenge Center is where InnoCentive connects Seekers and Solvers with a myriad of the world’s toughest Challenges. Seeker organizations post their Challenges in the Challenge Center and offer registered Solvers significant financial awards for the best solutions. Solvers can search for Challenges based on their interests and expertise.”

* TunedIT — “a platform for hosting data competitions for educational, scientific and business purposes”

* KDD Cup — “KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miner”

* NITRD Big Data Challenge Series — “The Big Data Challenge is an effort by the U.S. government to conceptualize new and novel approaches to extracting value from “Big Data” information sets residing in various agency silos and delivering impactful value while remaining consistent with individual agency missions This data comes from the fields of health, energy and Earth science. Competitors will be tasked with imagining analytical techniques, and describe how they may be shared as universal, cross-agency solutions that transcend the limitations of individual agencies.”

If you’re not on the SecurityMetrics.org mailing list you missed an interaction about the Privacy Rights Clearinghouse Chronology of Data Breaches data source started by Lance Spitzner (@lspitzner). You’ll need to subscribe to the list see the thread, but one innocent question put me down the path to taking a look at the aggregated data with the intent of helping folks understand the overall utility/efficacy of it when trying to craft messages from it.

Before delving into the data, please note that PRC does an excellent job detailing source material for the data. They fully acknowledge some of the challenges with it, but a picture (or two) is worth a thousand caveats. (NOTE: Charts & numbers have been produced from January 20th, 2013 data).

The first thing I did was try to get a feel for overall volume:

Total breach record entries across all years (2005-present): 3573
Number of entries with ‘Total Records Lost’ filled in: 751
% of entries with ‘Total Records Lost’ filled in: 21.0%

Takeaway #1: Be very wary of using any “Total Records Breached” data from this data set.

It may help to see that computation broken down by reporting source over the years that the data file spans:

complete-records-by-source-across-years

This view also gives us:

Takeaway #2: Not all data sources span all years and some have very little data.

However, Lance’s original goal was to compare “human error” vs “technical hack”. To do this, he combined DISC, PHYS, PORT & STAT into one category (accidental/human :: ACC-HUM) and HACK, CARD & INSD into another (malicious/attack :: MAL-ATT). Here’s what that looks like when broken down across reporting sources across time:

breach-count-metatype-year

(click to enlarge)

This view provides another indicator that one might not want to place a great deal of faith on the PRC’s aggregation efforts. Why? It’s highly unlikely that DatalossDB had virtually no breach recordings in 2011 (in fact, it’s more than unlikely, it’s not true). Further views will show some other potential misses from DatalossDB.

Takeaway #3: Do not assume the components of this aggregated data set are complete.

We can get a further feel for data quality and also which reporting sources are weighted more heavily (i.e. which ones have more records, thus implicitly placing a greater reliance on them for any calculations) by looking at how many records they each contributed to the aggregated population each year:

(click to enlarge)

(click to enlarge)

I’m not sure why 2008 & 2009 have such small bars for Databreaches.net and PHIPrivacy.net, and you can see the 2011 gap in the DatalossDB graph.

At this point, I’d (maybe) trust some aggregate analysis of the HHS (via PHI), CA Attorney General & Media data, but would need to caveat any conclusions with the obvious biases introduced by each.

Even with these issues, I really wanted a “big picture” view of the entire set and ended up creating the following two charts:

(click to enlarge)

(click to enlarge)

(click to enlarge)

(click to enlarge)

(You’ll probably want to view the PDF documents of each : [1] [2] given how big they are.)

Those charts show the number of breaches-by-type by month across the 2005-2013 span by reporting source. The only difference between the two is that the latter one is grouped by Lance’s “meta type” definition. These views enable us to see gaps in reporting by month (note the additional aggregation issue at the tail end of 2010 for DatalossDB) and also to get a feel for the trends of each band (note the significant increase in “unknown” in 2012 for DatalossDB).

Takeaway #4: Do not ignore the “unknown” classification when performing analysis with this data set.

We can see other data issues if we look at it from other angles, such as the state the breach was recorded in:

(click to enlarge)

(click to enlarge)

We can see at least three issues (missing value and occurrences recorded not in the US) from this view, but it seems the number of breaches mostly aligns with population (discrepancies make sense given the lack of uniform breach reporting requirements).

It’s also very difficult to do any organizational analysis (I’m a big fan of looking at “repeat offenders” in general) with this data set without some significant data cleansing/normalization. For example, all of these are “Bank of America“:

[1] "Bank of America"                                                             
[2] "Wachovia, Bank of America, PNC Financial Services Group and Commerce Bancorp"
[3] "Bank of America Corp."                                                       
[4] "Citigroup, Inc., Bank of America, Corp."

Without any cleansing, here are the orgs with two or more reported breaches since 2005:

(apologies for the IFRAME but Google’s Fusion Tables are far too easy to use when embedding data tables)

Takeaway #5: Do not assume that just because a data set has been aggregated by someone and published that it’s been scrubbed well.

Even if the above sets of issues were resolved, the real details are in the “breach details” field, which is a free-form text field providing more information on who/what/when/where/why/how (with varying degrees of consistency & completeness). This is actually the information you really need. The HACK attribute is all well-and-good, but what kind of hack was it? This is one area VERIS shines. What advice are you going to give financial services (BSF) orgs from this extract:

(click to enlarge)

(click to enlarge)

HACKs are up in 2012 from 2010 & 2011, but what type of HACKs against what size organizations? Should smaller orgs be worried about desktop security and/or larger orgs start focusing more on web app security? You can’t make that determination without mining that free form text field. (NOTE: as I have time, I’m trying to craft a repeatable text analysis function I can perform on that field to see what can be automatically extracted)

Takeaway #6: This data set is pretty much what the PRC says it is: a chronology of data breaches. More in-depth analysis is not advised without a massive clean-up effort.

Finally, hypothesizing that the PRC’s aggregation could have resulted in duplicate records, I created a subset of the records based solely on breach “Date Made Public” + “Organization Name” and then sifted manually through the breach text details, 6 duplicate entries were found. Interestingly enough, only one instance of duplicate records was found across reporting databases (my hunch was that DatalossDB or DataBreaches.NET would have had records other, smaller databases included; however, this particular duplicate detection mechanism does not rule this out given the quality of the data).

Conclusion/Acknowledgements

Despite the criticisms above, the efforts by the PRC and their sources for aggregation are to be commended. Without their work to document breaches we would only have the mega-media-frenzy stories and labor-intensive artifacts like the DBIR to work with. Just because the data isn’t perfect right now doesn’t mean we won’t get to a point where we have the ability to record and share this breach information like the CDC does diseases (which also ins’t perfect, btw).

I leave you with another column of numbers that shows—if broken down by organization type and breach type—there is an average of 2 breaches per-breach/org-type-per-year (according to this data):

(The complete table includes the mean, median and standard deviation for each type.)

Lance’s final question to me (on the list) was “Bob, what do recommended as the next step to answer the question – What percentage of publicly known data breaches are deliberate cyber attacks, and what percentage are human based accidental loss/disclosure?

I’d first start with a look at the DBIR (especially this year’s) and then see if I could get a set of grad students to convert a complete set of DatalossDB records (and, perhaps, the other sources) into VERIS format for proper analysis. If any security vendors are reading this, I guarantee you’ll gain significant capital/accolades within/from the security practitioner community if you were to sponsor such an effort.

Comments, corrections & constructive criticisms are heartily welcomed. Data crunching & graphing scripts available both on request and perhaps uploaded to my github repository once I clean them up a bit.

Good Stats, Bad Stats has a really good critique of my Mississippi River post that you should read (TL;DR: my graphs & analysis need work).