Category Archives: Information Security

This is a follow-up to my [Visualizing Risky Words](http://rud.is/b/2013/03/06/visualizing-risky-words/) post. You’ll need to read that for context if you’re just jumping in now. Full R code for the generated images (which are pretty large) is at the end.

Aesthetics are the primary reason for using a word cloud, though on a well-crafted one you can pretty quickly recognize which words were more important. An interactive bubble chart is a tad better as it lets you explore the corpus elements that contained the terms (a feature I have not added yet).

I would posit that a simple bar chart can be of similar use if one is trying to get a feel for overall word use across a corpus:

freq-bars
(click for larger version)

It’s definitely not as sexy as a word cloud, but it may be a better visualization choice if you’re trying to do analysis vs just make a pretty picture.

If you are trying to analyze a corpus, you might want to see which elements influenced the term frequencies the most, primarily to see if there were any outliers (i.e. strong influencers). With that in mind, I took @bfist’s [corpus](http://securityblog.verizonbusiness.com/2013/03/06/2012-intsum-word-cloud/) and generated a heat map from the top terms/keywords:

risk-hm
(click for larger version)

There are some stronger influencers, but there is a pattern of general, regular usage of the terms across each corpus component. This is to be expected for this particular set as each post is going to be talking about the same types of security threats, vulnerabilities & issues.

The R code below is fully annotated, but it’s important to highlight a few items in it and on the analysis as a whole:

– The extra, corpus-specific stopword list : “week”, “report”, “security”, “weeks”, “tuesday”, “update”, “team” : was designed after manually inspecting the initial frequency breakdowns and inserting my opinion on the efficacy (or lack thereof) of including those terms. I’m sure another iteration would add more (like “released” and “reported”). Your expert view needs to shape the analysis and, in most cases, that analysis is far from a static/one-off exercise.
– Another area of opinion was the choice of 0.7 in the removeSparseTerms(tdm, sparse=0.7) call. I started at 0.5 and worked up through 0.8, inspecting the results at each iteration. Playing around with that number and re-generating the heatmap might be an interesting exercise to perform (hint).
– Same as the above for the choice of 10 in subset(tf, tf>=10). Tweak the value and re-do the bar chart vis!
– The next step after the initial “ooh! ahh!” from a word cloud or even the above bar chart (though bar charts tend to not evoke emotional reactions) is to ask yourself “so what?”. There’s nothing inherently wrong with generating a visualization just to make one, but it’s way cooler to actually have a reason or a question in mind. One possible answer to a “so what?” for the bar chart is to take the high frequency terms and do a bigram/digraph breakdown on them and even do a larger cross-term frequency association analysis (both of which we’ll do in another post).
– The heat map would be far more useful as a D3 visualization where you could select a tile and view the corpus elements with the term highlighted or even select a term on the Y axis and view an extract from all the corpus elements that make it up. That might make it to the TODO list, but no promises.
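
To make the sparse and frequency cut-offs from the bullets above a bit more concrete, here is a minimal base-R sketch on a made-up term-document matrix. The terms and counts are invented, and keep_dense_terms is a hypothetical helper (not part of tm) that mimics what I understand removeSparseTerms to do: keep only the terms whose share of empty documents is below the threshold.

```r
# toy term-document matrix: rows are terms, columns are "documents"
# (all values are made up for illustration)
m <- matrix(c(3, 2, 4, 1, 5,
              0, 0, 7, 0, 0,
              1, 0, 2, 0, 0,
              0, 1, 0, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("malware", "java", "phishing", "scada"),
                            paste0("post", 1:5)))

# hypothetical helper: keep terms whose fraction of zero entries is below `sparse`
keep_dense_terms <- function(mat, sparse) {
  empty_share <- rowMeans(mat == 0)
  mat[empty_share < sparse, , drop = FALSE]
}

# see how the surviving vocabulary changes as the threshold moves
for (s in c(0.5, 0.7, 0.9)) {
  cat(sprintf("sparse=%.1f keeps: %s\n", s,
              paste(rownames(keep_dense_terms(m, s)), collapse = ", ")))
}

# same idea for the bar chart cut-off: total term frequency, then a threshold
tf <- rowSums(m)
tf[tf >= 10]
```

The higher the sparse value, the more rarely-used terms survive, which is why the heatmap gets taller (and noisier) as you push it up.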

I deliberately tried to make this as simple as possible for those new to R to show how straightforward and brief text corpus analysis can be (there are fewer than 20 lines of code excluding library imports, whitespace, comments and the unnecessary expansion of some of the tm function calls that could have been combined into one). Furthermore, this is really just a basic demonstration of tm package functionality. The post/code is also aimed pretty squarely at the information security crowd as we tend to not like examples that aren’t in our domain. Hopefully it makes a good starting point for folks and, as always, questions/comments are heartily encouraged.

# need this NOAWT setting if you're running it on Mac OS; doesn't hurt on others
Sys.setenv(NOAWT=TRUE)
library(ggplot2)
library(ggthemes)
library(tm)
library(Snowball) 
library(RWeka) 
library(reshape)
 
# input the raw corpus text
# you could read directly from @bfist's source : http://l.rud.is/10tUR65
a = readLines("intext.txt")
 
# convert raw text into a Corpus object
# each line will be a different "document"
c = Corpus(VectorSource(a))
 
# clean up the corpus (function calls are obvious)
c = tm_map(c, tolower)
c = tm_map(c, removePunctuation)
c = tm_map(c, removeNumbers)
 
# remove common stopwords
c = tm_map(c, removeWords, stopwords())
 
# remove custom stopwords (I made this list after inspecting the corpus)
c = tm_map(c, removeWords, c("week","report","security","weeks","tuesday","update","team"))
 
# perform basic stemming : background: http://l.rud.is/YiKB9G
# save original corpus
c_orig = c
 
# do the actual stemming
c = tm_map(c, stemDocument)
c = tm_map(c, stemCompletion, dictionary=c_orig)
 
# create term document matrix : http://l.rud.is/10tTbcK : from corpus
tdm = TermDocumentMatrix(c, control = list(minWordLength = 1))
 
# remove the sparse terms (requires trial->inspection cycle to get sparse value "right")
tdm.s = removeSparseTerms(tdm, sparse=0.7)
 
# we'll need the TDM as a matrix
m = as.matrix(tdm.s)
 
# datavis time
 
# convert matrix to data frame
m.df = data.frame(m)
 
# quick hack to make keywords - which got stuck in row.names - into a variable
m.df$keywords = rownames(m.df)
 
# "melt" the data frame ; ?melt at R console for info
m.df.melted = melt(m.df)
 
# not necessary, but I like decent column names
colnames(m.df.melted) = c("Keyword","Post","Freq")
 
# generate the heatmap
hm = ggplot(m.df.melted, aes(x=Post, y=Keyword)) + 
  geom_tile(aes(fill=Freq), colour="white") + 
  scale_fill_gradient(low="black", high="darkorange") + 
  labs(title="Major Keyword Use Across VZ RISK INTSUM 2012 Corpus") + 
  theme_few() +
  theme(axis.text.x  = element_text(size=6))
ggsave(plot=hm,filename="risk-hm.png",width=11,height=8.5)
 
# not done yet
 
# better? way to view frequencies
# sum rows of the tdm to get term freq count
tf = rowSums(as.matrix(tdm))
# we don't want all the words, so choose ones with 10+ freq
tf.10 = subset(tf, tf>=10)
 
# wimping out and using qplot so I don't have to make another data frame
bf = qplot(names(tf.10), tf.10, geom="bar") + 
  coord_flip() + 
  labs(title="VZ RISK INTSUM Keyword Frequencies", x="Keyword",y="Frequency") + 
  theme_few()
ggsave(plot=bf,filename="freq-bars.png",width=8.5,height=11)

If you haven’t viewed/read Wendy Nather’s (@451Wendy) insightful [Living Below The Security Poverty Line](https://451research.com/t1r-insight-living-below-the-security-poverty-line) you really need to do that before continuing (we’ll still be here when you get back).

Unfortunately, the catalyst for this post came from two recent, real-world events: my renewed exposure to the apparently ever-increasing homelessness problem in San Francisco (a side effect of choosing a hotel 10 blocks away from Moscone) and the hacking of a [small, local establishment](http://www.tnhonline.com/works-bakery-customers-targeted-by-cyber-thieves-1.2988390#.UTMuF-tASS0) resulting in exposure of customer credit cards.

If you do any mom-and-pop, brick-and-mortar shopping you’ve seen it: the Windows-based point-of-sale terminal that is the *only* computer for the owners. Your credit card will be scanned on the same machine where cat videos will be viewed and e-mail will be read. In many small shops, that machine is also where accounting functions are performed.

These truly small business (TSB) owners aren’t living below the security poverty line; they are security hobos. They *kinda* know they need to care about the safety of their data, but their focus is on their business or creative processes. When they do have time to care about security, that part of their world is so complex that it’s far easier to ignore it than to do something about it. If your immediate reaction was to disagree with my complexity posit, here are just a few tasks a TSB owner must face in a world of modern commerce:

– Updating operating system patches
– Updating browser software
– Updating Flash
– Updating Java
– Maintaining web site/Twitter/Facebook securely
– Recognizing phishing e-mails/posts/tweets
– Understanding browser security
– Keeping signature anti-malware up-to-date
– Remembering passwords for system, POS vendor, government sites, e-mail, etc.
– Maintaining secure Wi-Fi and Internet firewall
– Maintaining physical security (e.g. cameras)

Those tasks may be as autonomous as breathing for security folk and technically-savvy users, but they are extraneous tasks that are confusing for most TSBs and may often cause instability issues with the wretched POS software options out in the marketplace. These folks also cannot afford to hire security consultants to do this work for them.

Verizon’s 2012 DBIR & Trustwave’s 2012 report both showed that [these types of businesses](http://www.slate.com/articles/technology/technology/2012/03/verizon_s_data_breach_investigations_report_reveals_that_restaurants_are_the_easiest_target_for_hackers_.single.html#pagebreak_anchor_2) were part of the groups most targeted by criminals, yet the best our industry can do is dress up folks in schoolgirl costumes at @RSAConference whilst telling TSBs to keep their systems up-to-date and not re-use passwords. It’s the security equivalent of walking by a truly desperate person on the street without even making eye contact as your body language exudes the “get a job” vibe.

We have to do better than this.

Until software and hardware vendors start to—or are forced to—actually care about security, it will be up to security professionals to create the digital equivalent of a soup kitchen to make the situation better. What can you do?

– speak at local Chamber of Commerce meetings and provide practical take-aways for those who attend
– discuss security topics with friends or relatives who are TSB owners
– have your [ISSA|ISC2|NAISG] chapter set up a booth at conventions which attract TSBs (y’know…get out of the echo chamber, mebbe?)
– raise awareness through blogging and other media outlets
– produce & distribute awareness materials—a great example would be @Veracode’s non-domain [infographics](http://www.veracode.com/blog/category/infographics/)
– demand better (in general) out of your security vendors
– lobby government for better security standards

It may not seem like much, but we have to start somewhere if we’re going to find a way to help protect those who are most vulnerable, especially since it will also mean helping to keep *our own* information safe.

In case you are a truly small business owner who is reading this post, there are some things you can do to help ensure you won’t be a victim:

– Use a dedicated machine for your POS work—an iPad with [Square](https://squareup.com/) is a good option but doesn’t work for everyone
– Do not perform any operations on the Internet on the system that you do accounting tasks on
– Use @1Password to create, store & manage all your passwords on all your systems/devices
– Use [Secunia PSI](http://secunia.com/vulnerability_scanning/personal/) to help keep your Windows systems up-to-date
– Set all operating system and anti-malware software to auto-update
– Do not put your security cameras on the Internet; if you do, password protect them
– Research what your responsibilities are and what actions you’ll need to take in the event you do discover that your business or customer information has been exposed

Many thanks to all who attended the talk @jayjacobs & I gave at RSA on Tuesday, February 26, 2013. It was really great to be able to talk to so many of you afterwards as well.

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up and may have some duplicates to the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for, again, reproducible research.
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

Here’s a quick example of a couple of additional ways to use the netintel R package I’ve been tinkering with. This could easily be done on the command line with other tools, but if you’re already doing scripting/analysis with R, this provides a quick way to tell if a list of IPs is in the @AlienVault IP reputation database. Zero revelations here for regular R users, but it might help some folks who are working to make R more of a first class scripting citizen.

I whipped up the following bit of code to check to see how many IP addresses in the @Mandiant APT-1 FQDN dump were already in the AlienVault database. Reverse resolution of the Mandiant APT-1 FQDN list is a bit dubious at this point so a cross-check with known current data is a good idea. I should also point out that not all the addresses resolved “well” (there are 2046 FQDNs and my quick dig only yielded 218 usable IPs).

library(netintel)
 
# get the @AlienVault reputation DB
av.rep = Alien.Vault.Reputation()
 
# read in resolved APT-1 FQDNs list
apt.1 = read.csv("apt-1-ips.csv")
 
# basic set operation
whats.left = intersect(apt.1$ip,av.rep$IP)
 
# how many were in the quickly resolved apt-1 ip list?
# (apt.1 is a data frame, so count the ip column rather than the columns)
length(apt.1$ip)
[1] 218
 
# how many are common across the lists?
length(whats.left)
[1] 44
 
# take a quick look at them
whats.left
[1] "12.152.124.11"   "140.112.19.195"  "161.58.182.205"  "165.165.38.19"   "173.254.28.80"  
[6] "184.168.221.45"  "184.168.221.54"  "184.168.221.56"  "184.168.221.58"  "184.168.221.68" 
[11] "192.31.186.141"  "192.31.186.149"  "194.106.162.203" "199.59.166.109"  "203.170.198.56" 
[16] "204.100.63.18"   "204.93.130.138"  "205.178.189.129" "207.173.155.44"  "207.225.36.69"  
[21] "208.185.233.163" "208.69.32.230"   "208.73.210.87"   "213.63.187.70"   "216.55.83.12"   
[26] "50.63.202.62"    "63.134.215.218"  "63.246.147.10"   "64.12.75.1"      "64.12.79.57"    
[31] "64.126.12.3"     "64.14.81.30"     "64.221.131.174"  "66.228.132.20"   "66.228.132.53"  
[36] "68.165.211.181"  "69.43.160.186"   "69.43.161.167"   "69.43.161.178"   "70.90.53.170"   
[41] "74.14.204.147"   "74.220.199.6"    "74.93.92.50"     "8.5.1.34"

So, there’s roughly a 20% overlap between the resolved & “clean” APT-1 FQDN IPs (resolved quickly; I’m sure there’s a more comprehensive list) and the AlienVault reputation database.
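
If you want the overlap as a number rather than eyeballing two lengths, a small sketch like the following does it. The addresses here are made-up stand-ins (documentation ranges) for apt.1$ip and av.rep$IP, so only the arithmetic carries over.

```r
# made-up stand-ins for apt.1$ip and av.rep$IP (documentation-range addresses)
resolved   <- c("192.0.2.1", "192.0.2.2", "198.51.100.7", "203.0.113.9", "203.0.113.10")
reputation <- c("198.51.100.7", "203.0.113.9", "203.0.113.200")

common <- intersect(resolved, reputation)

# share of the resolved list that also appears in the reputation list
overlap.pct <- 100 * length(common) / length(unique(resolved))
round(overlap.pct, 1)

# the same arithmetic with the counts from above: 44 of 218
round(100 * 44 / 218, 1)
```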

For kicks, we can see where all the resolved APT-1 nodes live (BGP/network-wise) in relation to each other using some of the other library functions:

library(netintel)
library(igraph)
library(plyr)
 
apt.1 = read.csv("apt-1-ips.csv")
ips = apt.1$ip
 
# get BGP origin & peers
origin = BulkOrigin(ips)
peers = BulkPeer(ips)
 
# start graphing
g = graph.empty()
 
# Make IP vertices; IP endpoints are red
g = g + vertices(ips,size=1,color="red",group=1)
 
# Make BGP vertices ; BGP nodes are orange
g = g + vertices(unique(c(peers$Peer.AS, origin$AS)),size=1.5,color="orange",group=2)
 
# no labels
V(g)$label = ""
 
# Make IP/BGP edges
ip.edges = lapply(ips,function(x) {
  iAS = origin[origin$IP==x,]$AS
  lapply(iAS,function(y){
    c(x,y)
  })
})
 
# Make BGP/peer edges
bgp.edges = lapply(unique(origin$BGP.Prefix),function(x) {
  startAS = unique(origin[origin$BGP.Prefix==x,]$AS)
  lapply(startAS,function(z) {
    pAS = peers[peers$BGP.Prefix==x,]$Peer.AS
    lapply(pAS,function(y) {
      c(z,y)
    })
  })
})
 
# get total graph node count
node.count = table(c(unlist(ip.edges),unlist(bgp.edges)))
 
# add edges 
g = g + edges(unlist(ip.edges))
g = g + edges(unlist(bgp.edges))
 
# base edge weight == 1
E(g)$weight = 1
 
# simplify the graph
g = simplify(g, edge.attr.comb=list(weight="sum"))
 
# no arrows
E(g)$arrow.size = 0
 
# best layout for this
L = layout.fruchterman.reingold(g)
 
# plot the graph using the layout computed above
plot(g,layout=L,margin=0)

apt-1

If we take out the BGP peer relationships from the graph (i.e. don’t add the bgp.edges in the above code) we can see the mal-host clusters even more clearly (the pseudo “Death Star” look is unintentional but apropos):

Rplot01

We can also determine which ASNs the bigger clusters belong to by checking out the degree. The “top” 5 clusters are:

16509 40676 36351 26496 15169 
    7     8     8    13    54
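
One way to get counts like these (not necessarily how the numbers above were produced) is to tally how many IP edges land on each ASN. A minimal sketch, using made-up IP/ASN pairs in the same c(ip, asn) shape as the ip.edges entries and private-use/documentation ASNs:

```r
# made-up IP -> origin ASN edges, flattened to simple c(ip, asn) pairs
toy.edges <- list(c("192.0.2.1", "64500"), c("192.0.2.2", "64500"),
                  c("192.0.2.3", "64500"), c("198.51.100.1", "64501"),
                  c("198.51.100.2", "64501"), c("203.0.113.1", "64502"))

# the second element of each pair is the ASN; counting them gives each ASN's degree
asn.degree <- table(sapply(toy.edges, `[`, 2))

# largest clusters last, mirroring the ordering of the output above
tail(sort(asn.degree), 2)
```

With a real igraph object you would more likely look at the degree of the ASN vertices directly; the table approach just avoids needing the graph at all.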

While my library doesn’t support direct ASN detail lookup yet (an oversight), we can take those ASNs, check them out manually and see the results:

16509   | US | arin     | 2000-05-04 | AMAZON-02 - Amazon.com, Inc.
40676   | US | arin     | 2008-02-26 | PSYCHZ - Psychz Networks
36351   | US | arin     | 2005-12-12 | SOFTLAYER - SoftLayer Technologies Inc.
26496   | US | arin     | 2002-10-01 | AS-26496-GO-DADDY-COM-LLC - GoDaddy.com, LLC
15169   | US | arin     | 2000-03-30 | GOOGLE - Google Inc.
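
Until that lookup exists in the library, pipe-delimited output like the above is easy enough to pull into a data frame by hand. This is only a sketch: the column names are my own labels, and the cluster sizes are simply copied from the degree output earlier.

```r
# the lookup results from above, as plain text
asn.txt <- c(
  "16509   | US | arin     | 2000-05-04 | AMAZON-02 - Amazon.com, Inc.",
  "40676   | US | arin     | 2008-02-26 | PSYCHZ - Psychz Networks",
  "36351   | US | arin     | 2005-12-12 | SOFTLAYER - SoftLayer Technologies Inc.",
  "26496   | US | arin     | 2002-10-01 | AS-26496-GO-DADDY-COM-LLC - GoDaddy.com, LLC",
  "15169   | US | arin     | 2000-03-30 | GOOGLE - Google Inc.")

# split on the pipe and trim the padding; column names are assumed labels
asn.info <- read.table(text = asn.txt, sep = "|", strip.white = TRUE,
                       quote = "", comment.char = "", stringsAsFactors = FALSE,
                       col.names = c("AS", "CC", "Registry", "Allocated", "AS.Name"))

# attach the cluster sizes reported above and order by size
sizes <- c("16509" = 7, "40676" = 8, "36351" = 8, "26496" = 13, "15169" = 54)
asn.info$Hosts <- unname(sizes[as.character(asn.info$AS)])
asn.info[order(-asn.info$Hosts), c("AS", "AS.Name", "Hosts")]
```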

So Google servers are hosting the most mal-nodes from the resolved APT-1 list, followed by GoDaddy. I actually expected Amazon to be higher up in the list.

I’ll be adding igraph and ASN lookup functions to the netintel library soon. Also, if anyone has a better APT-1 IP list, please shoot me a link.

I happened across [Between Hype and Understatement: Reassessing Cyber Risks as a Security Strategy](http://scholarcommons.usf.edu/cgi/viewcontent.cgi?article=1107&context=jss) [PDF] when looking for something else at the [Journal of Strategic Security](http://scholarcommons.usf.edu/jss/) site and thought it was a good enough primer to annoy everyone with a tweet about it.

The paper is, well, _kinda_ wordy and has a Flesch-Kincaid grade reading level of 16*, making it well suited for academia, but not rapid consumption in this blog era we abide in. I promised some folks that I’d summarize it (that phrase always reminds me [of this](http://www.youtube.com/watch?v=uwAOc4g3K-g)) and so I shall (try).

The fundamental arguments are:

– we underrate & often overlook pre-existing software weaknesses (a.k.a. vulnerabilities)
– we undervalue the costs of cybercrime by focusing solely on breaches & not including preventative/deterrence costs
– we get distracted from identifying real threats by over-hyped ones
– we suck at information sharing (not enough of it; incomplete, at times; too many “standards”)
– we underreport incidents—and that this actually _enables_ attackers
– we need a centralized body to report incidents to
– we should develop a complete & uniform taxonomy
– we must pay particular attention to vulnerabilities in critical infrastructure
– we must pressure governments & vendors to take an active role in “encouraging” removing vulnerabilities from software during the SDLC, not after deployment

The author discusses specific media references (there are a plethora of links in the endnotes) when it comes to hype and notes specific government initiatives when it comes to other topics such as incident handling/threat sharing (the author has a definite UK slant).

I especially liked this quote on threat actors/actions/motives & information sharing:

> _[the] distributed nature of the Internet can make it difficult to clearly attribute some incidents [as] criminal, terrorist actions, or acts of war. Consequently, to affirm that “the principal difference” between [these] “is in the attacker’s intent” is far too simplistic when many cyber-attackers cannot be identified. It is also quite simplistic to attribute financial motivation only to cyber criminals since terrorists can be motivated by monetary gain in order to finance their political actions. [An] added difficulty is that a pattern of cyber incidents may not reveal itself unless information is shared between the different stakeholders. For example, taken in isolation, a bank’s website being temporarily unavailable may look innocuous and not worth reporting to the competent agencies. Yet, when associated with other cyber incidents in which the victims and timeframe are similar, it may reveal a concerted effort to target a particular type of business or e-government resources, a pattern of behavior that could amount to crime (fraud, espionage) or terrorism if the motive can be established. Detection thus may depend on information being shared._

She does spend quite a bit of ink on vulnerabilities. Some choice (shorter) quotes:

> _[the] economic analysis adopted by software companies does not take into account (or not sufficiently) that the costs of non- secure software are significant, that these costs will be borne by others on the network and ultimately by themselves in clean-up operations_

> _Of course, to fix the vulnerabilities after release is laudable; it is also commendable that those companies participate in huge clean-up operations of botnets like Microsoft did in 2010. However, there is nothing more paradoxical than Microsoft (and others) spending money to circumscribe the effects of the very vulnerabilities they contributed to create in the first place_

She then concludes with suggesting that governments work with ISPs to actually severely restrict or disable internet connections of users found to be infected and contributing to spam/botnets, positing that this will cause users to demand more out of software vendors or use the free market to shift their loyalties to other software providers who do more to build less vulnerable software.

Again, I think it’s a good primer on the subject (despite some dubious analogies peppered throughout), but I also think there is too much focus on vulnerabilities and not enough on threat actors/actions/motives. I do like how she mixes economic theory into a topic that is usually defined solely in terms of warfare without diminishing the potential impacts of either.

It would have been pretty evident to see the influence of Beck & Giddens even if her references to [Risk Society](http://en.wikipedia.org/wiki/Risk_society) didn’t bookend the prose. I’ll leave you with what might just be her own one-sentence summary of the entire paper and definitely apropos for our current “cyber” situation:

> _[the] risks that industrialization and modernization created tend to be global, systemic with a “boomerang effect,” and denied, overlooked, or overhyped._

*Ironically enough, this blog post comes out at F/K-level 22-23
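
For anyone curious where a grade level like that comes from, the standard Flesch-Kincaid formula only needs word, sentence and syllable counts. The sketch below takes those counts as inputs (counting syllables reliably is the hard part and is not attempted here), and the numbers passed in are made up rather than taken from the paper or this post.

```r
# Flesch-Kincaid grade level from raw counts
fk.grade <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}

# made-up counts for a dense, academic-style passage:
# long sentences (30 words each) and lots of multi-syllable words
fk.grade(words = 300, sentences = 10, syllables = 540)
```

Longer sentences and more syllables per word both push the grade up, which is how a wordy blog post can out-score the paper it is summarizing.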

So, I’ve had some quick, consecutive blog posts around this R package I’m working on, and this one is more of an answer to my own, self-identified question of “so what?”. As I was working on an importer for AlienVault’s IP reputation database, I thought it might be interesting to visualize aspects of that data using some of the meta-information gained from the other “netintel” (my working name for the package) functions.

Acting on that impulse, I extracted all IPs that were uniquely identified as “Malicious Host“s (it’s a category in their database), did ASN & peer lookups for them and made two DrL graphs from them (I did a test singular graph but it would require a Times Square monitor to view).
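
The extraction step is roughly the following. Treat it as a sketch on made-up rows: the column names (IP, Activity) and the semicolon-separated labels are assumptions about the table's shape, not necessarily what the importer returns.

```r
# made-up rows shaped loosely like a reputation table; the column names
# (IP, Activity) are assumptions for illustration
rep.db <- data.frame(
  IP = c("192.0.2.10", "192.0.2.11", "198.51.100.5", "203.0.113.77"),
  Activity = c("Malicious Host", "Scanning Host",
               "Malicious Host;Scanning Host", "Malicious Host"),
  stringsAsFactors = FALSE)

# "uniquely identified" here means Malicious Host is the only label on the entry
mal.ips <- rep.db$IP[rep.db$Activity == "Malicious Host"]
mal.ips
```

The resulting vector is what would then be handed to the ASN & peer lookups.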

h1

h2

You’ll need to select both images to make them bigger to view them more easily. Red nodes are hosts, blue ones are the ASNs they belong to.

While some of the visualized data was pretty obvious from the data table (nigh consecutive IP addresses in some cases), seeing the malicious clusters (per ASN) was (to me) pretty interesting. I don’t perform malicious host/network analysis as part of the day job, so the apparent clustering (and, also the “disconnected” ones) may not be interesting to anyone but me, but it gave me a practical example to test for the library I’m working on and may be interesting to others. It also shows you can make pretty graphs with R.

I’ve got the crufty R code up on github now and will keep poking at it as I have time. I’ll add the code that made the above image to the repository over the weekend.

The small igraph visualization in the previous post shows the basics of what you can do with the BulkOrigin & BulkPeer functions, and I thought a larger example with some basic D3 tossed in might be even more useful.

Assuming you have the previous functions in your environment, the following builds a larger graph structure (the IPs came from an overnight sample of pcap captured communication between my MacBook Pro & cloud services) and plots a similar circular graph:

library(igraph)
 
ips = c("100.43.81.11","100.43.81.7","107.20.39.216","108.166.87.63","109.152.4.217","109.73.79.58","119.235.237.17","128.12.248.13","128.221.197.57","128.221.197.60","128.221.224.57","129.241.249.6","134.226.56.7","137.157.8.253","137.69.117.58","142.56.86.35","146.255.96.169","150.203.4.24","152.62.109.57","152.62.109.62","160.83.30.185","160.83.30.202","160.83.72.205","161.69.220.1","168.159.192.57","168.244.164.254","173.165.182.190","173.57.120.151","175.41.236.5","176.34.78.244","178.85.44.139","184.172.0.214","184.72.187.192","193.164.138.35","194.203.96.184","198.22.122.158","199.181.136.59","204.191.88.251","204.4.182.15","205.185.121.149","206.112.95.181","206.47.249.246","207.189.121.46","207.54.134.4","209.221.90.250","212.36.53.166","216.119.144.209","216.43.0.10","23.20.117.241","23.20.204.157","23.20.9.81","23.22.63.190","24.207.64.10","24.64.233.203","37.59.16.223","49.212.154.200","50.16.130.169","50.16.179.34","50.16.29.33","50.17.13.221","50.17.43.219","50.18.234.67","63.71.9.108","64.102.249.7","64.31.190.1","65.210.5.50","65.52.1.12","65.60.80.199","66.152.247.114","66.193.16.162","66.249.71.143","66.249.71.47","66.249.72.76","66.41.34.181","69.164.221.186","69.171.229.245","69.28.149.29","70.164.152.31","71.127.49.50","71.41.139.254","71.87.20.2","74.112.131.127","74.114.47.11","74.121.22.10","74.125.178.81","74.125.178.82","74.125.178.88","74.125.178.94","74.176.163.56","76.118.2.138","76.126.174.105","76.14.60.62","76.168.198.238","76.22.130.45","77.79.6.37","81.137.59.193","82.132.239.186","82.132.239.97","8.28.16.254","83.111.54.154","83.251.15.145","84.61.15.10","85.90.76.149","88.211.53.36","89.204.182.67","93.186.30.114","96.27.136.169","97.107.138.192","98.158.20.231","98.158.20.237")
origin = BulkOrigin(ips)
peers = BulkPeer(ips)
 
g = graph.empty() + vertices(ips,size=10,color="red",group=1)
g = g + vertices(unique(c(peers$Peer.AS, origin$AS)),size=10,color="lightblue",group=2)
V(g)$label = c(ips, unique(c(peers$Peer.AS, origin$AS)))
ip.edges = lapply(ips,function(x) {
  c(x,origin[origin$IP==x,]$AS)
})
bgp.edges = lapply(unique(origin$BGP.Prefix),function(x) {
  startAS = unique(origin[origin$BGP.Prefix==x,]$AS)
  pAS = peers[peers$BGP.Prefix==x,]$Peer.AS
  lapply(pAS,function(y) {
    c(startAS,y)
  })
})
g = g + edges(unlist(ip.edges))
g = g + edges(unlist(bgp.edges))
E(g)$weight = 1
g = simplify(g, edge.attr.comb=list(weight="sum"))
E(g)$arrow.size = 0
g$layout = layout.circle
plot(g)

I’ll let you run that to see how horrid a large, style-/layout-unmodified circular layout graph looks.

Thanks to a snippet on StackOverflow, it’s really easy to get this into D3:

library(RJSONIO) 
temp<-cbind(V(g)$name,V(g)$group)
colnames(temp)<-c("name","group")
js1<-toJSON(temp)
write.graph(g,"/tmp/edgelist.csv",format="edgelist")
edges<-read.csv("/tmp/edgelist.csv",sep=" ",header=F)
colnames(edges)<-c("source","target")
edges<-as.matrix(edges)
js2<-toJSON(edges)
asn<-paste('{"nodes":',js1,',"links":',js2,'}',sep="")
write(asn,file="/tmp/asn.json")

We can take the resulting asn.json file and use it as a drop-in replacement for one of the example D3 force-directed layout building blocks and produce this:

Click for larger

Click for larger

Rather than view a static image, you can view the resulting D3 visualization (warning: it’s fairly big).

Both the conversion snippet and the D3 code can be easily tweaked to add more detail and be a tad more interactive/informative, but I’m hoping this larger example provides further inspiration for folks looking to do computer network analysis & visualization with R and may also help some others build more linkages between R & D3.

If you’re not on the SecurityMetrics.org mailing list you missed an interaction about the Privacy Rights Clearinghouse Chronology of Data Breaches data source started by Lance Spitzner (@lspitzner). You’ll need to subscribe to the list to see the thread, but one innocent question put me down the path to taking a look at the aggregated data with the intent of helping folks understand the overall utility/efficacy of it when trying to craft messages from it.

Before delving into the data, please note that PRC does an excellent job detailing source material for the data. They fully acknowledge some of the challenges with it, but a picture (or two) is worth a thousand caveats. (NOTE: Charts & numbers have been produced from January 20th, 2013 data).

The first thing I did was try to get a feel for overall volume:

Total breach record entries across all years (2005-present): 3573
Number of entries with ‘Total Records Lost’ filled in: 751
% of entries with ‘Total Records Lost’ filled in: 21.0%
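
The percentage is simple arithmetic. Here is a minimal R sketch of it, plus the same calculation on a toy column. The variable names and the `records_lost` vector are hypothetical, and treating blank values as `NA` is an assumption about how the file gets loaded.

```r
# Counts quoted above
total_entries <- 3573
entries_with_records <- 751
pct_filled <- round(100 * entries_with_records / total_entries, 1)
print(pct_filled)

# Same idea on a toy column where blanks were read in as NA
records_lost <- c(1200, NA, NA, 50, NA)
pct_filled_toy <- round(100 * mean(!is.na(records_lost)), 1)
print(pct_filled_toy)
```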

Takeaway #1: Be very wary of using any “Total Records Breached” data from this data set.

It may help to see that computation broken down by reporting source over the years that the data file spans:

complete-records-by-source-across-years

This view also gives us:

Takeaway #2: Not all data sources span all years and some have very little data.

However, Lance’s original goal was to compare “human error” vs “technical hack”. To do this, he combined DISC, PHYS, PORT & STAT into one category (accidental/human :: ACC-HUM) and HACK, CARD & INSD into another (malicious/attack :: MAL-ATT). Here’s what that looks like when broken down across reporting sources across time:
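
The grouping described above amounts to a lookup table. Here is a minimal R sketch, where the `breach_type` vector is toy data and the "UNKNOWN" label for unmapped codes is my own choice, not something from Lance or the PRC.

```r
# Map each breach type to the meta type described in the text
meta_map <- c(DISC = "ACC-HUM", PHYS = "ACC-HUM", PORT = "ACC-HUM", STAT = "ACC-HUM",
              HACK = "MAL-ATT", CARD = "MAL-ATT", INSD = "MAL-ATT")

# Toy vector of breach types; "UNKN" stands in for any code not in the map
breach_type <- c("HACK", "PORT", "UNKN", "DISC", "INSD")

# Look up each code; anything unmapped comes back as NA
meta_type <- unname(meta_map[breach_type])
meta_type[is.na(meta_type)] <- "UNKNOWN"

print(table(meta_type))
```

Keeping unmapped codes as their own bucket, rather than dropping them, makes it harder to lose those rows silently in later counts.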

breach-count-metatype-year

(click to enlarge)

This view provides another indicator that one might not want to place a great deal of faith in the PRC’s aggregation efforts. Why? It’s highly unlikely that DatalossDB had virtually no breach recordings in 2011 (in fact, it’s more than unlikely, it’s not true). Further views will show some other potential misses from DatalossDB.

Takeaway #3: Do not assume the components of this aggregated data set are complete.

We can get a further feel for data quality and also which reporting sources are weighted more heavily (i.e. which ones have more records, thus implicitly placing a greater reliance on them for any calculations) by looking at how many records they each contributed to the aggregated population each year:

(click to enlarge)

I’m not sure why 2008 & 2009 have such small bars for Databreaches.net and PHIPrivacy.net, and you can see the 2011 gap in the DatalossDB graph.

At this point, I’d (maybe) trust some aggregate analysis of the HHS (via PHI), CA Attorney General & Media data, but would need to caveat any conclusions with the obvious biases introduced by each.

Even with these issues, I really wanted a “big picture” view of the entire set and ended up creating the following two charts:

(click to enlarge)

(click to enlarge)

(You’ll probably want to view the PDF documents of each : [1] [2] given how big they are.)

Those charts show the number of breaches-by-type by month across the 2005-2013 span by reporting source. The only difference between the two is that the latter one is grouped by Lance’s “meta type” definition. These views enable us to see gaps in reporting by month (note the additional aggregation issue at the tail end of 2010 for DatalossDB) and also to get a feel for the trends of each band (note the significant increase in “unknown” in 2012 for DatalossDB).
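
Reporting gaps like these can also be found without a chart. Here is a minimal R sketch that cross-tabulates entries by source and month and lists the empty cells. The `entries` data frame and its column names are hypothetical toy data, not the PRC file layout.

```r
# Toy data (hypothetical): source "B" has no entries for 2010-11
entries <- data.frame(
  source = c("A", "A", "A", "B", "B"),
  month  = c("2010-10", "2010-11", "2010-12", "2010-10", "2010-12"),
  stringsAsFactors = FALSE
)

# Count entries for every source/month combination, including empty ones
counts <- as.data.frame(table(source = entries$source, month = entries$month))

# Keep only the combinations with no entries
gaps <- counts[counts$Freq == 0, ]
print(gaps)
```

A month that is missing from every source would not show up here; for that you would need to build the full month sequence first and use it as the factor levels.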

Takeaway #4: Do not ignore the “unknown” classification when performing analysis with this data set.

We can see other data issues if we look at it from other angles, such as the state the breach was recorded in:

(click to enlarge)

We can see at least three issues (a missing value and occurrences recorded outside the US) from this view, but it seems the number of breaches mostly aligns with population (discrepancies make sense given the lack of uniform breach reporting requirements).

It’s also very difficult to do any organizational analysis (I’m a big fan of looking at “repeat offenders” in general) with this data set without some significant data cleansing/normalization. For example, all of these are “Bank of America”:

[1] "Bank of America"                                                             
[2] "Wachovia, Bank of America, PNC Financial Services Group and Commerce Bancorp"
[3] "Bank of America Corp."                                                       
[4] "Citigroup, Inc., Bank of America, Corp."
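
A first pass at this kind of normalization can be as crude as a case-insensitive substring match. Here is a minimal R sketch using the four names above plus one made-up entry ("Some Other Org" is hypothetical). It only flags candidates and does not decide how to attribute multi-organization entries.

```r
orgs <- c(
  "Bank of America",
  "Wachovia, Bank of America, PNC Financial Services Group and Commerce Bancorp",
  "Bank of America Corp.",
  "Citigroup, Inc., Bank of America, Corp.",
  "Some Other Org"  # hypothetical non-matching entry
)

# Flag every entry that mentions the canonical name, ignoring case
mentions_boa <- grepl("bank of america", tolower(orgs), fixed = TRUE)
print(data.frame(org = orgs, mentions_boa = mentions_boa))
```

Entries that name several organizations still need a manual call on whether they count as one breach per organization or one shared breach.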

Without any cleansing, here are the orgs with two or more reported breaches since 2005:

(apologies for the IFRAME but Google’s Fusion Tables are far too easy to use when embedding data tables)

Takeaway #5: Do not assume that just because a data set has been aggregated by someone and published that it’s been scrubbed well.

Even if the above sets of issues were resolved, the real details are in the “breach details” field, which is a free-form text field providing more information on who/what/when/where/why/how (with varying degrees of consistency & completeness). This is actually the information you really need. The HACK attribute is all well-and-good, but what kind of hack was it? This is one area VERIS shines. What advice are you going to give financial services (BSF) orgs from this extract:

(click to enlarge)

HACKs are up in 2012 from 2010 & 2011, but what type of HACKs against what size organizations? Should smaller orgs be worried about desktop security and/or larger orgs start focusing more on web app security? You can’t make that determination without mining that free-form text field. (NOTE: as I have time, I’m trying to craft a repeatable text analysis function I can run on that field to see what can be automatically extracted.)

Takeaway #6: This data set is pretty much what the PRC says it is: a chronology of data breaches. More in-depth analysis is not advised without a massive clean-up effort.

Finally, hypothesizing that the PRC’s aggregation could have resulted in duplicate records, I created a subset of the records based solely on breach “Date Made Public” + “Organization Name” and then sifted manually through the breach text details; 6 duplicate entries were found. Interestingly enough, only one instance of duplicate records was found across reporting databases (my hunch was that DatalossDB or DataBreaches.NET would have had records that other, smaller databases also included; however, this particular duplicate detection mechanism does not rule this out given the quality of the data).
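
The candidate-selection step can be sketched in a few lines of R. This is a minimal illustration, not my actual script: the `breaches` data frame, its column names and its rows are hypothetical, and lower-casing and trimming the organization name is an extra normalization I am assuming here.

```r
# Toy data (hypothetical): rows 1 and 3 share a date and (case-insensitively) an org
breaches <- data.frame(
  date_public = c("2012-03-01", "2012-03-01", "2012-03-01", "2012-05-10"),
  org         = c("Acme Corp", "Other Org", "acme corp ", "Acme Corp"),
  source      = c("S1", "S1", "S2", "S2"),
  stringsAsFactors = FALSE
)

# Build a key from date + lightly normalized organization name
key <- paste(breaches$date_public, tolower(trimws(breaches$org)), sep = "|")

# Keep every row whose key occurs more than once
candidates <- breaches[key %in% key[duplicated(key)], ]
print(candidates)
```

The result is only a list of candidates; each pair still has to be checked by hand against the breach details text.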

Conclusion/Acknowledgements

Despite the criticisms above, the efforts by the PRC and their sources for aggregation are to be commended. Without their work to document breaches we would only have the mega-media-frenzy stories and labor-intensive artifacts like the DBIR to work with. Just because the data isn’t perfect right now doesn’t mean we won’t get to a point where we have the ability to record and share this breach information like the CDC does diseases (which also isn’t perfect, btw).

I leave you with another column of numbers that shows—if broken down by organization type and breach type—there is an average of 2 breaches per-breach/org-type-per-year (according to this data):

(The complete table includes the mean, median and standard deviation for each type.)
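
Per-type summary statistics of this kind can be computed with base R alone. Here is a minimal sketch on toy yearly counts. The `yearly` data frame, its column names and its values are hypothetical and are not taken from the PRC data.

```r
# Toy yearly breach counts (hypothetical) for two org-type/breach-type pairs
yearly <- data.frame(
  org_type    = c("BSF", "BSF", "BSF", "MED", "MED", "MED"),
  breach_type = c("HACK", "HACK", "HACK", "PHYS", "PHYS", "PHYS"),
  year        = c(2010, 2011, 2012, 2010, 2011, 2012),
  n           = c(1, 2, 3, 2, 2, 2),
  stringsAsFactors = FALSE
)

# One group per org-type/breach-type combination that actually occurs
grp <- interaction(yearly$org_type, yearly$breach_type, drop = TRUE)
by_group <- split(yearly$n, grp)

# Mean, median and standard deviation of the yearly counts per group
stats <- data.frame(
  mean   = sapply(by_group, mean),
  median = sapply(by_group, median),
  sd     = sapply(by_group, sd)
)
print(stats)
```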

Lance’s final question to me (on the list) was “Bob, what do you recommend as the next step to answer the question – What percentage of publicly known data breaches are deliberate cyber attacks, and what percentage are human based accidental loss/disclosure?”

I’d first start with a look at the DBIR (especially this year’s) and then see if I could get a set of grad students to convert a complete set of DatalossDB records (and, perhaps, the other sources) into VERIS format for proper analysis. If any security vendors are reading this, I guarantee you’ll gain significant capital/accolades within/from the security practitioner community if you were to sponsor such an effort.

Comments, corrections & constructive criticisms are heartily welcomed. Data crunching & graphing scripts are available on request and may be uploaded to my GitHub repository once I clean them up a bit.