
Category Archives: Data Visualization

This is a follow-up to my [Visualizing Risky Words](http://rud.is/b/2013/03/06/visualizing-risky-words/) post. You’ll need to read that for context if you’re just jumping in now. Full R code for the generated images (which are pretty large) is at the end.

Aesthetics are the primary reason for using a word cloud, though one can pretty quickly recognize which words were more important in a well-crafted one. An interactive bubble chart is a tad better as it lets you explore the corpus elements that contained the terms (a feature I have not added yet).

I would posit that a simple bar chart can be of similar use if one is trying to get a feel for overall word use across a corpus:

[freq-bars : bar chart of keyword frequencies]
(click for larger version)

It’s definitely not as sexy as a word cloud, but it may be a better visualization choice if you’re trying to do analysis vs just make a pretty picture.

If you are trying to analyze a corpus, you might want to see which elements influenced the term frequencies the most, primarily to see if there were any outliers (i.e. strong influencers). With that in mind, I took @bfist’s [corpus](http://securityblog.verizonbusiness.com/2013/03/06/2012-intsum-word-cloud/) and generated a heat map from the top terms/keywords:

[risk-hm : heat map of top terms across the corpus elements]
(click for larger version)

There are some stronger influencers, but there is a pattern of general, regular usage of the terms across each corpus component. This is to be expected for this particular set as each post is going to be talking about the same types of security threats, vulnerabilities & issues.

The R code below is fully annotated, but it’s important to highlight a few items in it and on the analysis as a whole:

– The extra, corpus-specific stopword list (“week”, “report”, “security”, “weeks”, “tuesday”, “update”, “team”) was built after manually inspecting the initial frequency breakdowns and judging the efficacy (or lack thereof) of including those terms. I’m sure another iteration would add more (like “released” and “reported”). Your expert view needs to shape the analysis and, in most cases, that analysis is far from a static/one-off exercise.
– Another judgment call was the choice of 0.7 in the removeSparseTerms(tdm, sparse=0.7) call. I started at 0.5 and worked up through 0.8, inspecting the results at each iteration. Playing around with that number and re-generating the heatmap might be an interesting exercise to perform (hint).
– Same as the above for the choice of 10 in subset(tf, tf>=10). Tweak the value and re-do the bar chart vis!
– After the initial “ooh! ahh!” from a word cloud or even the above bar chart (though bar charts tend not to evoke emotional reactions), ask yourself “so what?”. There’s nothing inherently wrong with generating a visualization just to make one, but it’s way cooler to actually have a reason or a question in mind. One possible answer to the bar chart’s “so what?” is to take the high-frequency terms and do a bigram/digraph breakdown on them, or even a larger cross-term frequency association analysis (both of which we’ll do in another post).
– The heat map would be far more useful as a D3 visualization where you could select a tile and view the corpus elements with the term highlighted or even select a term on the Y axis and view an extract from all the corpus elements that make it up. That might make it to the TODO list, but no promises.
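The trial/inspection cycles mentioned in the bullets above are easy to script. A minimal sketch, assuming the tdm term-document matrix built in the code below:

```r
# sketch: count how many terms survive at each sparsity threshold
# (assumes the 'tdm' TermDocumentMatrix from the code below)
for (s in seq(0.5, 0.8, by = 0.05)) {
  cat(sprintf("sparse=%.2f -> %d terms\n", s,
              nrow(removeSparseTerms(tdm, sparse = s))))
}
```

Watching where the surviving-term count jumps between thresholds is usually enough to pick a sensible sparse value.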

I deliberately tried to make this as simple as possible for those new to R to show how straightforward and brief text corpus analysis can be (there are fewer than 20 lines of code excluding library imports, whitespace, comments and the unnecessary expansion of some of the tm function calls that could have been combined into one). Furthermore, this is really just a basic demonstration of tm package functionality. The post/code is also aimed pretty squarely at the information security crowd as we tend not to like examples that aren’t in our domain. Hopefully it makes a good starting point for folks and, as always, questions/comments are heartily encouraged.

# need this NOAWT setting if you're running it on Mac OS; doesn't hurt on others
Sys.setenv(NOAWT=TRUE)
library(ggplot2)
library(ggthemes)
library(tm)
library(Snowball) 
library(RWeka) 
library(reshape)
 
# read in the raw corpus text
# you could read directly from @bfist's source : http://l.rud.is/10tUR65
a = readLines("intext.txt")
 
# convert raw text into a Corpus object
# each line will be a different "document"
c = Corpus(VectorSource(a))
 
# clean up the corpus (function calls are obvious)
c = tm_map(c, tolower)
c = tm_map(c, removePunctuation)
c = tm_map(c, removeNumbers)
 
# remove common stopwords
c = tm_map(c, removeWords, stopwords())
 
# remove custom stopwords (I made this list after inspecting the corpus)
c = tm_map(c, removeWords, c("week","report","security","weeks","tuesday","update","team"))
 
# perform basic stemming : background: http://l.rud.is/YiKB9G
# save original corpus
c_orig = c
 
# do the actual stemming
c = tm_map(c, stemDocument)
c = tm_map(c, stemCompletion, dictionary=c_orig)
 
# create term document matrix : http://l.rud.is/10tTbcK : from corpus
tdm = TermDocumentMatrix(c, control = list(minWordLength = 1))
 
# remove the sparse terms (requires trial->inspection cycle to get sparse value "right")
tdm.s = removeSparseTerms(tdm, sparse=0.7)
 
# we'll need the TDM as a matrix
m = as.matrix(tdm.s)
 
# datavis time
 
# convert matrix to data frame
m.df = data.frame(m)
 
# quick hack to make keywords - which got stuck in row.names - into a variable
m.df$keywords = rownames(m.df)
 
# "melt" the data frame ; ?melt at R console for info
m.df.melted = melt(m.df)
 
# not necessary, but I like decent column names
colnames(m.df.melted) = c("Keyword","Post","Freq")
 
# generate the heatmap
hm = ggplot(m.df.melted, aes(x=Post, y=Keyword)) + 
  geom_tile(aes(fill=Freq), colour="white") + 
  scale_fill_gradient(low="black", high="darkorange") + 
  labs(title="Major Keyword Use Across VZ RISK INTSUM 2012 Corpus") + 
  theme_few() +
  theme(axis.text.x  = element_text(size=6))
ggsave(plot=hm,filename="risk-hm.png",width=11,height=8.5)
 
# not done yet
 
# better? way to view frequencies
# sum rows of the tdm to get term freq count
tf = rowSums(as.matrix(tdm))
# we don't want all the words, so choose ones with 10+ freq
tf.10 = subset(tf, tf>=10)
 
# wimping out and using qplot so I don't have to make another data frame
bf = qplot(names(tf.10), tf.10, geom="bar", stat="identity") + 
  coord_flip() + 
  labs(title="VZ RISK INTSUM Keyword Frequencies", x="Keyword",y="Frequency") + 
  theme_few()
ggsave(plot=bf,filename="freq-bars.png",width=8.5,height=11)
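As a teaser for the bigram breakdown mentioned in the bullets above: RWeka (already loaded) provides NGramTokenizer, which can drive a bigram term-document matrix. A rough sketch, assuming the cleaned corpus c from the code above:

```r
# sketch: bigram term-document matrix via RWeka's NGramTokenizer
# (assumes the cleaned corpus 'c' from the code above)
bigram.tok = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.2 = TermDocumentMatrix(c, control = list(tokenize = bigram.tok))

# top 10 bigrams by frequency across the corpus
head(sort(rowSums(as.matrix(tdm.2)), decreasing = TRUE), 10)
```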

NOTE: Parts [2], [3] & [4] are also now up.

Inspired by a post by @bfist who created the following word cloud in Ruby from VZ RISK INTSUM posts (visit the link or select the visualization to go to the post):

[intsumwordcld : @bfist’s VZ RISK INTSUM word cloud]

I ♥ word clouds as much as anyone and usually run Presidential proclamations & SOTU addresses through a word cloud generator just to see what the current year’s foci are.

However, word clouds rarely provide what the creator of the visualization intends. Without performing more strict corpus analysis, one is really just getting a font-based frequency counter. While pretty, it’s not a good idea to derive too much meaning from a simple frequency count since there are phrase & sentence structure components to consider as well as concepts such as stemming (e.g. “risks” & “risk” are most likely the same thing, one is just plural…that’s a simplistic definition/example, though).
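To make that stemming point concrete, here is what a stemmer does to the example (wordStem() is from the SnowballC package, a drop-in successor to the Snowball package):

```r
library(SnowballC)

# "risk", "risks" & "risked" all reduce to the same stem, so a
# frequency count done after stemming treats them as one term
wordStem(c("risk", "risks", "risked"))
```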

I really liked Jim Vallandingham’s Building a Bubble Cloud walk through on how he made a version of @mbostock’s NYTimes convention word counts and decided to both run a rudimentary stem on the VZ RISK INTSUM corpus along with a principal component analysis [PDF] to find the core text influencers and feed the output to a modified version of the bubble cloud:

[Screenshot : interactive bubble cloud of VZ RISK INTSUM keyword frequencies]

You can select the graphic to go to the “interactive”/larger version of it. I had intended to make selecting a circle bring up the relevant documents from the post corpus, but that will have to be a task for another day.

It’s noteworthy that both @bfist’s work and this modified version share many of the same core “important” words. With some stemming refinement and further stopword removal (e.g. “week” was in the original run of this visualization and is of no value for this risk-oriented visualization, so I made it part of the stopword list), this could be a really good way to get an overview of what the risky year was all about.

I won’t promise anything, but I’ll try to get the R code cleaned up enough to post. It’s really basic tm & PCA work, so no rocket-science degree is required. Fork @vlandham’s github repo & follow the aforelinked tutorial for the crunchy D3-goodness bits.

Many thanks to all who attended the talk @jayjacobs & I gave at RSA on Tuesday, February 26, 2013. It was really great to be able to talk to so many of you afterwards as well.

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up and may have some duplicates to the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scaleable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

Here’s a quick example of a couple of additional ways to use the netintel R package I’ve been tinkering with. This could easily be done on the command line with other tools, but if you’re already doing scripting/analysis with R, this provides a quick way to tell if a list of IPs is in the @AlienVault IP reputation database. Zero revelations here for regular R users, but it might help some folks who are working to make R more of a first-class scripting citizen.

I whipped up the following bit of code to check to see how many IP addresses in the @Mandiant APT-1 FQDN dump were already in the AlienVault database. Reverse resolution of the Mandiant APT-1 FQDN list is a bit dubious at this point so a cross-check with known current data is a good idea. I should also point out that not all the addresses resolved “well” (there are 2046 FQDNs and my quick dig only yielded 218 usable IPs).
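For the curious, that quick resolution pass can be scripted from R itself. A sketch, where apt-1-fqdns.txt is a hypothetical one-FQDN-per-line file, and the dig CLI plus network access are assumed:

```r
# sketch: bulk-resolve FQDNs and keep only usable A-record IPs
fqdns = readLines("apt-1-fqdns.txt")  # hypothetical: one FQDN per line

resolve.a = function(d) {
  ans = system(sprintf("dig +short %s A", d), intern = TRUE)
  # dig +short can emit CNAMEs too; keep only dotted-quad answers
  grep("^([0-9]{1,3}\\.){3}[0-9]{1,3}$", ans, value = TRUE)
}

ips = unique(unlist(lapply(fqdns, resolve.a)))
write.csv(data.frame(ip = ips), "apt-1-ips.csv", row.names = FALSE)
```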

library(netintel)
 
# get the @AlienVault reputation DB
av.rep = Alien.Vault.Reputation()
 
# read in resolved APT-1 FQDNs list
apt.1 = read.csv("apt-1-ips.csv")
 
# basic set operation
whats.left = intersect(apt.1$ip,av.rep$IP)
 
# how many were in the quickly resolved apt-1 ip list?
nrow(apt.1)
[1] 218
 
# how many are common across the lists?
length(whats.left)
[1] 44
 
# take a quick look at them
whats.left
[1] "12.152.124.11"   "140.112.19.195"  "161.58.182.205"  "165.165.38.19"   "173.254.28.80"  
[6] "184.168.221.45"  "184.168.221.54"  "184.168.221.56"  "184.168.221.58"  "184.168.221.68" 
[11] "192.31.186.141"  "192.31.186.149"  "194.106.162.203" "199.59.166.109"  "203.170.198.56" 
[16] "204.100.63.18"   "204.93.130.138"  "205.178.189.129" "207.173.155.44"  "207.225.36.69"  
[21] "208.185.233.163" "208.69.32.230"   "208.73.210.87"   "213.63.187.70"   "216.55.83.12"   
[26] "50.63.202.62"    "63.134.215.218"  "63.246.147.10"   "64.12.75.1"      "64.12.79.57"    
[31] "64.126.12.3"     "64.14.81.30"     "64.221.131.174"  "66.228.132.20"   "66.228.132.53"  
[36] "68.165.211.181"  "69.43.160.186"   "69.43.161.167"   "69.43.161.178"   "70.90.53.170"   
[41] "74.14.204.147"   "74.220.199.6"    "74.93.92.50"     "8.5.1.34"

So, roughly a 20% overlap between the quickly resolved & “clean” APT-1 FQDN IPs (I’m sure there’s a more comprehensive list out there) and the AlienVault reputation database.
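For the record, that figure is just the ratio of the two counts above:

```r
# 44 of the 218 quickly resolved APT-1 IPs were in the AlienVault DB
round(44 / 218 * 100, 1)  # 20.2
```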

For kicks, we can see where all the resolved APT-1 nodes live (BGP/network-wise) in relation to each other using some of the other library functions:

library(netintel)
library(igraph)
library(plyr)
 
apt.1 = read.csv("apt-1-ips.csv")
ips = apt.1$ip
 
# get BGP origin & peers
origin = BulkOrigin(ips)
peers = BulkPeer(ips)
 
# start graphing
g = graph.empty()
 
# Make IP vertices; IP endpoints are red
g = g + vertices(ips,size=1,color="red",group=1)
 
# Make BGP vertices ; BGP nodes are orange
g = g + vertices(unique(c(peers$Peer.AS, origin$AS)),size=1.5,color="orange",group=2)
 
# no labels
V(g)$label = ""
 
# Make IP/BGP edges
ip.edges = lapply(ips,function(x) {
  iAS = origin[origin$IP==x,]$AS
  lapply(iAS,function(y){
    c(x,y)
  })
})
 
# Make BGP/peer edges
bgp.edges = lapply(unique(origin$BGP.Prefix),function(x) {
  startAS = unique(origin[origin$BGP.Prefix==x,]$AS)
  lapply(startAS,function(z) {
    pAS = peers[peers$BGP.Prefix==x,]$Peer.AS
    lapply(pAS,function(y) {
      c(z,y)
    })
  })
})
 
# tally how often each node appears across the edge lists (not used below; handy for a quick sanity check)
node.count = table(c(unlist(ip.edges),unlist(bgp.edges)))
 
# add edges 
g = g + edges(unlist(ip.edges))
g = g + edges(unlist(bgp.edges))
 
# base edge weight == 1
E(g)$weight = 1
 
# simplify the graph
g = simplify(g, edge.attr.comb=list(weight="sum"))
 
# no arrows
E(g)$arrow.size = 0
 
# best layout for this
L = layout.fruchterman.reingold(g)
 
# plot the graph
plot(g,margin=0)

[apt-1 : graph of resolved APT-1 IPs (red) and their ASNs/peers (orange)]

If we take out the BGP peer relationships from the graph (i.e. don’t add the bgp.edges in the above code) we can see the mal-host clusters even more clearly (the pseudo “Death Star” look is unintentional but apropos):

[Rplot01 : APT-1 host clusters per ASN, BGP peer edges removed]

We can also determine which ASNs the bigger clusters belong to by checking out the degree. The “top” 5 clusters are:

16509 40676 36351 26496 15169 
    7     8     8    13    54
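Those counts are just the vertex degrees of the ASN nodes; a sketch, assuming the g built above (with the bgp.edges left out):

```r
# top 5 ASN clusters by degree (ASN vertices were tagged group=2 above)
asn.deg = degree(g)[V(g)$group == 2]
tail(sort(asn.deg), 5)
```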

While my library doesn’t support direct ASN detail lookup yet (an oversight), we can take those ASNs, check them out manually and see the results:

16509   | US | arin     | 2000-05-04 | AMAZON-02 - Amazon.com, Inc.
40676   | US | arin     | 2008-02-26 | PSYCHZ - Psychz Networks
36351   | US | arin     | 2005-12-12 | SOFTLAYER - SoftLayer Technologies Inc.
26496   | US | arin     | 2002-10-01 | AS-26496-GO-DADDY-COM-LLC - GoDaddy.com, LLC
15169   | US | arin     | 2000-03-30 | GOOGLE - Google Inc.
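Those detail lines are in Team Cymru whois output format, so even the “manual” step can be scripted (this sketch assumes the whois CLI is installed and network access is available):

```r
# ASN details via Team Cymru's whois service
asn.detail = function(asn) {
  system(sprintf('whois -h whois.cymru.com " -v AS%s"', asn), intern = TRUE)
}

asn.detail(15169)
```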

So Google servers are hosting the most mal-nodes from the resolved APT-1 list, followed by GoDaddy. I actually expected Amazon to be higher up in the list.

I’ll be adding igraph and ASN lookup functions to the netintel library soon. Also, if anyone has a better APT-1 IP list, please shoot me a link.

So, I’ve had some quick, consecutive blog posts around this R package I’m working on, and this one is more of an answer to my own, self-identified question of “so what?”. As I was working on an importer for AlienVault’s IP reputation database, I thought it might be interesting to visualize aspects of that data using some of the meta-information gained from the other “netintel” (my working name for the package) functions.

Acting on that impulse, I extracted all IPs that were uniquely identified as “Malicious Host”s (it’s a category in their database), did ASN & peer lookups for them and made two DrL graphs from them (I did a test singular graph but it would require a Times Square monitor to view).

[h1 & h2 : DrL graphs of malicious hosts (red) and their ASNs (blue)]

You’ll need to select both images to make them bigger to view them more easily. Red nodes are hosts, blue ones are the ASNs they belong to.

While some of the visualized data was pretty obvious from the data table (nigh-consecutive IP addresses in some cases), seeing the malicious clusters (per ASN) was, to me, pretty interesting. I don’t perform malicious host/network analysis as part of the day job, so the apparent clustering (and the “disconnected” nodes) may not be interesting to anyone but me, but it gave me a practical example to test the library I’m working on and may be interesting to others. It also shows you can make pretty graphs with R.

I’ve got the crufty R code up on github now and will keep poking at it as I have time. I’ll add the code that made the above image to the repository over the weekend.

The small igraph visualization in the previous post shows the basics of what you can do with the BulkOrigin & BulkPeer functions, and I thought a larger example with some basic D3 tossed in might be even more useful.

Assuming you have the previous functions in your environment, the following builds a larger graph structure (the IPs came from an overnight sample of pcap captured communication between my MacBook Pro & cloud services) and plots a similar circular graph:

library(igraph)
 
ips = c("100.43.81.11","100.43.81.7","107.20.39.216","108.166.87.63","109.152.4.217","109.73.79.58","119.235.237.17","128.12.248.13","128.221.197.57","128.221.197.60","128.221.224.57","129.241.249.6","134.226.56.7","137.157.8.253","137.69.117.58","142.56.86.35","146.255.96.169","150.203.4.24","152.62.109.57","152.62.109.62","160.83.30.185","160.83.30.202","160.83.72.205","161.69.220.1","168.159.192.57","168.244.164.254","173.165.182.190","173.57.120.151","175.41.236.5","176.34.78.244","178.85.44.139","184.172.0.214","184.72.187.192","193.164.138.35","194.203.96.184","198.22.122.158","199.181.136.59","204.191.88.251","204.4.182.15","205.185.121.149","206.112.95.181","206.47.249.246","207.189.121.46","207.54.134.4","209.221.90.250","212.36.53.166","216.119.144.209","216.43.0.10","23.20.117.241","23.20.204.157","23.20.9.81","23.22.63.190","24.207.64.10","24.64.233.203","37.59.16.223","49.212.154.200","50.16.130.169","50.16.179.34","50.16.29.33","50.17.13.221","50.17.43.219","50.18.234.67","63.71.9.108","64.102.249.7","64.31.190.1","65.210.5.50","65.52.1.12","65.60.80.199","66.152.247.114","66.193.16.162","66.249.71.143","66.249.71.47","66.249.72.76","66.41.34.181","69.164.221.186","69.171.229.245","69.28.149.29","70.164.152.31","71.127.49.50","71.41.139.254","71.87.20.2","74.112.131.127","74.114.47.11","74.121.22.10","74.125.178.81","74.125.178.82","74.125.178.88","74.125.178.94","74.176.163.56","76.118.2.138","76.126.174.105","76.14.60.62","76.168.198.238","76.22.130.45","77.79.6.37","81.137.59.193","82.132.239.186","82.132.239.97","8.28.16.254","83.111.54.154","83.251.15.145","84.61.15.10","85.90.76.149","88.211.53.36","89.204.182.67","93.186.30.114","96.27.136.169","97.107.138.192","98.158.20.231","98.158.20.237")
origin = BulkOrigin(ips)
peers = BulkPeer(ips)
 
g = graph.empty() + vertices(ips,size=10,color="red",group=1)
g = g + vertices(unique(c(peers$Peer.AS, origin$AS)),size=10,color="lightblue",group=2)
V(g)$label = c(ips, unique(c(peers$Peer.AS, origin$AS)))
ip.edges = lapply(ips,function(x) {
  c(x,origin[origin$IP==x,]$AS)
})
bgp.edges = lapply(unique(origin$BGP.Prefix),function(x) {
  startAS = unique(origin[origin$BGP.Prefix==x,]$AS)
  pAS = peers[peers$BGP.Prefix==x,]$Peer.AS
  lapply(pAS,function(y) {
    c(startAS,y)
  })
})
g = g + edges(unlist(ip.edges))
g = g + edges(unlist(bgp.edges))
E(g)$weight = 1
g = simplify(g, edge.attr.comb=list(weight="sum"))
E(g)$arrow.size = 0
g$layout = layout.circle
plot(g)

I’ll let you run that to see how horrid a large, style-/layout-unmodified circular layout graph looks.
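If the circle offends, swapping in a force-directed layout is a two-line change to the code above:

```r
# a force-directed layout untangles the hairball considerably
g$layout = layout.fruchterman.reingold
plot(g, margin = 0)
```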

Thanks to a snippet on StackOverflow, it’s really easy to get this into D3:

library(RJSONIO) 
temp<-cbind(V(g)$name,V(g)$group)
colnames(temp)<-c("name","group")
js1<-toJSON(temp)
write.graph(g,"/tmp/edgelist.csv",format="edgelist")
edges<-read.csv("/tmp/edgelist.csv",sep=" ",header=F)
colnames(edges)<-c("source","target")
edges<-as.matrix(edges)
js2<-toJSON(edges)
asn<-paste('{"nodes":',js1,',"links":',js2,'}',sep="")
write(asn,file="/tmp/asn.json")

We can take the resulting asn.json file and use it as a drop-in replacement for one of the example D3 force-directed layout building blocks and produce this:

[D3 force-directed graph of the IP/ASN data (click for larger)]

Rather than view a static image, you can view the resulting D3 visualization (warning: it’s fairly big).

Both the conversion snippet and the D3 code can be easily tweaked to add more detail and be a tad more interactive/informative, but I’m hoping this larger example provides further inspiration for folks looking to do computer network analysis & visualization with R and may also help some others build more linkages between R & D3.

Yesterday, I took a very short video capture from [wind map](http://hint.fm/wind/) of the massive wind flows buffeting the northeast. I did the same this morning and stitched them together to see what a difference a day makes.

Nothing earth-shattering here, but it is amazing to see how quickly the patterns can change. (They changed intra-day yesterday as well, but I didn’t have time to do a series of videos this morning. The major change was that the very clear northerly pattern seen earlier in the day was replaced with a very clear easterly flow, with much more power than seen in today’s capture.)

[Winter Wind Flows : stitched wind map video]

I came across a list of data crunching and/or visualization competition sites today. I had heard of Kaggle & NITRD before but not the other ones. If you’re looking to get into data analytics/visualization or get better at it, the only way to do so is to practice. Using other data sets (outside your expertise field) may also help you do in-field analysis a little better (or differently).

* Kaggle — “Kaggle is an arena where you can match your data science skills against a global cadre of experts in statistics, mathematics, and machine learning. Whether you’re a world-class algorithm wizard competing for prize money or a novice looking to learn from the best, here’s your chance to jump in and geek out, for fame, fortune, or fun.”

* CrowdAnalytix — Crowdsourced predictive analytics platform

* Innocentive — “The Challenge Center is where InnoCentive connects Seekers and Solvers with a myriad of the world’s toughest Challenges. Seeker organizations post their Challenges in the Challenge Center and offer registered Solvers significant financial awards for the best solutions. Solvers can search for Challenges based on their interests and expertise.”

* TunedIT — “a platform for hosting data competitions for educational, scientific and business purposes”

* KDD Cup — “KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners”

* NITRD Big Data Challenge Series — “The Big Data Challenge is an effort by the U.S. government to conceptualize new and novel approaches to extracting value from ‘Big Data’ information sets residing in various agency silos and delivering impactful value while remaining consistent with individual agency missions. This data comes from the fields of health, energy and Earth science. Competitors will be tasked with imagining analytical techniques, and describing how they may be shared as universal, cross-agency solutions that transcend the limitations of individual agencies.”