Archive for the ‘Charts & Graphs’ Category

Re-imagining @panda_security’s Q1 2013 Report Pie Charts

We infosec folk eat up industry reports and most of us have no doubt already gobbled up @panda_security’s recently released Q1 2013 Report [PDF]. It’s a good read (so go ahead and read it, we’ll still be here!) and I was really happy to see a nicely stylized chart in the early pages:

Screenshot_5_24_13_8_14_AM

However, I quickly became a #sadpanda when I happened across some explosive 3D pie charts later on. Rather than deride, I thought a re-imagining would be a better use of time and let you decide which visualizations both communicate better and are more appealing.

I chose to use @Datawrapper to showcase how easy it is to build and publish pleasing and informative visualizations without even leaving your browser.

Figure 4, Original:

Panda Labs Q1 2013 Report Fig 5 (Orig)



Figure 4, Alternative:


Figure 5, Original

Fig 4: New malware strains In Q1 2013, by Type (orig)



Figure 5, Alternative (horizontal vs vertical, just to mix it up a bit):

If the charts had been closer together in the report, I would have opted for a vertical design for both and probably kept the malware-type ordering rather than sorting by highest percentage.

How would you re-imagine the pie charts? Post a link to your creations in the comments and I’ll make sure they show up embedded with the post.

Secure360 (@Secure360) Data Analysis & Visualization Talk Resources #Sec360

Many thanks to all who attended the talk @jayjacobs & I gave at @Secure360 on Wednesday, May 15, 2013. As promised, here are the slides.

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can vi[sz] along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at Coursera.

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up, so the bundle may duplicate some of the ones below.

People Mentioned

Tools Mentioned

  • R : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
  • RStudio : An amazing IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating reproducible research a joy with built-in easy access to tools like knitr.
  • iPython : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like iPython Notebooks for–again–reproducible research.
  • SecViz : Security-centric Visualization Site & Tools by @raffaelmarty
  • Mondrian : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
  • Tableau : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
  • Processing : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the Processing.js library.
  • D3 : The foundation of modern, data-driven visualization on the web.
  • Gephi : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
  • MongoDB : NoSQL database that’s highly & easily scalable without a steep learning curve.
  • CRUSH Tools by Google : Kicks up your command-line data munging.

Slopegraph As A Service

@adammontville posited that Figure 15 from this year’s DBIR could use some slopegraph love. As I am not one to back down from a reasonable challenge, I obliged.

Here’s the original chart (produced by @jayjacobs):

figure15-orig

and, here’s a very quick slopegraph version of it:

figure15-slope

You can click on both/either for a larger version. If I had more time, I could have made the slopegraph version nicer, but it conveys a story fairly well the way it is, especially with the highlight on the two biggest changes between 2008 & 2012.

Two problems with the modified visualization are (a) multi-column slopegraphs blend into a parallel coordinate or plain old line graph pretty quickly (thus, reducing their slopegraph-y goodness); and, (b) the diversity of the year-over-year DBIR data set makes the comparison between years almost pointless (as the DBIR itself points out).

I also generated a proper/traditional slopegraph, comparing 2008 to 2012:

figure15-true-slope

The visualization is far more compact and, if the goal was to show the change between 2008 and 2012, it provides a much clearer view of what has and has not changed.
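Picking which lines to highlight in a two-column slopegraph comes down to ranking the categories by how much they moved between the two years. A minimal sketch of that ranking follows; the category names and percentages are made up for illustration and are not DBIR figures, and `biggest_changes` is a hypothetical helper name.

```python
# Hypothetical percentages for two years; NOT figures from the DBIR.
first_year = {"category_a": 40, "category_b": 25, "category_c": 10, "category_d": 5}
last_year = {"category_a": 15, "category_b": 30, "category_c": 36, "category_d": 6}


def biggest_changes(start, end, n=2):
    """Return the n labels with the largest absolute change, largest first."""
    # Only compare labels that appear in both years.
    shared = [label for label in start if label in end]
    deltas = {label: end[label] - start[label] for label in shared}
    ranked = sorted(deltas, key=lambda label: abs(deltas[label]), reverse=True)
    return [(label, deltas[label]) for label in ranked[:n]]


# The two lines that would get the highlight colour in the slopegraph.
highlights = biggest_changes(first_year, last_year)
print(highlights)
```

Ranking on the absolute change keeps large drops and large rises on equal footing, which is usually what you want when the story is "what changed the most".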

SOURCE Boston (@SOURCEConf) Data Analysis & Visualization Talk Resources #srcbos13

Many thanks to all who attended the talk @jayjacobs & I gave at @SOURCEconf on Thursday, April 18, 2013. As promised, here are the slides which should be much less washed out than the projector version :-)

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at Coursera.

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up, so the bundle may duplicate some of the ones below.

People Mentioned

Tools Mentioned

  • R : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
  • RStudio : An amazing IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating reproducible research a joy with built-in easy access to tools like knitr.
  • iPython : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like iPython Notebooks for–again–reproducible research.
  • SecViz : Security-centric Visualization Site & Tools by @raffaelmarty
  • Mondrian : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
  • Tableau : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
  • Processing : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the Processing.js library.
  • D3 : The foundation of modern, data-driven visualization on the web.
  • Gephi : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
  • MongoDB : NoSQL database that’s highly & easily scalable without a steep learning curve.
  • CRUSH Tools by Google : Kicks up your command-line data munging.

PC Maker Slopegraph + More Functional Table (One Last Time)

Now that I’m back in the US and relaxing, I can take time for one final blather on the PC Maker Slopegraph post from earlier in the week.

Slopegraphs can be quite long depending on the increment between discrete entries (as I’ve pointed out before). You either need to do binning/rounding, change the scale or add some annotations to the chart to make up for the length. Binning/rounding seems to make the most sense since you can add a table for precision but give the reader a good view of what you’re trying to communicate in as compact a fashion as possible.
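The binning/rounding idea can be sketched in a few lines: round each value to a coarser step and let labels that land on the same rounded value share a row. The shipment numbers below are invented for illustration (they are not the figures from the Ars table), and `bin_for_slopegraph` is a hypothetical helper, not part of any slopegraph library.

```python
from collections import defaultdict

# Hypothetical shipment counts in thousands; not the figures from the Ars table.
shipments = {
    "maker_a": 11712.4,
    "maker_b": 11699.8,
    "maker_c": 9010.2,
    "maker_d": 6150.7,
}


def bin_for_slopegraph(values, step=1000):
    """Round each value to the nearest multiple of `step` and group labels
    that land on the same rounded value, so they can share one row."""
    rows = defaultdict(list)
    for label, value in values.items():
        rows[int(round(value / step) * step)].append(label)
    # Highest value first, matching the top-to-bottom order of a slopegraph.
    return sorted(rows.items(), reverse=True)


rows = bin_for_slopegraph(shipments)
for rounded, labels in rows:
    print(rounded, ", ".join(labels))
```

Four distinct values collapse into three rows here, which is where the compactness comes from; the exact values stay available in the accompanying table.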

I’ll, again, ask the reader, what tells you which PC maker is on top: this table:

Screen-Shot-2013-04-10-at-6.14.56-PM

or these slopegraphs:

PC Maker Shipments (in thousands, rounded to nearest thousand)
pcs

PC Maker Market Share (rounded to nearest %)
pcs-share

Labeled properly, the rounding makes for a much more compact chart and doesn’t detract from the message, especially when I also include a much prettier, quick precision reference via Google Fusion Tables:

(though the column sort feature seems a bit wonky for some reason…).

Given that the focus was on the top individual maker, the “Other” category is just noise, so excluding it is also not an issue. If we wanted to tell the story of how well individual makers are performing against that bucket of contenders or point-players, then we would include that data and use other visualizations to communicate whatever conclusions we want to lead the reader to.

Remember, data tables and visualizations should be there to help tell your story, not detract from it or require real work/effort to grok (unless you’re breaking new visualization ground, which is most definitely not happening in the Ars story).

Use @datawrapper For Slopegraphs

As I was poking at some new features Datawrapper announced recently, I noticed that it’s possible to make a pretty decent (though not perfect) slopegraph there. As an example, I ran one of the charts from my most recent blog post through it.

If they had an option to do away with the gray horizontal lines, it wouldn’t be a bad slopegraph at all. I’m not sure how it’d handle overlaps, but if you have some basic data and don’t feel like messing with my Python or R code (and don’t want to do any machinations in Excel), Datawrapper might not be a bad choice.
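On the overlap question: one common way to deal with labels that would collide is to sort them by position and push each one down until it clears its neighbour. The sketch below shows that general technique only; it is an assumption about how overlaps could be handled, not a description of what Datawrapper does, and `spread_labels` and the pixel values are hypothetical.

```python
def spread_labels(positions, min_gap):
    """Given desired y-positions keyed by label, return adjusted positions so
    neighbouring labels sit at least `min_gap` apart. Labels only ever move
    toward larger y, so their relative order is preserved."""
    adjusted = {}
    previous = None
    for label, y in sorted(positions.items(), key=lambda item: item[1]):
        if previous is not None and y - previous < min_gap:
            y = previous + min_gap
        adjusted[label] = y
        previous = y
    return adjusted


# Hypothetical pixel positions; three of the four labels would collide.
wanted = {"maker_a": 100, "maker_b": 104, "maker_c": 106, "maker_d": 160}
placed = spread_labels(wanted, min_gap=12)
print(placed)
```

The trade-off is that a nudged label no longer sits exactly at its value, so the connecting line should still end at the true position.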

Ugly Tables vs Slopegraphs : PC Maker Shipments & Marketshare

Andrew Cunningham (@IT_AndrewC) posted an article—If you make PCs and you’re not Lenovo, you might be in trouble—on the always #spiffy @arstechnica that had this horrid table in it:

Screen-Shot-2013-04-10-at-6.14.56-PM

That table was not made by Andrew (so, don’t blame him) but Ars graphics folk could have made the post a bit more readable.

I’m not going to bother making a prettier table (it’s possible, but table formatting is not the point of this post), but I am going to show two slopegraphs that communicate the point of the post (that Lenovo is sitting pretty) much better:

PC Maker Market Share
pcs-share


PC Maker Shipments (in thousands)
pcs

They’re a little long (a problem I’ve noted with slopegraphs previously) but I think they are much better at conveying the message intended by the story. I may try to tweak them a bit or even finish the D3 port of my slopegraph library when I’m back from Bahrain.

Interactive Slopegraphs With Processing

Given my obsession with slopegraphs, I’m not sure how I missed this post back in late February by @JeffClark that includes a very nicely executed interactive slopegraph on the global obesity problem. He used Processing & Processing JS to build the visualization and I think it illustrates how well animation/interaction and slopegraphs work together. It would be even spiffier if demographic & obesity details (perhaps even a dynamic map) were displayed as you select a country/region.

You can try your hand at an alternate implementation by grabbing the data and playing along at home.

Visualizing Risky Words — Part 2

This is a follow-up to my Visualizing Risky Words post. You’ll need to read that for context if you’re just jumping in now. Full R code for the generated images (which are pretty large) is at the end.

Aesthetics are the primary reason for using a word cloud, though one can pretty quickly recognize what words were more important on well crafted ones. An interactive bubble chart is a tad better as it lets you explore the corpus elements that contained the terms (a feature I have not added yet).

I would posit that a simple bar chart can be of similar use if one is trying to get a feel for overall word use across a corpus:

freq-bars
(click for larger version)

It’s definitely not as sexy as a word cloud, but it may be a better visualization choice if you’re trying to do analysis vs just make a pretty picture.

If you are trying to analyze a corpus, you might want to see which elements influenced the term frequencies the most, primarily to see if there were any outliers (i.e. strong influencers). With that in mind, I took @bfist’s corpus and generated a heat map from the top terms/keywords:

risk-hm
(click for larger version)

There are some stronger influencers, but there is a pattern of general, regular usage of the terms across each corpus component. This is to be expected for this particular set as each post is going to be talking about the same types of security threats, vulnerabilities & issues.

The R code below is fully annotated, but it’s important to highlight a few items in it and on the analysis as a whole:

  • The extra, corpus-specific stopword list : “week”, “report”, “security”, “weeks”, “tuesday”, “update”, “team” : was built after manually inspecting the initial frequency breakdowns and applying my own judgment about the usefulness (or lack thereof) of including those terms. I’m sure another iteration would add more (like “released” and “reported”). Your expert view needs to shape the analysis and, in most cases, that analysis is far from a static/one-off exercise.
  • Another judgment call was the choice of 0.7 in the removeSparseTerms(tdm, sparse=0.7) call. I started at 0.5 and worked up through 0.8, inspecting the results at each iteration. Playing around with that number and re-generating the heatmap might be an interesting exercise to perform (hint).
  • Same as the above for the choice of 10 in subset(tf, tf>=10). Tweak the value and re-do the bar chart vis!
  • The next step after the initial “ooh! ahh!” from a word cloud or even the above bar chart (though bar charts tend not to evoke emotional reactions) is to ask yourself “so what?”. There’s nothing inherently wrong with generating a visualization just to make one, but it’s way cooler to actually have a reason or a question in mind. One possible answer to a “so what?” for the bar chart is to take the high frequency terms and do a bigram/digraph breakdown on them and even do a larger cross-term frequency association analysis (both of which we’ll do in another post).
  • The heat map would be far more useful as a D3 visualization where you could select a tile and view the corpus elements with the term highlighted or even select a term on the Y axis and view an extract from all the corpus elements that make it up. That might make it to the TODO list, but no promises.
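To get a feel for what the two tunable numbers in the notes above do before touching the R code, here is a toy sketch in plain Python. It mimics the idea behind removeSparseTerms (keep a term only if its share of zero-count documents is below the threshold) and the tf>=10 cutoff; it is not the tm implementation, and the term counts are made up.

```python
# Toy term-document counts: term -> count per document (all values made up).
tdm = {
    "malware": [3, 2, 4, 1],
    "patch": [0, 1, 0, 2],
    "botnet": [0, 0, 0, 5],
    "exploit": [6, 4, 0, 3],
}


def remove_sparse_terms(matrix, sparse):
    """Keep terms whose share of zero-count documents is below `sparse`."""
    kept = {}
    for term, counts in matrix.items():
        empty_share = counts.count(0) / len(counts)
        if empty_share < sparse:
            kept[term] = counts
    return kept


def frequent_terms(matrix, min_freq):
    """Total each term across documents and keep totals >= min_freq."""
    totals = {term: sum(counts) for term, counts in matrix.items()}
    return {term: total for term, total in totals.items() if total >= min_freq}


# Walk the sparsity threshold upward and watch more terms survive.
for threshold in (0.5, 0.6, 0.7, 0.8):
    print(threshold, sorted(remove_sparse_terms(tdm, threshold)))

# The frequency cutoff works on row totals of the full (unfiltered) matrix.
print(sorted(frequent_terms(tdm, 10).items()))
```

A higher sparse value lets rarer terms through (more heat map rows), while a higher frequency cutoff trims the bar chart; the two knobs are independent of each other.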

I deliberately tried to make this as simple as possible for those new to R to show how straightforward and brief text corpus analysis can be (there are fewer than 20 lines of code excluding library imports, whitespace, comments and the unnecessary expansion of some of the tm function calls that could have been combined into one). Furthermore, this is really just a basic demonstration of tm package functionality. The post/code is also aimed pretty squarely at the information security crowd as we tend to not like examples that aren’t in our domain. Hopefully it makes a good starting point for folks and, as always, questions/comments are heartily encouraged.

# need this NOAWT setting if you're running it on Mac OS; doesn't hurt on others
Sys.setenv(NOAWT=TRUE)
library(ggplot2)
library(ggthemes)
library(tm)
library(Snowball) 
library(RWeka) 
library(reshape)
 
# read in the raw corpus text
# you could read directly from @bfist's source : http://l.rud.is/10tUR65
a = readLines("intext.txt")
 
# convert raw text into a Corpus object
# each line will be a different "document"
c = Corpus(VectorSource(a))
 
# clean up the corpus (function calls are obvious)
c = tm_map(c, tolower)
c = tm_map(c, removePunctuation)
c = tm_map(c, removeNumbers)
 
# remove common stopwords
c = tm_map(c, removeWords, stopwords())
 
# remove custom stopwords (I made this list after inspecting the corpus)
c = tm_map(c, removeWords, c("week","report","security","weeks","tuesday","update","team"))
 
# perform basic stemming : background: http://l.rud.is/YiKB9G
# save original corpus
c_orig = c
 
# do the actual stemming
c = tm_map(c, stemDocument)
c = tm_map(c, stemCompletion, dictionary=c_orig)
 
# create term document matrix : http://l.rud.is/10tTbcK : from corpus
tdm = TermDocumentMatrix(c, control = list(minWordLength = 1))
 
# remove the sparse terms (requires trial->inspection cycle to get sparse value "right")
tdm.s = removeSparseTerms(tdm, sparse=0.7)
 
# we'll need the TDM as a matrix
m = as.matrix(tdm.s)
 
# datavis time
 
# convert matrix to data frame
m.df = data.frame(m)
 
# quick hack to make keywords - which got stuck in row.names - into a variable
m.df$keywords = rownames(m.df)
 
# "melt" the data frame ; ?melt at R console for info
m.df.melted = melt(m.df)
 
# not necessary, but I like decent column names
colnames(m.df.melted) = c("Keyword","Post","Freq")
 
# generate the heatmap
hm = ggplot(m.df.melted, aes(x=Post, y=Keyword)) + 
  geom_tile(aes(fill=Freq), colour="white") + 
  scale_fill_gradient(low="black", high="darkorange") + 
  labs(title="Major Keyword Use Across VZ RISK INTSUM 202 Corpus") + 
  theme_few() +
  theme(axis.text.x  = element_text(size=6))
ggsave(plot=hm,filename="risk-hm.png",width=11,height=8.5)
 
# not done yet
 
# better? way to view frequencies
# sum rows of the tdm to get term freq count
tf = rowSums(as.matrix(tdm))
# we don't want all the words, so choose ones with 10+ freq
tf.10 = subset(tf, tf>=10)
 
# wimping out and using qplot so I don't have to make another data frame
bf = qplot(names(tf.10), tf.10, geom="bar") + 
  coord_flip() + 
  labs(title="VZ RISK INTSUM Keyword Frequencies", x="Keyword",y="Frequency") + 
  theme_few()
ggsave(plot=bf,filename="freq-bars.png",width=8.5,height=11)
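The bigram breakdown mentioned in the notes above is left for another post, but as a rough idea of what it involves: count adjacent word pairs that include one of the high-frequency terms. This is a stdlib-only Python sketch under simplifying assumptions (lowercasing and whitespace splitting only, no stopword removal or stemming); the sample sentences are invented and `bigram_counts` is a hypothetical helper.

```python
from collections import Counter


def bigram_counts(documents, focus_terms):
    """Count adjacent word pairs that contain at least one focus term."""
    counts = Counter()
    for text in documents:
        words = text.lower().split()
        # zip pairs each word with its right-hand neighbour.
        for left, right in zip(words, words[1:]):
            if left in focus_terms or right in focus_terms:
                counts[(left, right)] += 1
    return counts


# Invented sample "documents"; not taken from the corpus used in the post.
docs = [
    "new malware campaign targets banks",
    "banks report malware campaign",
    "patch released for browser",
]
pairs = bigram_counts(docs, focus_terms={"malware"})
for pair, count in pairs.most_common():
    print(" ".join(pair), count)
```

Pairs that recur across documents hint at the phrases behind a high bar in the frequency chart.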

Follow up/Resources :: GRC-T18 – Data Analysis and Visualization for Security Professionals #RSAC

Many thanks to all who attended the talk @jayjacobs & I gave at RSA on Tuesday, February 26, 2013. It was really great to be able to talk to so many of you afterwards as well.

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at Coursera.

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up, so the bundle may duplicate some of the ones below.

People Mentioned

Tools Mentioned

  • R : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
  • RStudio : An amazing IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating reproducible research a joy with built-in easy access to tools like knitr.
  • iPython : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like iPython Notebooks for–again–reproducible research.
  • Mondrian : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
  • Tableau : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
  • Processing : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the Processing.js library.
  • D3 : The foundation of modern, data-driven visualization on the web.
  • Gephi : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
  • MongoDB : NoSQL database that’s highly & easily scalable without a steep learning curve.
  • CRUSH Tools by Google : Kicks up your command-line data munging.