
Category Archives: DataVis

Earlier this evening, I somewhat half-heartedly challenged @jayjacobs that he & I should be generating one data visualization per day. I didn’t specify anything else (well, at least nothing I can disclose publicly, for now) but I think I’m going to try to formalize the ‘rules’ a bit before I get some shut-eye:

– The datavis _must_ be posted to either one of our blogs (i.e. it and the data behind it must be shareable). Alternative: we set up a blog just for this.
– The data behind the datavis _must_ also be public data and either referenced or published with the datavis.
– The datavis _must_ answer a question. No random generation of numbers for a lazy bar chart, etc. Said question must be posed with the datavis and (hopefully) accompanied by a short story/explanation in the blog post.
– The datavis cannot be a blatant repeat of a previous datavis.
– The datavis does not have to break new ground (i.e. bar charts are #spiffy).
– The datavis _must_ be open for comments.
– There are no restrictions on what tools/languages can be used (i.e. Jay can cheat and make Tableau robocharts).
– There are no restrictions on the type of data being analyzed & visualized. Ideally, it will be from infosec or IT, but restricting it to those areas might make the challenge more difficult (the ‘public’ bit).

I’ll sleep on that and, perhaps, reduce the requirement to one per week after talking to Jay again this week.

Your thoughts & input on this challenge are most welcome in the comments, especially if you want to suggest things we can visualize. Also, feel free to volunteer to join us in this, once we start it.

Now that I’m back in the US and relaxing, I can take time for one final blather on the [PC Maker Slopegraph](http://rud.is/b/2013/04/11/ugly-tables-vs-slopegraphs-pc-maker-shipments-marketshare/) post from earlier in the week.

Slopegraphs can be quite long depending on the increment between discrete entries (as I’ve [pointed out before](http://rud.is/b/2012/06/07/slopegraphs-in-python-exploring-binningrounding/)). You either need to do binning/rounding, change the scale or add some annotations to the chart to make up for the length. Binning/rounding seems to make the most sense since you can add a table for precision but give the reader a good view of what you’re trying to communicate in as compact a fashion as possible.
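The binning/rounding step is simple enough to sketch. Here’s a minimal Python version, assuming a dict of maker → (start, end) values; the maker names, figures, and label-merging approach below are my own illustration, not the charts’ actual code or data:

```python
def bin_slopegraph(values, bin_size=1000):
    """Round each endpoint to the nearest bin and merge labels that
    collide at the same rounded value, so rows can be drawn compactly."""
    rounded = {name: (round(s / bin_size), round(e / bin_size))
               for name, (s, e) in values.items()}
    left, right = {}, {}
    for name, (s, e) in rounded.items():
        left.setdefault(s, []).append(name)   # labels on the left axis
        right.setdefault(e, []).append(name)  # labels on the right axis
    # Each rounded value maps to the (possibly merged) label drawn at that row.
    merge = lambda axis: {v: ", ".join(sorted(names)) for v, names in axis.items()}
    return merge(left), merge(right)
```

Two makers landing on the same rounded value simply share a row label (e.g. `{13: "A, B"}`), which is exactly the compactness-for-precision trade the table then backfills.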

I’ll ask the reader again: which tells you at a glance which PC maker is on top? This table:

[screenshot of the table]

or these slopegraphs:

PC Maker Shipments (in thousands, rounded to nearest thousand)
[slopegraph image]

PC Maker Market Share (rounded to nearest %)
[slopegraph image]

Labeled properly, the rounding makes for a much more compact chart and doesn’t detract from the message, especially when I also include a much prettier, quick precision reference via Google Fusion Tables:

[embedded Fusion Tables view]

(though the column sort feature seems a bit wonky for some reason…).

Given that the focus was on the top individual maker, the “Other” category is just noise, so excluding it is also not an issue. If we wanted to tell the story of how well individual makers are performing against that bucket of contenders or point-players, then we would include that data and use other visualizations to communicate whatever conclusions we want to lead the reader to.

Remember, data tables and visualizations should be there to help tell your story, not detract from it or require real work/effort to grok (unless you’re breaking new visualization ground, which is most definitely not happening in the Ars story).

While poking at some new features they announced recently, I noticed it’s possible to make a pretty decent (if not perfect) slopegraph over at [Datawrapper](http://datawrapper.de/). I ran one of the charts from my [most recent](http://rud.is/b/2013/04/11/ugly-tables-vs-slopegraphs-pc-maker-shipments-marketshare/) blog post through it as an example.

If they had an option to do away with the gray horizontal lines, it wouldn’t be a bad slopegraph at all. I’m not sure how it’d handle overlaps, but if you have some basic data and don’t feel like messing with my Python or R code (and don’t want to do any machinations in Excel), Datawrapper might not be a bad choice.

Andrew Cunningham (@IT_AndrewC) posted an article—If you make PCs and you’re not Lenovo, you might be in trouble—on the always #spiffy @arstechnica that had this horrid table in it:

[screenshot of the table]

That table was not made by Andrew (so, don’t blame him) but Ars graphics folk *could* have made the post a bit more readable.

I’m not going to bother making a prettier table (it’s possible, but table formatting is not the point of this post), but I am going to show two slopegraphs that communicate the point of the post (that Lenovo is sitting pretty) much better:

PC Maker Market Share
[slopegraph image]

PC Maker Shipments (in thousands)
[slopegraph image]

They’re a little long (a problem I’ve noted with slopegraphs previously) but I think they are much better at conveying the message intended by the story. I may try to tweak them a bit or even finish the D3 port of my slopegraph library when I’m back from Bahrain.

The basic technique of cybercrime statistics—measuring the incidence of a given phenomenon (DDoS, trojan, APT) as a percentage of overall population size—had entered the mainstream of cybersecurity thought only in the previous decade. Cybersecurity as a science was still in its infancy, as many of its basic principles had yet to be established.

At the same time, the scientific method rarely intersected with the development and testing of new detection & prevention regimens. When you read through that endless stream of quack cybercures published daily on the Internet and at conferences like RSA, what strikes you most is not that they are all, almost without exception, based on anecdotal or woefully thin evidence. What’s striking is that they never apologize for the shortcoming. They never pause to say, “Of course, this is all based on anecdotal evidence, but hear me out.” There’s no shame in these claims, no awareness of the imperfection of the methods, precisely because it seems so eminently reasonable that the local observation of a handful of minuscule cases might serve as the silver bullet for cybercrime, if you look hard enough.


But, cybercrime couldn’t be studied in isolation. It was as much a product of the internet expansion as news and social media, where it was so uselessly anatomized. To understand the beast, you needed to think on the scale of the enterprise, from the hacker’s-eye view. You needed to look at the problem from the perspective of Henry Mayhew’s balloon. And you needed a way to persuade others to join you there.

Sadly, that’s not a modern story. It’s an adapted quote from chapter 4 (pp. 97-98, paperback) of The Ghost Map, by Steven Johnson, a book on the cholera epidemic of 1854.

I won’t ruin the book nor continue my attempt at analogy any further. Suffice it to say, you should read the book—if you haven’t already—and join me in calling out for the need for the John Snow of our cyber-time to arrive.

Far too many interesting bits to spam on Twitter individually but each is worth getting the word out on:

[π Day image]

– It’s [π Day](https://www.google.com/search?q=pi+day)*
– Unless you’re living in a hole, you probably know that [Google Reader is on a death march](http://www.bbc.co.uk/news/technology-21785378). I’m really liking self-hosting [Tiny Tiny RSS](https://github.com/gothfox/Tiny-Tiny-RSS) so far, and will follow up with a standalone blog post on it.
– @jayjacobs & I are speaking soon at both [SOURCE Boston](http://www.sourceconference.com/boston/) & [Secure360](http://secure360.org/). We hope to have time outside the talk to discuss security visualization & analysis with you, so go on and register!
– Speaking of datavis, [VizSec 2013](http://www.vizsec.org/) is on October 14th in Atlanta, GA this year and the aligned [VAST 2013 Challenge](http://vacommunity.org/VAST+Challenge+2013) (via @ieeevisweek) will have another infosec-themed mini-challenge. Watch that space, grab the data and start analyzing & visualizing!
– If you’re in or around Boston, it’d be great to meet you at @openvisconf on May 16-17.
– Majestic SEO has released a new data set for folks to play with: [a list of the top 100,000 websites broken down by the country in which they are hosted](http://blog.majesticseo.com/research/top-websites-by-country/). It requires a free reg to get the link, but no spam so far. (h/t @teamcymru)
– And, finally, @MarketplaceAPM aired a [good, accessible story](http://www.marketplace.org/topics/tech/who-pays-bill-cyber-war) on “Who pays the bill for a cyber war?”. I always encourage infosec industry folk to listen to shows like Marketplace to see how we sound to non-security folk and to glean ways we can communicate better with the people we are ultimately trying to help.

*Image via davincismurf

This is a fourth post in my [Visualizing Risky Words](http://rud.is/b/2013/03/06/visualizing-risky-words/) series. You’ll need to read starting from that link for context if you’re just jumping in now.

I was going to create a rudimentary version of an interactive word tree for this, but the extremely talented @jasondavies (I marvel especially at his cartographic work) just posted what is probably the best online [word tree generator](https://www.jasondavies.com/wordtree/) ever made…and in D3 no less.

[word tree screenshot]

A word tree is a “visual interactive concordance” and was created back in 2007 by Martin Wattenberg and Fernanda Viégas. You can [read more about](http://hint.fm/projects/wordtree/) this technique on your own, but a good summary (from their site) is:

> A word tree is a visual search tool for unstructured text, such as a book, article, speech or poem. It lets you pick a word or phrase and shows you all the different contexts in which it appears. The contexts are arranged in a tree-like branching structure to reveal recurrent themes and phrases.

I pasted the VZ RISK INTSUM texts into Jason’s tool so you could investigate the corpus to your heart’s content. I would suggest exploring “patch”, “vulnerability”, “adobe”, “breach” & “malware” (for starters).

Jason’s implementation is nothing short of beautiful. He uses SVG text tspans to make the individual text elements not just selectable but easily scalable with browser window resize events.

[screenshot]

The actual [word tree D3 javascript code](http://www.jasondavies.com/wordtree/wordtree.js?20130312.1) shows just how powerful the combination of the language and @mbostock’s library is. He has, in essence, built a completely cross-platform tokenizer and interactive visualization tool in ~340 lines of javascript. Working your way through that code to a full understanding will really help improve your D3 skills.
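To get a feel for what a tokenize-and-branch pass involves before diving into Jason’s javascript, here’s a toy Python sketch of the word-tree idea (nothing like his full implementation, just the core structure):

```python
def word_tree(text, root, depth=3):
    """Toy word tree: for every occurrence of `root`, collect the next
    `depth` words, merging shared prefixes into a nested dict so that
    recurring phrasings branch from common stems."""
    tokens = text.lower().split()
    tree = {}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        node = tree
        # Walk/extend the branch for the words that follow this occurrence.
        for nxt in tokens[i + 1: i + 1 + depth]:
            node = node.setdefault(nxt, {})
    return tree
```

Feeding it a phrase like `"patch now patch later patch now and often"` with root `"patch"` yields a nested dict where `"now"` branches into both `"patch"` and `"and"` — the same merge-common-prefixes behavior the visual tool renders as branches.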

The DST changeover in the US has made today a fairly strange one, especially when combined with a very busy non-computing day yesterday. That strangeness manifested as a need to take the D3 heatmap idea mentioned in the [previous post](http://rud.is/b/2013/03/09/visualizing-risky-words-part-2/) and actually (mostly) implement it. Folks just coming to this thread may want to start with the [first post](http://rud.is/b/2013/03/06/visualizing-risky-words/) in the series.

I did a quick extraction of the R TermDocumentMatrix with nested for loops and then extracted the original texts of the corpus and put them into some javascript variables along with some D3 code to show how to do a [rudimentary interactive heatmap](http://rud.is/d3/vzwordshm/).
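If you’d rather do that flattening step outside of R, a rough Python equivalent of the nested-loop extraction might look like this (the record shape — term/doc/count — is my choice for illustration, not necessarily what the demo page uses):

```python
import json

def tdm_to_json(terms, docs, matrix):
    """Flatten a term-document matrix (terms x docs, as a list of lists)
    into the row/col/value records a D3 heatmap can bind to."""
    records = [
        {"term": terms[r], "doc": docs[c], "count": matrix[r][c]}
        for r in range(len(terms))
        for c in range(len(docs))
        if matrix[r][c] > 0  # skip empty tiles to keep the page light
    ]
    return json.dumps(records)
```

The resulting JSON string can be pasted straight into a javascript variable in the HTML file, which is essentially what the one-big-document approach below does.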

[screenshot of the heatmap: click through for the live demo]

As you mouse over each of the tiles they will be highlighted and the word/document will be displayed along with the frequency count. Click on the tile and the text will appear with the highlighted word.

Some caveats:

– The heatmap looks a bit different from the R one in part 2 as the terms/keywords are in alphabetical order.
– There are no column or row headers. I won’t claim anything but laziness, though the result does look a bit cleaner this way.
– I didn’t bother extracting the stemming results (it’s kind of a pain), so not all of the keywords will be highlighted when you select a tile.
– It’s one, big HTML document complete with javascript & data. That’s not a recommended practice in production code but it will make it easier for folks to play around with.
– The HTML is not commented well, but there’s really not much of it. The code that makes the heatmap is pretty straightforward if you have even a rough familiarity with D3. Basically: enumerate the data, generate a colored tile for each row/col entry, then map click & mouseover/out events to each generated SVG element.
– I also use jQuery in the code, but that’s not a real dependency. I just like using jQuery selectors for non-D3 graphics work. It bulks up the document, so use them together wisely. NOTE: If you don’t want to use jQuery, you’ll need to change the selector code.
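The enumerate-and-color step is easy to sketch outside of D3 as well. Here’s a rough Python version of the same logic (the tile size, palette, and linear bucketing are made up for illustration; the demo page does this with D3 scales instead):

```python
def heatmap_tiles(matrix, palette):
    """Walk every row/col cell of a frequency matrix, bucket its count
    into one of the palette's colors, and emit the attributes each SVG
    rect would get (x/y position and fill)."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero on a constant matrix
    tiles = []
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            bucket = min(int((v - lo) / span * len(palette)), len(palette) - 1)
            tiles.append({"x": c * 12, "y": r * 12, "fill": palette[bucket]})
    return tiles
```

In the real page, each of these dicts becomes an SVG `rect` with the click/mouseover handlers attached; only the color-scale and positioning arithmetic lives here.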

Drop a note in the comments if you do use the base example and improve on it or if you have any questions on the code. For those interested, I’ll be putting all the code from the “Visualizing Risky Words” posts into a github repository at the end of the series.