
Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

We infosec folk eat up industry reports and most of us have no doubt already gobbled up @panda_security’s recently released [Q1 2013 Report](http://press.pandasecurity.com/wp-content/uploads/2010/05/PandaLabs-Quaterly-Report.pdf) [PDF]. It’s a good read (so go ahead and read it, we’ll still be here!) and I was really happy to see a nicely stylized chart in the early pages:

Screenshot_5_24_13_8_14_AM

However, I quickly became a #sadpanda when I happened across some explosive 3D pie charts later on. Rather than deride, I thought a re-imagining would be a better use of time and let you decide which visualizations both communicate better and are more appealing.

I chose to use @Datawrapper to showcase how easy it is to build and publish pleasing and informative visualizations without even leaving your browser.

Figure 4, Original:

Panda Labs Q1 2013 Report Fig 5 (Orig)

Figure 4, Alternative:

Figure 5, Original

Fig 4: New malware strains In Q1 2013, by Type (orig)

Figure 5, Alternative (horizontal vs vertical, just to mix it up a bit):

If the charts had been closer together in the report, I would have opted for vertical design for both and probably kept malware-type ordering vs sort by highest percentage.

How would you re-imagine the pie charts? Post a link to your creations in the comments and I’ll make sure they show up embedded with the post.

Many thanks to all who attended the talk @jayjacobs & I gave at @Secure360 on Wednesday, May 15, 2013. As promised, here are the [slides](https://dl.dropboxusercontent.com/u/43553/Secure360-2013.pdf).

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can vi[sz] along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up, and the bundle may duplicate some of the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices.
– [Nathan Yau’s Flowing Data blog](http://flowingdata.com/) : Making visualization accessible, practical and repeatable.
– [Data Stories Podcast](http://datastori.es/) : Yes, you can learn much about data visualization from an audio podcast (@datastories)
– [storytelling with data](http://www.storytellingwithdata.com/) (@storywithdata) : Extremely practical blog by Cole Nussbaumer that will especially help folks “stuck” in Excel
– [Jay’s blog](http://beechplane.wordpress.com/)
– [My {this} blog](http://rud.is/b)

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [SecViz](http://secviz.org/) : Security-centric Visualization Site & Tools by @raffaelmarty
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help visual artists and designers create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

@adammontville [posited](http://www.tripwire.com/state-of-security/it-security-data-protection/quick-thoughts-on-verizons-dbir-and-20-critical-security-control-mappings/) that Figure 15 from this year’s [DBIR](http://www.verizonenterprise.com/DBIR/2013/) could use some slopegraph love. As I am not one to back down from a reasonable challenge, I obliged.

Here’s the original chart (produced by @jayjacobs):

figure15-orig

and, here’s a _very_ _quick_ slopegraph version of it:

figure15-slope

You can click on both/either for a larger version. If I had more time, I could have made the slopegraph version nicer, but it conveys a story fairly well the way it is, especially with the highlight on the two biggest changes between 2008 & 2012.

Two problems with the modified visualization are (a) multi-column slopegraphs blend into a [parallel coordinate](http://www.juiceanalytics.com/writing/parallel-coordinates/) or plain old line graph pretty quickly (thus, reducing their slopegraph-y goodness); and, (b) the diversity of the year-over-year DBIR data set makes the comparison between years almost pointless (as the DBIR itself points out).

I also generated a proper/traditional slopegraph, comparing 2008 to 2012:

figure15-true-slope

The visualization is far more compact and, if the goal was to show the change between 2008 and 2012, it provides a much clearer view of what has and has not changed.
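For anyone who wants to prep data for a two-column slopegraph like this, here is a minimal Python sketch (not the code used for the charts above). It pairs up the two years, computes the change for each category, and flags the biggest movers so they can be highlighted. The `slope_rows` helper, the category names, and the numbers are all hypothetical placeholders, not DBIR figures.

```python
def slope_rows(left, right, highlight=2):
    """Pair two slopegraph columns and flag the largest absolute changes.

    `left` and `right` map a category name to its value in each year.
    Returns tuples of (name, left_value, right_value, delta, is_highlighted).
    """
    # Only categories present in both years can be joined by a line
    shared = [name for name in left if name in right]
    rows = [(name, left[name], right[name], right[name] - left[name])
            for name in shared]
    # Pick the `highlight` rows with the largest change in either direction
    biggest = sorted(rows, key=lambda r: abs(r[3]), reverse=True)[:highlight]
    flagged = {r[0] for r in biggest}
    return [(name, a, b, delta, name in flagged) for name, a, b, delta in rows]


# Hypothetical placeholder values (percentages), NOT taken from the DBIR
y2008 = {"Category A": 10, "Category B": 35, "Category C": 20, "Category D": 50}
y2012 = {"Category A": 40, "Category B": 30, "Category C": 22, "Category D": 15}

rows = slope_rows(y2008, y2012)
for name, a, b, delta, hot in rows:
    marker = "*" if hot else " "
    print(f"{marker} {name}: {a} -> {b} ({delta:+d})")
```

Each tuple becomes one line on the chart: the left and right values are the endpoints, and the flag decides which lines get the highlight color.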

For those who wanted to play along at home, I’ve cleaned up the text and made the Wait Wait…Don’t Pwn Me! closing segment of SOURCE Boston 2013 available for download [PDF]. The video crew had cameras running, so keep checking the @SOURCEconf web site as it’ll probably get posted as they crank through all of the conference session videos (give them time, tho, as there are a ton of vids to process).

I also wanted to, again, thank @selenakyle for her most excellent job playing Carl Kasell; the awesome panelists: @451Wendy, @innismir & @andrewsmhay; @joshcorman for—yet again—putting up with me picking on him (and getting all the questions right); and our volunteers: @ra6bit, @Gmanfunky (and three more who I need Twitter handles from :-).

I only hope that @petersagal & the WWDTM crew can forgive me if they ever read the transcript or view the video of the segment.

Many thanks to all who attended the talk @jayjacobs & I gave at @SOURCEconf on Thursday, April 18, 2013. As promised, here are the [slides](https://dl.dropboxusercontent.com/u/43553/SOURCE-Boston-2013.pdf) which should be much less washed out than the projector version :-)

We’ve enumerated quite a bit of non-slide-but-in-presentation information that we wanted to aggregate into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up, and the bundle may duplicate some of the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices.
– [Nathan Yau’s Flowing Data blog](http://flowingdata.com/) : Making visualization accessible, practical and repeatable.
– [Jay’s blog](http://beechplane.wordpress.com/)
– [My {this} blog](http://rud.is/b)

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful to not use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [iPython](http://ipython.org/) : This version of Python takes an already amazing language and kicks it up a few notches. It brings it up to the level of R+RStudio, especially with its knitr-like [iPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [SecViz](http://secviz.org/) : Security-centric Visualization Site & Tools by @raffaelmarty
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help visual artists and designers create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

Earlier this evening, I somewhat half-heartedly challenged @jayjacobs that he & I should be generating one data visualization per day. I didn’t specify anything else (well, at least that I can disclose publicly, for now) but I think I’m going to try to formalize a bit of the ‘rules’ before I get some shut-eye:

– The datavis _must_ be posted to either one of our blogs (i.e., it and the data behind it must be shareable). Alternative: we set up a blog just for this.
– The data behind the datavis _must_ also be public data and either referenced or published with the datavis.
– The datavis _must_ answer a question. No random generation of numbers for a lazy bar chart, etc. Said question must be posed with the datavis and (hopefully) a bit of a short story/explanation with it and the datavis in the blog post.
– The datavis cannot be a blatant repeat of a previous datavis.
– The datavis does not have to break new ground (i.e. bar charts are #spiffy).
– The datavis _must_ be open for comments.
– There are no restrictions on what tools/languages can be used (i.e. Jay can cheat and make Tableau robocharts).
– There are no restrictions on the type of data being analyzed & visualized. Ideally, it will be from infosec or IT, but restricting it to those areas might make the challenge more difficult (the ‘public’ bit).

I’ll sleep on that and, perhaps, reduce the requirement to one per week after talking to Jay again this week.

Your thoughts & input on this challenge are most welcome in the comments, especially if you want to suggest things we can visualize. Also, feel free to volunteer to join us in this, once we start it.

Now that I’m back in the US and relaxing, I can take time for one final blather on the [PC Maker Slopegraph](http://rud.is/b/2013/04/11/ugly-tables-vs-slopegraphs-pc-maker-shipments-marketshare/) post from earlier in the week.

Slopegraphs can be quite long depending on the increment between discrete entries (as I’ve [pointed out before](http://rud.is/b/2012/06/07/slopegraphs-in-python-exploring-binningrounding/)). You either need to do binning/rounding, change the scale or add some annotations to the chart to make up for the length. Binning/rounding seems to make the most sense since you can add a table for precision but give the reader a good view of what you’re trying to communicate in as compact a fashion as possible.
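To make the binning/rounding idea concrete, here is a minimal Python sketch (not the code behind the charts in this post). It rounds each value to the nearest multiple of a step and groups labels that land in the same bin, so they can share one row on the chart. The `bin_values` helper, the maker names, and the numbers are hypothetical placeholders.

```python
from collections import defaultdict


def bin_values(values, step):
    """Round each value to the nearest multiple of `step`, grouping labels per bin.

    Returns a list of (binned_value, [labels]) sorted from highest to lowest,
    which matches the top-to-bottom reading order of a slopegraph column.
    """
    bins = defaultdict(list)
    for label, value in values.items():
        # Note: Python's round() sends exact .5 ties to the nearest even integer
        binned = int(round(value / step)) * step
        bins[binned].append(label)
    return sorted(bins.items(), reverse=True)


# Hypothetical placeholder shipments (in thousands), NOT figures from the post
shipments = {"Maker A": 11700, "Maker B": 11650, "Maker C": 6200, "Maker D": 4400}

rows = bin_values(shipments, step=1000)
for value, labels in rows:
    print(f"{value:>6}  {', '.join(labels)}")
```

Makers that collapse into the same bin end up on one line, which is what keeps the chart compact; the exact values can then live in a companion table.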

I’ll, again, ask the reader, what tells you which PC maker is on top: this table:

Screen-Shot-2013-04-10-at-6.14.56-PM

or these slopegraphs:

PC Maker Shipments (in thousands, rounded to nearest thousand)
pcs

PC Maker Market Share (rounded to nearest %)
pcs-share

Labeled properly, the rounding makes for a much more compact chart and doesn’t detract from the message, especially when I also include a much prettier, quick precision reference via Google Fusion Tables:

(though the column sort feature seems a bit wonky for some reason…).

Given that the focus was on the top individual maker, the “Other” category is just noise, so excluding it is also not an issue. If we wanted to tell the story of how well individual makers are performing against that bucket of contenders or point-players, then we would include that data and use other visualizations to communicate whatever conclusions we want to lead the reader to.

Remember, data tables and visualizations should be there to help tell your story, not detract from it or require real work/effort to grok (unless you’re breaking new visualization ground, which is most definitely not happening in the Ars story).

While poking at some new features [Datawrapper](http://datawrapper.de/) announced recently, I noticed that it’s possible to make a pretty decent (if not perfect) slopegraph there. As an example, I ran one of the charts from my [most recent](http://rud.is/b/2013/04/11/ugly-tables-vs-slopegraphs-pc-maker-shipments-marketshare/) blog post through it.

If they had an option to do away with the gray horizontal lines, it wouldn’t be a bad slopegraph at all. I’m not sure how it’d handle overlaps, but if you have some basic data and don’t feel like messing with my Python or R code (and don’t want to do any machinations in Excel), Datawrapper might not be a bad choice.