
Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

[@hrbrmstr starts working in javascript again]
The Internets: What do you think?
@hrbrmstr: It’s vile.
The Internets: I know. It’s so bubbly and cloying and happy.
@hrbrmstr: Just like the Federation.
The Internets: And you know what’s really frightening? If you develop with it enough, you begin to like it.
@hrbrmstr: It’s insidious.
The Internets: Just like the Federation.

(With apologies to ST:DS9)

UPDATE: It seems my use of <script async> optimization for Raphaël busted the inline slopegraph generation. Will work on tweaking the example posts to wait for Raphaël to load when I get some time.

So, I had to alter this to start after a user interaction. It loaded fine as a static, local page but seems to get a bit wonky embedded in a complex page. I also see some artifacts in Chrome but not in Safari. Still, not a bad foray into basic animation.

Animate Slopegraph


There were enough eye-catching glitches in the experimental javascript support and the ugly large-number display in the spam example post that I felt compelled to make a couple formatting tweaks in the code. I also didn’t have time to do “real” work on the codebase this weekend.

So, along with spacing adjustments, there’s now an “add_commas” non-mandatory option that will toss commas in large numbers so they’re easy to read. Here’s an example of the new output (both the Raphaël display and commas):
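For anyone curious how little it takes to get comma-separated numbers in Python, here is a minimal sketch. The helper name `add_commas` simply mirrors the option name; it is an illustration, not the actual PySlopegraph code, and it assumes Python 2.7 or later for the `,` format specifier.

```python
def add_commas(value):
    """Return the number as a string with thousands separators."""
    # The "," format specifier inserts a comma every three digits.
    return format(value, ",")

print(add_commas(1234567))
print(add_commas(1234567.5))
print(add_commas(999))
```

Numbers below 1,000 pass through unchanged, so the option is safe to leave on for mixed data.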


As usual, it’s up on github

Not much progress over the weekend on my latest obsession (been busy enjoying some non-rainy days here in Maine). So, here are some other slopegraph implementations/resources I’ve found through mining the internets:

In preparation for the upcoming 1.0 release, and with the hope of laying a foundation for more interactive slopegraphs, I threw together some rudimentary Raphaël output support over lunch today. All you have to do is generate a new slopegraph with the “js” output type and include the salient portions of the generated html/css/javascript in a web page (along with the Raphaël script itself).
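To give a feel for what a “js” output type involves, here is a rough sketch of a Python function that emits Raphaël drawing calls for already-positioned slope lines. The function name, its arguments and the layout numbers are all hypothetical and the real generator's output will differ; the only Raphaël calls used are `Raphael(container, width, height)`, `paper.path()`, `paper.text()` and `.attr()`.

```python
import json

def raphael_slopegraph_js(container_id, width, height, rows):
    """Build JavaScript that draws one line and one left-hand label per row.

    rows is a list of (label, x1, y1, x2, y2) tuples in pixel coordinates.
    json.dumps is used to produce safely quoted JavaScript string literals.
    """
    js = ["var paper = Raphael(%s, %d, %d);"
          % (json.dumps(container_id), width, height)]
    for label, x1, y1, x2, y2 in rows:
        # SVG-style path string: move to the left point, line to the right.
        js.append('paper.path("M%d %dL%d %d").attr({"stroke": "#222222"});'
                  % (x1, y1, x2, y2))
        # Right-align the label just to the left of the line's start.
        js.append('paper.text(%d, %d, %s).attr({"text-anchor": "end"});'
                  % (x1 - 5, y1, json.dumps(label)))
    return "\n".join(js)

print(raphael_slopegraph_js("slope", 400, 200, [("Sweden", 100, 50, 300, 80)]))
```

The printed script would go inside a `<script>` element on a page that has already loaded Raphaël and contains an element with the matching id.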

The next github push will have this update. Here’s an example of the output, using the classic Tufte example chart:


It’s definitely a bit rough around the edges (my eyes immediately fixate upon spacing discrepancies) and lacking any interactivity, but the basic building blocks are in place. It also does not render on my Android phone (HTC Incredible 2) but it does render in Chrome, Safari & on my iPad. Embedding a Raphaël graphic in a web page will definitely have advantages over a PNG or PDF in most situations even if it’s not interactive, so I’ll probably keep the support in regardless of whether I continue to improve upon it.

As I was playing with the code, I kept thinking how neat it would be if there was a “Raphaël Cairo surface” option. Perhaps that will be a side project if all goes well, since it would not be that much more complicated (in fact, it may be less complicated) than the Cairo SVG surface code.

Given the focus on actual development of the PySlopegraph tool in most of the blog posts of late, folks may be wondering why an infosec/inforisk guy is obsessing so much on a tool and not talking security. Besides the fixation on filling a void and promoting an underused visualization tool, I do believe there is a place for slopegraphs in infosec data analysis and will utilize some data from McAfee’s recent Q1 2012 Threat Report [PDF] to illustrate how one might use slopegraphs in interpreting the “Spam Volume” data presented in the “Messaging Threats” section (pages 11 & 12 of the report).

The report shows individual graphs of spam volume per country from April of 2011 through March of 2012. Each individual graph conveys useful information, but I put together two slopegraphs that each show alternate and aggregate views which let you compare spam volume data relative to each country (versus just in-country).

When first doing this exploration, the scale problem reared its ugly head again since the United States is a huge spam outlier and causes the chart to be as tall as my youngest son when printed. I really wanted to show relative spam volume between countries as well as the increase or decrease between years in one chart and, after chatting with @maximumyin a bit, decided to test out using a log scale option for the charting (click for larger image):

This chart — Spam Volume by Country — instantly shows that:

  • overall volume has declined for most countries
  • two countries have remained steady
  • one country (Germany) has increased
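For readers wondering why a log scale tames an outlier like this, here is a small sketch of the mapping from a value to a vertical position. The function name and the sample volumes are made up for illustration (they are not the McAfee figures), and this is not necessarily how the tool implements its log option.

```python
import math

def log_scale_y(value, min_value, max_value, height):
    """Map a positive value to a vertical offset in [0, height].

    The largest value lands at 0 (top of the chart) and the smallest at
    height (bottom). Values must be > 0, since log10(0) is undefined.
    """
    lo = math.log10(min_value)
    hi = math.log10(max_value)
    if hi == lo:
        return height / 2.0  # all values equal: center them
    fraction = (math.log10(value) - lo) / (hi - lo)
    return height * (1.0 - fraction)

# Hypothetical volumes: one outlier 1000x larger than the smallest value.
volumes = {"A": 1000.0, "B": 10.0, "C": 1.0}
for name, volume in sorted(volumes.items()):
    print(name, round(log_scale_y(volume, 1.0, 1000.0, 300), 1))
```

On a linear 300-pixel scale, B and C would sit less than three pixels apart at the bottom of the chart; on the log scale they are 100 pixels apart, which is what makes the smaller countries readable next to the outlier.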

The next chart – Spam Volume Percentage by Country — also needed to be presented on a log scale and has some equally compelling information:

Despite holding steady count-wise, the United States’ percentage of global spam actually increased, as did that of seven other countries, with Germany having the second-largest percentage increase. Both charts present an opportunity to further explore why the values changed (since the best metrics are supposed to both inform and be actionable in some way).

I’m going to extract some more data from the McAfee report and some other security reports to show how slopegraphs can be used to interpret the data. Feedback on both the views and the use of the log scale, from general data scientists as well as those in the infosec community, would be greatly appreciated.

One of the last items for the 1.0 release is support for multiple columns of data. That will require some additional refactoring, so I’ve been procrastinating by exploring the recent “fudging” discovery. Despite claims to the contrary on other sites, there are more folks playing with slopegraphs than you might imagine. The inspiration for today’s installment comes from Jon Custer (@stuffisthings). He has a two-part “Telling Stories with Data” series that does some exploration of export data with slopegraphs. In his “Slopegraph Strikes Back” post, Jon does a spiffy job discussing data visualization fundamentals and walks the reader through his re-design of a chart on commodities ranking, including a commentary on an aspect of slopegraphs that I’ve been noticing as I’ve been doing my exploring: the ‘scale’ problem (which I began to point out in the aforementioned “fudging” post).

The data set Jon is working with allows for a great exploration as to what works best when trying to convey a message with slopegraphs. I took the values from one of the tables he extracted:

and made a “raw” slopegraph from them (focusing on the “top 10”). The graphic won’t even come close to fitting in this post but you can grab the PDF of it and see how scale is the primary enemy of slopegraphs. It does show how gold and precious metal ores have skyrocketed from 1998 to 2007, but it’s hardly an engaging and easy to read visualization (unless you really like using your scroll wheel).

Jon grok’d this point, too, and decided to focus on the power law ranking and use the slopegraph to present the rate of change of each commodity:

While he didn’t “pull a Tufte” and just include values without caveat (see left & right 90° side labels), I still believe that there needs to be either increased annotation or the inclusion of base tabular data. Using my PySlopegraph code (forgot to mention the name change), I worked up a version of Jon’s visualization that I believe provides a clean, honest view of the data (click for larger view):

Because the chart is still based on the percentages that are fairly precise:

    "Coconuts, Brazil nuts, cashews",17.93,0.93
    Coffee,12.93,3.91
    Fish,7.89,5.04
    Tobacco,7.25,3.19
    Gold,6.62,18.63
    Tea,4.14,1.32
    Cotton,4.01,1.36
    Cloves,3.58,0.29
    Diamonds,3.44,0.58
    Mounted stones,2.44,1.5
    Vegetables,1.61,1.73
    Wheat,0.54,1.38
    "Precious metal ores",0,6.76

I finally added an option to the PySlopegraph configuration file for rounding (NOTE: rounding != true binning). If you add the “round_precision” option with a value that supports Python’s round function’s little-known second parameter (arbitrary positional rounding), you can have the values round to decimal or tens/hundreds/etc places which will help with scaling issues, but will also group items (in ways that you may not have originally intended). For this chart, if we use a value of “1” (first decimal rounding precision…use negative values for rounding on the whole integer side of the decimal) it’s still unreadable due to the scale it imposes by that precision, so I ended up using the nearest whole integer rounding option (value of “0”) and also included the table of actual values, along with annotating the “rate of change” nature of the slopes.
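Here is a quick sketch of what that second parameter to `round` does and how it ends up grouping items. The helper `group_by_rounded` is a hypothetical name used only for illustration, and the four rows are taken from the first value column of the table above.

```python
from collections import defaultdict

def group_by_rounded(rows, round_precision):
    """Group labels by their value rounded to round_precision places.

    Positive precision rounds to decimal places, 0 to whole numbers, and
    negative precision to tens (-1), hundreds (-2), and so on.
    """
    groups = defaultdict(list)
    for label, value in rows:
        groups[round(value, round_precision)].append(label)
    return dict(groups)

rows = [("Tea", 4.14), ("Cotton", 4.01), ("Cloves", 3.58), ("Diamonds", 3.44)]

print(group_by_rounded(rows, 1))   # every item keeps its own value
print(group_by_rounded(rows, 0))   # Tea, Cotton and Cloves collapse together
print(round(1234567, -3))          # negative precision: nearest thousand
```

With a precision of 0, Tea, Cotton and Cloves all land on 4 while Diamonds drops to 3, which is exactly the kind of unintended grouping mentioned above. One caveat: Python 3 rounds exact halves to the nearest even digit, whereas Python 2 rounds them away from zero.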

This (again) defeats the “no wasted ink (pixels?)” component of Tufte’s original creation, but I believe it’s necessary for some types of slopegraphs to ensure the chart can stand on its own. I’m definitely becoming more convinced that many slopegraphs are more suited for an interactive visualization where you can encode more information in rollovers/popups/etc plus allow for switching of view from percentage, power-law ranking or raw numeric comparison.

For those interested in playing with this particular data set, it’ll be included in the next github code push, which will also include the rounding feature.

As the codebase gets closer to the 1.0 stretch we now have the addition of slope colors for when values go up/down or remain constant between points. The code still only handles two columns of data, but the intent is for each segment to also be colored appropriately (up/down/same) in a multi-column layout.

I was scanning for ‘slopegraph’ again via a few search engines and came across Chris Conley’s (@ResearchChat) Education and Health Care – Using Slopegraphs to Understand Complex Systems. I really like what Chris has done with the slopegraph formatting and copied the LINH data example to the project. As you can see, Chris came up with a pretty neat way to handle the overlapping data/label issue (and one which I may “borrow” when expanding my slopegraph generator):

Since Chris used colors, it seemed like a fitting example to use to show off the newest feature of the slopegraph code. Here’s the output for the same data with my implementation.

Both “slope_up_color” & “slope_down_color” (Lines 23-24) control the slope color.
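The selection logic is simple enough to sketch in a few lines. This assumes that the plain “slope_color” setting is the one used when a value stays the same, and that it is also the fallback when an up/down color is not configured; the function name is hypothetical and the real code may be organized differently.

```python
def pick_slope_color(left_value, right_value, config):
    """Return the hex color string for one slope segment."""
    default = config["slope_color"]  # assumed to cover the "same" case
    if right_value > left_value:
        return config.get("slope_up_color", default)
    if right_value < left_value:
        return config.get("slope_down_color", default)
    return default

config = {
    "slope_color": "222222",
    "slope_up_color": "B0B465",
    "slope_down_color": "A13E52",
}

print(pick_slope_color(3, 7, config))
print(pick_slope_color(7, 3, config))
print(pick_slope_color(5, 5, config))
```

Because the function only compares two adjacent values, the same call could be made once per segment in a multi-column layout.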

This example also showed that the ‘header’ processing needs some tweaking. The code currently assumes the header label width will be less than or equal to the width of the data labels. We’ll need to do some bounding box fitting and/or canvas expansion to enable more robust header text formatting.

Given the minor tweak, no code inclusion here but yet-another handy link to github.

  1. {
  2.  
  3. "label_font_family" : "Arial Narrow",
  4. "label_font_size" : "9",
  5.  
  6. "header_font_family" : "Arial Narrow",
  7. "header_font_size" : "10",
  8.  
  9. "x_margin" : "20",
  10. "y_margin" : "30",
  11.  
  12. "line_width" : "1.0",
  13.  
  14. "slope_length" : "300",
  15.  
  16. "labels" : [ "# Below Average Indicators", "# Above Average Indicators" ],
  17.  
  18. "header_color" : "000000",
  19. "background_color" : "FFFFFF",
  20. "label_color" : "111111",
  21. "value_color" : "999999",
  22. "slope_color" : "222222",
  23. "slope_up_color" : "B0B465",
  24. "slope_down_color" : "A13E52",
  25.  
  26. "value_format_string" : "%2d",
  27.  
  28. "input" : "examples/linh.csv",
  29. "output" : "examples/output/linh",
  30. "format" : "pdf",
  31.  
  32. "description" : "2011 report from ICES; LINH Indicators",
  33. "source" : "http://cconley.ca/2011/07/18/education-and-health-care-using-slopegraphs-to-understand-complex-systems/"
  34.  
  35. }
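
Every value in the configuration above is a JSON string, including the numbers and the hex colors. As a rough sketch of how a script might consume such a file, the snippet below parses a three-key excerpt and converts the strings into usable types. The names `hex_to_rgb` and `load_settings` are hypothetical and this is not the tool's actual loader; the 0..1 float range is what Cairo's `set_source_rgb` expects.

```python
import json

def hex_to_rgb(hex_color):
    """Convert 'RRGGBB' to an (r, g, b) tuple of floats in the range 0..1."""
    return tuple(int(hex_color[i:i + 2], 16) / 255.0 for i in (0, 2, 4))

def load_settings(json_text):
    """Parse config text and convert a few string values to numeric types."""
    raw = json.loads(json_text)
    return {
        "line_width": float(raw["line_width"]),
        "slope_length": int(raw["slope_length"]),
        "slope_up_rgb": hex_to_rgb(raw["slope_up_color"]),
    }

# A three-key excerpt of the configuration shown above.
sample = '{"line_width": "1.0", "slope_length": "300", "slope_up_color": "B0B465"}'
settings = load_settings(sample)
print(settings["line_width"], settings["slope_length"])
print(settings["slope_up_rgb"])
```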