Category Archives: Programming

In preparation for the upcoming 1.0 release and with the hopes of laying a foundation for more interactive slopegraphs, I threw together some rudimentary output support over lunch today for Raphaël, which means that all you have to do is generate a new slopegraph with the “js” output type and include the salient portions of the generated html/css/javascript into a web page (along with including the Raphaël script code).

The next github push will have this update. Here’s an example of the output, using the classic Tufte example chart:


It’s definitely a bit rough around the edges (my eyes immediately fixate upon spacing discrepancies) and lacking any interactivity, but the basic building blocks are in place. It also does not render on my Android phone (HTC Incredible 2) but it does render in Chrome, Safari & on my iPad. Embedding a Raphaël graphic in a web page will definitely have advantages over a PNG or PDF in most situations even if it’s not interactive, so I’ll probably keep the support in regardless of whether I continue to improve upon it.

As I was playing with the code, I kept thinking how neat it would be if there was a “Raphaël Cairo surface” option. Perhaps that will be a side project if all goes well, since it would not be that much more complicated (in fact, it may be less complicated) than the Cairo SVG surface code.

Given the focus on actual development of the PySlopegraph tool in most of the blog posts of late, folks may be wondering why an infosec/inforisk guy is obsessing so much on a tool and not talking security. Besides the fixation on filling a void and promoting an underused visualization tool, I do believe there is a place for slopegraphs in infosec data analysis and will utilize some data from McAfee’s recent Q1 2012 Threat Report [PDF] to illustrate how one might use slopegraphs in interpreting the “Spam Volume” data presented in the “Messaging Threats” section (pages 11 & 12 of the report).

The report shows individual graphs of spam volume per country from April of 2011 through March of 2012. Each individual graph conveys useful information, but I put together two slopegraphs that each show alternate and aggregate views which let you compare spam volume data relative to each country (versus just in-country).

When first doing this exploration, the scale problem reared its ugly head again since the United States is a huge spam outlier and causes the chart to be as tall as my youngest son when printed. I really wanted to show relative spam volume between countries as well as the increase or decrease between years in one chart and, after chatting with @maximumyin a bit, decided to test out using a log scale option for the charting (click for larger image):

This chart — Spam Volume by Country — instantly shows that:

  • overall volume has declined for most countries
  • two countries have remained steady
  • one country (Germany) has increased

The next chart, Spam Volume Percentage by Country, also needed to be presented on a log scale and has some equally compelling information:

Despite holding steady count-wise, the United States’ percentage of global spam actually increased, as did the percentages of seven other countries, with Germany having the second largest percentage increase. Both charts present an opportunity to further explore why the values changed (since the best metrics are supposed to both inform and be actionable in some way).

I’m going to extract some more data from the McAfee report and some other security reports to show how slopegraphs can be used to interpret the data. Feedback from general data scientists as well as those in the infosec community, on both the views and the use of the log scale, would be greatly appreciated.
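For readers curious what a log scale does to the layout, here is a minimal sketch of the idea. It is not the PySlopegraph implementation; the function name, the `height` parameter and the sample volumes are all made up for illustration. Each value is placed by its base-10 logarithm, so an outlier that is a thousand times larger sits only three "steps" away instead of a thousand.

```python
import math


def log_positions(values, height=400.0):
    """Map positive values to y offsets (0 = top of the chart) on a log10 scale."""
    lo = math.log10(min(values))
    hi = math.log10(max(values))
    span = (hi - lo) or 1.0  # avoid dividing by zero when every value is equal
    return [height * (hi - math.log10(v)) / span for v in values]


# Hypothetical volumes spanning three orders of magnitude (not McAfee data).
volumes = [1000.0, 100.0, 10.0, 1.0]
for volume, y in zip(volumes, log_positions(volumes)):
    print(volume, round(y, 1))
```

Equal ratios become equal distances, which is why the outlier no longer dictates the page height. The trade-off is that values must be positive and the reader has to be told the scale is logarithmic.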

One of the last items for the 1.0 release is support for multiple columns of data. That will require some additional refactoring, so I’ve been procrastinating by exploring the recent “fudging” discovery. Despite claims to the contrary on other sites, there are more folks playing with slopegraphs than you might imagine. The inspiration for today’s installment comes from Jon Custer (@stuffisthings). He has a two-part “Telling Stories with Data” series that does some exploration of export data with slopegraphs. In his “Slopegraph Strikes Back” post, Jon does a spiffy job discussing data visualization fundamentals and walks the reader through his re-design of a chart on commodities ranking, including a commentary on an aspect of slopegraphs that I’ve been noticing as I’ve been doing my exploring: the ‘scale’ problem (which I began to point out in the aforementioned “fudging” post).

The data set Jon is working with allows for a great exploration as to what works best when trying to convey a message with slopegraphs. I took the values from one of the tables he extracted:

and made a “raw” slopegraph from them (focusing on the “top 10”). The graphic won’t even come close to fitting in this post but you can grab the PDF of it and see how scale is the primary enemy of slopegraphs. It does show how gold and precious metal ores have skyrocketed from 1998 to 2007, but it’s hardly an engaging and easy to read visualization (unless you really like using your scroll wheel).

Jon grok’d this point, too, and decided to focus on the power law ranking and use the slopegraph to present the rate of change of each commodity:

While he didn’t “pull a Tufte” and just include values without caveat (see left & right 90° side labels), I still believe that there needs to be either increased annotation or the inclusion of base tabular data. Using my PySlopegraph code (forgot to mention the name change), I worked up a version of Jon’s visualization that I believe provides a clean, honest view of the data (click for larger view):

Because the chart is still based on the percentages that are fairly precise:

    "Coconuts, Brazil nuts, cashews",17.93,0.93
    Coffee,12.93,3.91
    Fish,7.89,5.04
    Tobacco,7.25,3.19
    Gold,6.62,18.63
    Tea,4.14,1.32
    Cotton,4.01,1.36
    Cloves,3.58,0.29
    Diamonds,3.44,0.58
    Mounted stones,2.44,1.5
    Vegetables,1.61,1.73
    Wheat,0.54,1.38
    "Precious metal ores",0,6.76

I finally added an option to the PySlopegraph configuration file for rounding (NOTE: rounding != true binning). If you add the “round_precision” option with a value that supports Python’s round function’s little-known second parameter (arbitrary positional rounding), you can have the values round to decimal or tens/hundreds/etc places which will help with scaling issues, but will also group items (in ways that you may not have originally intended). For this chart, if we use a value of “1” (first decimal rounding precision…use negative values for rounding on the whole integer side of the decimal) it’s still unreadable due to the scale it imposes by that precision, so I ended up using the nearest whole integer rounding option (value of “0”) and also included the table of actual values, along with annotating the “rate of change” nature of the slopes.
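To make the rounding behaviour concrete, here is a small sketch built on Python’s `round(value, ndigits)`. The helper name `group_by_rounded` is hypothetical and this is not the PySlopegraph code itself; the values are the left-hand column from the table above. It shows how a precision of 0 collapses several commodities onto the same row.

```python
from collections import defaultdict

# Left-hand column of the commodity table above.
rows = [
    ("Coconuts, Brazil nuts, cashews", 17.93), ("Coffee", 12.93),
    ("Fish", 7.89), ("Tobacco", 7.25), ("Gold", 6.62), ("Tea", 4.14),
    ("Cotton", 4.01), ("Cloves", 3.58), ("Diamonds", 3.44),
    ("Mounted stones", 2.44), ("Vegetables", 1.61), ("Wheat", 0.54),
    ("Precious metal ores", 0.0),
]


def group_by_rounded(rows, precision):
    """Group labels by their value rounded with round(value, precision)."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[round(value, precision)].append(label)
    return dict(groups)


# A negative precision rounds on the integer side of the decimal point.
print(round(1234.0, -2))

# Precision 0 puts Tobacco and Gold on one row, Tea/Cotton/Cloves on another.
for value, labels in sorted(group_by_rounded(rows, 0).items(), reverse=True):
    print(value, "; ".join(labels))
```

With a precision of 1 every commodity still gets its own row, which is why the chart stays too tall to read at that setting.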

This (again) defeats the “no wasted ink (pixels?)” component of Tufte’s original creation, but I believe it’s necessary for some types of slopegraphs to ensure the chart can stand on its own. I’m definitely becoming more convinced that many slopegraphs are more suited for an interactive visualization where you can encode more information in rollovers/popups/etc plus allow for switching of view from percentage, power-law ranking or raw numeric comparison.

For those interested in playing with this particular data set, it’ll be included in the next github code push, which will also include the rounding feature.

As the codebase gets closer to the 1.0 stretch we now have the addition of slope colors for when values go up/down or remain constant between points. The code still only handles two columns of data, but the intent is for each segment to also be colored appropriately (up/down/same) in a multi-column layout.

I was scanning for ‘slopegraph’ again via a few search engines and came across Chris Conley’s (@ResearchChat) Education and Health Care – Using Slopegraphs to Understand Complex Systems. I really like what Chris has done with the slopegraph formatting and copied the LINH data example to the project. As you can see, Chris came up with a pretty neat way to handle the overlapping data/label issue (and one which I may “borrow” when expanding my slopegraph generator):

Since Chris used colors, it seemed like a fitting example to use to show off the newest feature of the slopegraph code. Here’s the output for the same data with my implementation.

Both “slope_up_color” & “slope_down_color” (Lines 23-24) control the slope color.
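As a rough illustration of how the up/down/same choice can be made, here is a sketch that is not lifted from the project. The helper names `hex_to_rgb` and `pick_slope_color` are hypothetical, and falling back to “slope_color” when an up/down color is missing is my assumption. The hex values come from the LINH configuration.

```python
def hex_to_rgb(hex_color):
    """Convert an 'RRGGBB' string to an (r, g, b) tuple of floats in 0..1."""
    return tuple(int(hex_color[i:i + 2], 16) / 255.0 for i in (0, 2, 4))


def pick_slope_color(start, end, config):
    """Return the RGB color for one slope line based on its direction."""
    if end > start:
        key = "slope_up_color"
    elif end < start:
        key = "slope_down_color"
    else:
        key = "slope_color"
    # Assumed fallback: use the plain slope color if no up/down color is set.
    return hex_to_rgb(config.get(key, config["slope_color"]))


config = {
    "slope_color": "222222",
    "slope_up_color": "B0B465",
    "slope_down_color": "A13E52",
}
print(pick_slope_color(3, 5, config))  # rising line
print(pick_slope_color(5, 3, config))  # falling line
print(pick_slope_color(4, 4, config))  # flat line
```

The resulting tuple can be passed straight to cairo’s `set_source_rgb` before stroking each line.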

This example also showed that the ‘header’ processing needs some tweaking. The code currently assumes the header label width will be less than or equal to the width of the data labels. We’ll need to do some bounding box fitting and/or canvas expansion to enable more robust header text formatting.

Given the minor tweak, no code inclusion here but yet-another handy link to github.

  1. {
  2.  
  3. "label_font_family" : "Arial Narrow",
  4. "label_font_size" : "9",
  5.  
  6. "header_font_family" : "Arial Narrow",
  7. "header_font_size" : "10",
  8.  
  9. "x_margin" : "20",
  10. "y_margin" : "30",
  11.  
  12. "line_width" : "1.0",
  13.  
  14. "slope_length" : "300",
  15.  
  16. "labels" : [ "# Below Average Indicators", "# Above Average Indicators" ],
  17.  
  18. "header_color" : "000000",
  19. "background_color" : "FFFFFF",
  20. "label_color" : "111111",
  21. "value_color" : "999999",
  22. "slope_color" : "222222",
  23. "slope_up_color" : "B0B465",
  24. "slope_down_color" : "A13E52",
  25.  
  26. "value_format_string" : "%2d",
  27.  
  28. "input" : "examples/linh.csv",
  29. "output" : "examples/output/linh",
  30. "format" : "pdf",
  31.  
  32. "description" : "2011 report from ICES; LINH Indicators",
  33. "source" : "http://cconley.ca/2011/07/18/education-and-health-care-using-slopegraphs-to-understand-complex-systems/"
  34.  
  35. }

The best way to explain this release will be to walk you through an updated configuration file:

  1. {
  2.  
  3. "label_font_family" : "Palatino",
  4. "label_font_size" : "9",
  5.  
  6. "header_font_family" : "Palatino",
  7. "header_font_size" : "10",
  8.  
  9. "x_margin" : "20",
  10. "y_margin" : "30",
  11.  
  12. "line_width" : "0.5",
  13.  
  14. "slope_length" : "150",
  15.  
  16. "labels" : [ "1970", "1979" ],
  17.  
  18. "header_color" : "000000",
  19. "background_color" : "FFFFFF",
  20. "label_color" : "111111",
  21. "value_color" : "999999",
  22. "slope_color" : "AAAAAA",
  23.  
  24. "value_format_string" : "%2d",
  25.  
  26. "input" : "receipts.csv",
  27. "output" : "receipts",
  28. "format" : "svg",
  29.  
  30. "description" : "Current Receipts of Government as a Percentage of Gross Domestic Product, 1970 & 1979",
  31. "source" : "Tufte, Edward. The Visual Display of Quantitative Information. Cheshire, Connecticut: Graphics Press; 1983; p. 158"
  32.  
  33. }

I added the ability to include column headers and separated the font specifications for both the column data/labels and the headers (Lines 3-7). You’re not required to use headers, so just leave out the header font specification and the “labels” option (Line 16) if you don’t want them (it keys off of the font spec, tho). You can also color headers via the “header_color” option (Line 18).

If you use the keyword “transparent” for the “background_color” config option (Line 19, tho it’s not transparent in this example) it will leave out the fill, which is useful for blog posts or embedding in other documents. Works best for PNG & PDF output.

If you want to use a different value for width of the space for the slopelines, you can tweak this via the “slope_length” option (Line 14). This is setting the stage for multi-column slopegraphs.

When exchanging some communications with @jayjacobs regarding slopegraphs and seeing his spiffy use of them for incident data analysis, it became readily apparent that I needed to include a way of formatting the column data values, so there’s a “value_format_string” option, now, that works with Pythonic sprintf formats.
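For anyone who has not used Python’s `%` string formatting, here is a quick sketch of what different “value_format_string” settings do to a value. The wrapper `format_value` is a hypothetical name for illustration; the code in the listing applies the format string inline.

```python
def format_value(value, value_format_string="%2d"):
    """Apply a printf-style format string to a single numeric value."""
    return value_format_string % value


print(repr(format_value(35.2)))               # "%2d" drops the fraction
print(repr(format_value(7.9)))                # padded to two characters
print(repr(format_value(35.2, "%.1f")))       # one decimal place
print(repr(format_value(12.93, "%.1f%%")))    # "%%" emits a literal percent sign
```

Note that `%d` truncates rather than rounds, so a format such as `%.0f` may be a better fit when the chart is meant to show rounded values.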

Finally, I added “description” and “source” options that the code does not yet process, but allows for documenting the configuration a bit, since there’s no good way to embed comments in a JSON-format configuration file.

As always, the code’s up on github and also below:

    import argparse
    import csv
    import json

    import cairo


    def split(input, size):
        # break a string into chunks of `size` characters ("RRGGBB" -> ["RR", "GG", "BB"])
        return [input[start:start+size] for start in range(0, len(input), size)]


    class Slopegraph:

        def readCSV(self, filename):

            # newline='' lets the csv module handle line endings itself
            with open(filename, newline='') as csvFile:

                slopeReader = csv.reader(csvFile, delimiter=',', quotechar='"')

                for row in slopeReader:

                    # add chosen values (need start/end for each CSV row) to the final plotting array.

                    lab = row[0] # label
                    beg = float(row[1]) # left vals
                    end = float(row[2]) # right vals

                    self.pairs.append( (beg, end) )

                    # combine labels of common values into one string

                    if beg in self.starts:
                        self.starts[beg] = self.starts[beg] + "; " + lab
                    else:
                        self.starts[beg] = lab

                    if end in self.ends:
                        self.ends[end] = self.ends[end] + "; " + lab
                    else:
                        self.ends[end] = lab


        def sortKeys(self):

            # sort all the values (in the event the CSV wasn't) so
            # we can determine the smallest increment we need to use
            # when stacking the labels and plotting points

            self.startSorted = [(k, self.starts[k]) for k in sorted(self.starts)]
            self.endSorted = [(k, self.ends[k]) for k in sorted(self.ends)]

            # start with "no gap found yet"; any real gap will be smaller
            self.delta = float("inf")

            self.startKeys = sorted(self.starts.keys())
            for i in range(len(self.startKeys) - 1):
                currDelta = float(self.startKeys[i+1]) - float(self.startKeys[i])
                if (currDelta < self.delta):
                    self.delta = currDelta

            self.endKeys = sorted(self.ends.keys())
            for i in range(len(self.endKeys) - 1):
                currDelta = float(self.endKeys[i+1]) - float(self.endKeys[i])
                if (currDelta < self.delta):
                    self.delta = currDelta


        def findExtremes(self):

            # we also need to find the absolute min & max values
            # so we know how to scale the plots

            self.lowest = min(self.startKeys)
            if (min(self.endKeys) < self.lowest) : self.lowest = min(self.endKeys)

            self.highest = max(self.startKeys)
            if (max(self.endKeys) > self.highest) : self.highest = max(self.endKeys)

            self.delta = float(self.delta)
            self.lowest = float(self.lowest)
            self.highest = float(self.highest)


        def calculateExtents(self, filename, format, valueFormatString):

            # use a throwaway letter-sized surface just to measure text

            if (format == "pdf"):
                surface = cairo.PDFSurface (filename, 8.5*72, 11*72)
            elif (format == "ps"):
                surface = cairo.PSSurface(filename, 8.5*72, 11*72)
                surface.set_eps(True)
            elif (format == "svg"):
                surface = cairo.SVGSurface (filename, 8.5*72, 11*72)
            elif (format == "png"):
                surface = cairo.ImageSurface (cairo.FORMAT_ARGB32, int(8.5*72), int(11*72))
            else:
                surface = cairo.PDFSurface (filename, 8.5*72, 11*72)

            cr = cairo.Context(surface)
            cr.save()
            cr.select_font_face(self.LABEL_FONT_FAMILY, cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
            cr.set_font_size(self.LABEL_FONT_SIZE)
            cr.set_line_width(self.LINE_WIDTH)

            # find the *real* maximum label width (not just based on number of chars)

            maxLabelWidth = 0
            maxNumWidth = 0

            for k in sorted(self.startKeys):
                s1 = self.starts[k]
                xbearing, ybearing, self.sWidth, self.sHeight, xadvance, yadvance = (cr.text_extents(s1))
                if (self.sWidth > maxLabelWidth) : maxLabelWidth = self.sWidth
                xbearing, ybearing, self.startMaxLabelWidth, startMaxLabelHeight, xadvance, yadvance = (cr.text_extents(valueFormatString % (k)))
                if (self.startMaxLabelWidth > maxNumWidth) : maxNumWidth = self.startMaxLabelWidth

            self.sWidth = maxLabelWidth
            self.startMaxLabelWidth = maxNumWidth

            maxLabelWidth = 0
            maxNumWidth = 0

            for k in sorted(self.endKeys):
                e1 = self.ends[k]
                xbearing, ybearing, self.eWidth, eHeight, xadvance, yadvance = (cr.text_extents(e1))
                if (self.eWidth > maxLabelWidth) : maxLabelWidth = self.eWidth
                xbearing, ybearing, self.endMaxLabelWidth, endMaxLabelHeight, xadvance, yadvance = (cr.text_extents(valueFormatString % (k)))
                if (self.endMaxLabelWidth > maxNumWidth) : maxNumWidth = self.endMaxLabelWidth

            self.eWidth = maxLabelWidth
            self.endMaxLabelWidth = maxNumWidth

            cr.restore()
            cr.show_page()
            surface.finish()

            # final canvas size: margins + labels + values + slope area

            self.width = self.X_MARGIN + self.sWidth + self.SPACE_WIDTH + self.startMaxLabelWidth + self.SPACE_WIDTH + self.SLOPE_LENGTH + self.SPACE_WIDTH + self.endMaxLabelWidth + self.SPACE_WIDTH + self.eWidth + self.X_MARGIN
            self.height = (self.Y_MARGIN * 2) + (((self.highest - self.lowest) / self.delta) * self.LINE_HEIGHT)

            self.HEADER_SPACE = 0.0
            if (self.HEADER_FONT_FAMILY != None):
                self.HEADER_SPACE = self.HEADER_FONT_SIZE + 2*self.LINE_HEIGHT
                self.height += self.HEADER_SPACE


        def makeSlopegraph(self, filename, config):

            # convert the "RRGGBB" config colors to 0..1 floats for cairo

            (lab_r,lab_g,lab_b) = split(config["label_color"],2)
            LAB_R = (int(lab_r, 16)/255.0)
            LAB_G = (int(lab_g, 16)/255.0)
            LAB_B = (int(lab_b, 16)/255.0)

            (val_r,val_g,val_b) = split(config["value_color"],2)
            VAL_R = (int(val_r, 16)/255.0)
            VAL_G = (int(val_g, 16)/255.0)
            VAL_B = (int(val_b, 16)/255.0)

            (line_r,line_g,line_b) = split(config["slope_color"],2)
            LINE_R = (int(line_r, 16)/255.0)
            LINE_G = (int(line_g, 16)/255.0)
            LINE_B = (int(line_b, 16)/255.0)

            if (config["background_color"] != "transparent"):
                (bg_r,bg_g,bg_b) = split(config["background_color"],2)
                BG_R = (int(bg_r, 16)/255.0)
                BG_G = (int(bg_g, 16)/255.0)
                BG_B = (int(bg_b, 16)/255.0)

            if (config['format'] == "pdf"):
                surface = cairo.PDFSurface (filename, self.width, self.height)
            elif (config['format'] == "ps"):
                surface = cairo.PSSurface(filename, self.width, self.height)
                surface.set_eps(True)
            elif (config['format'] == "svg"):
                surface = cairo.SVGSurface (filename, self.width, self.height)
            elif (config['format'] == "png"):
                surface = cairo.ImageSurface (cairo.FORMAT_ARGB32, int(self.width), int(self.height))
            else:
                surface = cairo.PDFSurface (filename, self.width, self.height)

            cr = cairo.Context(surface)

            cr.save()

            cr.set_line_width(self.LINE_WIDTH)

            # skip the fill entirely for a transparent background

            if (config["background_color"] != "transparent"):
                cr.set_source_rgb(BG_R,BG_G,BG_B)
                cr.rectangle(0,0,self.width,self.height)
                cr.fill()

            # draw headers (if present)

            if (self.HEADER_FONT_FAMILY != None):

                (header_r,header_g,header_b) = split(config["header_color"],2)
                HEADER_R = (int(header_r, 16)/255.0)
                HEADER_G = (int(header_g, 16)/255.0)
                HEADER_B = (int(header_b, 16)/255.0)

                cr.save()

                cr.select_font_face(self.HEADER_FONT_FAMILY, cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_BOLD)
                cr.set_font_size(self.HEADER_FONT_SIZE)
                cr.set_source_rgb(HEADER_R,HEADER_G,HEADER_B)

                xbearing, ybearing, hWidth, hHeight, xadvance, yadvance = (cr.text_extents(config["labels"][0]))
                cr.move_to(self.X_MARGIN + self.sWidth - hWidth, self.Y_MARGIN + self.HEADER_FONT_SIZE)
                cr.show_text(config["labels"][0])

                xbearing, ybearing, hWidth, hHeight, xadvance, yadvance = (cr.text_extents(config["labels"][1]))
                cr.move_to(self.width - self.X_MARGIN - self.SPACE_WIDTH - self.eWidth, self.Y_MARGIN + self.HEADER_FONT_SIZE)
                cr.show_text(config["labels"][1])

                cr.stroke()

                cr.restore()

            # draw start labels at the correct positions

            cr.select_font_face(self.LABEL_FONT_FAMILY, cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
            cr.set_font_size(self.LABEL_FONT_SIZE)

            valueFormatString = config["value_format_string"]

            for k in sorted(self.startKeys):

                val = float(k)
                label = self.starts[k]
                xbearing, ybearing, lWidth, lHeight, xadvance, yadvance = (cr.text_extents(label))
                xbearing, ybearing, kWidth, kHeight, xadvance, yadvance = (cr.text_extents(valueFormatString % (val)))

                cr.set_source_rgb(LAB_R,LAB_G,LAB_B)
                cr.move_to(self.X_MARGIN + (self.sWidth - lWidth), self.Y_MARGIN + self.HEADER_SPACE + (self.highest - val) * self.LINE_HEIGHT * (1/self.delta))
                cr.show_text(label)

                cr.set_source_rgb(VAL_R,VAL_G,VAL_B)
                cr.move_to(self.X_MARGIN + self.sWidth + self.SPACE_WIDTH + (self.startMaxLabelWidth - kWidth), self.Y_MARGIN + self.HEADER_SPACE + (self.highest - val) * self.LINE_HEIGHT * (1/self.delta))
                cr.show_text(valueFormatString % (val))

                cr.stroke()

            # draw end labels at the correct positions

            for k in sorted(self.endKeys):

                val = float(k)
                label = self.ends[k]
                xbearing, ybearing, lWidth, lHeight, xadvance, yadvance = (cr.text_extents(label))

                cr.set_source_rgb(VAL_R,VAL_G,VAL_B)
                cr.move_to(self.width - self.X_MARGIN - self.SPACE_WIDTH - self.eWidth - self.SPACE_WIDTH - self.endMaxLabelWidth, self.Y_MARGIN + self.HEADER_SPACE + (self.highest - val) * self.LINE_HEIGHT * (1/self.delta))
                cr.show_text(valueFormatString % (val))

                cr.set_source_rgb(LAB_R,LAB_G,LAB_B)
                cr.move_to(self.width - self.X_MARGIN - self.SPACE_WIDTH - self.eWidth, self.Y_MARGIN + self.HEADER_SPACE + (self.highest - val) * self.LINE_HEIGHT * (1/self.delta))
                cr.show_text(label)

                cr.stroke()

            # do the actual plotting

            cr.set_line_width(self.LINE_WIDTH)
            cr.set_source_rgb(LINE_R, LINE_G, LINE_B)

            for s1,e1 in self.pairs:
                cr.move_to(self.X_MARGIN + self.sWidth + self.SPACE_WIDTH + self.startMaxLabelWidth + self.LINE_START_DELTA, self.Y_MARGIN + self.HEADER_SPACE + (self.highest - s1) * self.LINE_HEIGHT * (1/self.delta) - self.LINE_HEIGHT/4)
                cr.line_to(self.width - self.X_MARGIN - self.eWidth - self.SPACE_WIDTH - self.endMaxLabelWidth - self.LINE_START_DELTA, self.Y_MARGIN + self.HEADER_SPACE + (self.highest - e1) * self.LINE_HEIGHT * (1/self.delta) - self.LINE_HEIGHT/4)
                cr.stroke()

            cr.restore()
            cr.show_page()

            # image surfaces are not tied to a file, so PNGs are written explicitly

            if (config['format'] == "png"):
                surface.write_to_png(filename)

            surface.finish()

        def __init__(self, config):

            # per-instance containers (class-level ones would be shared between graphs)

            self.starts = {} # starting "points"
            self.ends = {} # ending "points"
            self.pairs = [] # base pair array for the final plotting

            # since some methods need these, make them local to the class

            self.LABEL_FONT_FAMILY = config["label_font_family"]
            self.LABEL_FONT_SIZE = float(config["label_font_size"])

            if "header_font_family" in config:
                self.HEADER_FONT_FAMILY = config["header_font_family"]
                self.HEADER_FONT_SIZE = float(config["header_font_size"])
            else:
                self.HEADER_FONT_FAMILY = None
                self.HEADER_FONT_SIZE = None

            self.X_MARGIN = float(config["x_margin"])
            self.Y_MARGIN = float(config["y_margin"])
            self.LINE_WIDTH = float(config["line_width"])

            if "slope_length" in config:
                self.SLOPE_LENGTH = float(config["slope_length"])
            else:
                self.SLOPE_LENGTH = 300

            self.SPACE_WIDTH = self.LABEL_FONT_SIZE / 2.0
            self.LINE_HEIGHT = self.LABEL_FONT_SIZE + (self.LABEL_FONT_SIZE / 2.0)
            self.LINE_START_DELTA = 1.5*self.SPACE_WIDTH

            OUTPUT_FILE = config["output"] + "." + config["format"]

            # process the values & make the slopegraph

            self.readCSV(config["input"])
            self.sortKeys()
            self.findExtremes()
            self.calculateExtents(OUTPUT_FILE, config["format"], config["value_format_string"])
            self.makeSlopegraph(OUTPUT_FILE, config)


    def main():

        parser = argparse.ArgumentParser(description="Creates a slopegraph from a CSV source")
        parser.add_argument("--config", required=True,
                            help="config file name to use for slopegraph creation")
        args = parser.parse_args()

        if args.config:

            with open(args.config) as json_data:
                config = json.load(json_data)

            Slopegraph(config)

        return 0

    if __name__ == "__main__":
        main()
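The vertical placement used throughout the listing above boils down to one formula: the smallest gap between any two values (`delta`) gets exactly one line of height, and everything else is scaled from that. Here is a stripped-down sketch of that arithmetic, using standalone functions with hypothetical names and made-up sample values, which also shows why a single outlier makes a chart enormous.

```python
def smallest_gap(values):
    """Smallest difference between distinct neighbouring values (1.0 if none)."""
    ordered = sorted(set(values))
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return min(gaps) if gaps else 1.0


def y_position(value, highest, delta, line_height, y_margin, header_space=0.0):
    """Distance from the top of the canvas to the row for `value`."""
    return y_margin + header_space + (highest - value) * line_height / delta


# Made-up values: two neighbours one unit apart plus one large outlier.
values = [10.0, 11.0, 500.0]
delta = smallest_gap(values)
line_height = 9 * 1.5  # a 9pt label font plus half again, as in the listing

for v in values:
    print(v, y_position(v, max(values), delta, line_height, y_margin=30.0))

# Total height in points (72 per inch): margins plus one line per delta step.
height = 2 * 30.0 + (max(values) - min(values)) / delta * line_height
print(height, "points, or about", round(height / 72, 1), "inches")
```

Because the smallest gap sets the unit, tightly clustered values next to an outlier are the worst case, which is the motivation for the rounding and log-scale options.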

If you’re even remotely familiar with slopegraphs, then you’ll recognize Tufte’s classic 1970-1979 GDP chart example (click for larger version):

What you may not notice initially, however, is that Tufte — well — cheated. Yes, I said it. Cheated. I can show it by zooming into the “Belgium/Canada/Finland” grouping in the left column:

The value “35.2” shows up twice and is actually equivalent (on the scale) to the “34.9” value. If the numbers were actually representative of the positional elements, the slopegraph should, in fact, look more like this (you’ll need to click on the image to view it, as it’s big):

You may want to even grab the PDF version as it’s a bit more legible if you’re going to peruse it for any length of time.

One of the strengths of the slopegraph chart is to have every bit of ink be useful. In the GDP chart, however, while the slopes are accurate (as you’ll see below) you’re a bit misled by the juxtaposition of similar values on either column and the use of detailed values when clearly the chart data has been rounded. Here’s more of what the actual chart represents (click the graphic for larger version):

If you compare this version (made with my code) with Tufte’s example, you’ll see the similarity (except that I group common values on a single line). I only discovered this after inputting the GDP numbers from Tufte’s chart into an example configuration that I could use with my code.

The fact that Tufte rounded (perhaps “binned” may be a better word) the values does not make the results less useful nor make the “Britain” outlier any less significant (it’s even more pronounced in my huge, “unbinned” version). But, his graphic does seem misleading (to me) knowing that they are rounded values. I believe something like this would have been a more honest reproduction, even though it adds ink (click the graphic for larger version):

I believe a cue about the rounding and the inclusion of the actual values make this a more honest chart.

As a result of this digging, I’m going to be working on some “binning” code for my slopegraph implementation. I’m anticipating having to use a more interactive model when it comes to some data sources (think hover/zoom for data detail) and will probably look at using Raphael or D3 for them.
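As a first stab at what “binning” could look like, here is a sketch only. The function name and `bin_width` parameter are hypothetical, and the pairing of the 35.2/35.2/34.9 values with Belgium, Canada and Finland is my reading of the zoomed grouping above. Each value is snapped to the nearest multiple of a bin width for positioning, while the original value is kept so it can still be printed next to the label.

```python
def bin_values(rows, bin_width):
    """Group (label, value) rows by the nearest multiple of bin_width.

    Returns {bin_position: [(label, original_value), ...]} so the chart can
    place rows by bin while still printing the actual values.
    """
    bins = {}
    for label, value in rows:
        position = round(value / bin_width) * bin_width
        bins.setdefault(position, []).append((label, value))
    return bins


rows = [("Belgium", 35.2), ("Canada", 35.2), ("Finland", 34.9)]

# Half-point bins put all three countries on one row...
print(bin_values(rows, 0.5))

# ...while tenth-of-a-point bins separate Finland again.
print(len(bin_values(rows, 0.1)))
```

Keeping the original values alongside the bin position is what allows the chart to declare the rounding instead of hiding it.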

I’m curious as to what you think of the “undeclared rounding” and how you’d present the information given the obvious need for some type of “binning”.

I’m on a “three things” motif for 2012, as it’s really difficult for most folks to focus on more than three core elements well. This is especially true for web developers as they have so much to contend with on a daily basis, whether it be new features, bug reports, user help requests or just ensuring proper caffeine levels are maintained.

In 2011, web sites took more hits than they ever have and, sadly, most attacks could have been prevented. I fear that the pastings will continue in 2012, but there are some steps you can take to help make your site less of a target.

Bookmark & Use OWASP’s Web Site Regularly

I’d feel a little sorry for hacked web sites if it weren’t for resources like OWASP, tools like IronBee and principles like Rugged being in abundance, with many smart folks associated with them being more than willing to offer counsel and advice.

If you run a web site or develop web applications and have not inhaled all the information OWASP has to provide, then you are engaging in the Internet equivalent of driving a Ford Pinto (the exploding kind) without seat belts, airbags, doors and a working dashboard console. There is so much good information and advice out there with solid examples that prove some truly effective security measures can really be implemented in a single line of code.
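To give a flavor of the “single line of code” class of fix, here is one illustration of my own choosing (not quoted from OWASP material): using a parameterized query instead of building SQL by string concatenation. The table and function names are hypothetical, and Python’s built-in `sqlite3` module stands in for whatever database driver you actually use.

```python
import sqlite3


def find_user(conn, username):
    """Look up a user by name; the '?' placeholder keeps the input as data."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

print(find_user(conn, "alice"))
# A classic injection string is treated as a (non-matching) name, not as SQL.
print(find_user(conn, "' OR '1'='1"))
```

The only difference from the vulnerable version is the placeholder and the parameter tuple, which is why this sort of measure is cheap to adopt.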

Make it a point to read, re-read and keep-up-to-date on new articles and resources that OWASP provides. I know you also need to beat the competition to new features and crank out “x” lines of code per day, but you also need to do what it takes to avoid joining the ranks of those in DataLossDB.

Patch & Properly Configure Your Bootstrap Components

Your web app uses frameworks, runs in some type of web container and sits on top of an operating system. Unfortunately, vulnerabilities pop up in each of those components from time to time and you need to keep on top of those and determine which ones you will patch and when. Sites like Secunia and US-CERT aggregate patch information pretty well for operating systems and popular server software components, but it’s best to also subscribe to release and security mailing lists for your frameworks and other bootstrap components.

Configuring your bootstrap environment securely is also important and you can use handy guides over at the Center for Internet Security and the National Vulnerability Database (which is also good for vulnerability reports). The good news is that you probably only need to double-check this a couple of times a year and can also integrate secure configuration baselines into tools like Chef & Puppet.

Secure Data Appropriately

I won’t belabor this point (especially if you promise to read the OWASP guidance on this thoroughly) but you need to look at the data being stored and how it is accessed and determine the most appropriate way to secure it. Don’t store more than you absolutely need to. Protect password fields with more than a plain MD5 hash (a salted, deliberately slow hash is the usual choice) and encrypt other sensitive data. Don’t store any credit card numbers (really, just don’t) or tokenize them if you do (but you really don’t). Keep data off the front-end environment and watch the database and application logs with a service like Loggly (to see if there’s anything fishy going on).
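As one way to go beyond a plain MD5 hash for password fields, here is a minimal sketch using only Python’s standard library (`hashlib.pbkdf2_hmac`, `hmac.compare_digest`, `os.urandom`). The function names and the iteration count are assumptions for illustration; check current OWASP guidance for recommended algorithms and work factors before relying on any particular setting.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=200_000):
    """Return (salt, iterations, digest) for storage alongside the user."""
    if salt is None:
        # A fresh random salt per password defeats precomputed tables.
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password, salt, iterations, expected_digest):
    """Recompute the digest and compare it in constant time."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(digest, expected_digest)
```

Storing the salt and iteration count next to the digest lets you raise the work factor later without invalidating existing accounts.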

I’m going to cheat and close with a fourth resolution for you: Create (and test) a data breach response plan. Any honest security professional will tell you that it’s virtually impossible to prevent a breach if a hacker is determined enough, so the best thing you can do for your user base is to respond well when it happens. The only way to do that is to have a plan and to test it (so you know what you are doing when the breach occurs). And, you should run your communications plan by other folks to make sure it’s adequate (ping @securitytwits for suggestions for good resources).

You want to be able to walk away from a breach with your reputation as intact as possible (so you’ll have to keep the other three resolutions anyway) with your users feeling fully informed and assured that you did everything you could to prevent it.

What other security-related resolutions are you making this year as a web developer or web site owner and what other tools/services are you using to secure your sites?

UPDATE: Check out the newer post on additional features.

There has been much ado of late about Dropbox security with one of the most egregious issues being how easy it is to surreptitiously “clone” someone else’s Dropbox by obtaining just one piece of data – the host id – from the Dropbox SQLite config.db.

Moloch built a Windows & Linux impersonation/cloning utility in Python that was/is meant to be used from a USB/external volume. The utility can save the cloned host id to a local file and also has the capability to use a simple HTTP GET request to log data to a “mothership” web site.

Since many Dropbox users use OS X (including me) I didn’t want them to feel left out or smugly more secure. So, I set about creating a native version of the utility.

This release is not as feature-rich as Moloch’s Python script but it won’t take much more effort to crank out a version that duplicates all of the functionality. “Release early. Release often.” as the kids these days are wont to say.

You can find the source at its github repository. When building it or just downloading & running the executable (see below), you should heed the repo’s README and take care to change the following items in the application’s Info.plist property list:

  • MothershipURL – this is the URL of the remote host you want to store the cloned info to. It defaults to somesite.domain/mothership.php to avoid accidentally sending your own Dropbox data to a remote host. PLEASE NOTE that you will need to get the mothership.php script from the original Windows/Linux code distribution as I have not asked for permission to distribute it here. You can grab the original dbClone.rar directly from here: dl.dropbox.com/u/341940/dbClone.rar (I love the irony of it being hosted on Dropbox itself).

    ALSO NOTE that there’s no need to modify the application’s property list if you don’t mind typing in a URL each run. I eventually plan on making this a separate property list file that allows for multiple URLs so you can select it from a drop-down (and still type a new one if you like).

  • LogFilename – just include the filename you want to use when storing the cloned info locally if you do not like the default (it’s the same as Moloch’s – "GroceryList.txt"). It defaults to the top-level of the mounted volume (the original Linux & Windows dbClone was meant to be run from a USB/external volume) or "~/" if running it on your boot drive.

You can use the property list editor(s) that come with Apple’s Developer Tools or use vim, TextEdit, TextWrangler (or your favorite text editor) and modify these lines appropriately:

[code]
<key>LogFilename</key>
<string>GroceryList.txt</string>
<key>MothershipURL</key>
<string>http://somesite.domain/mothership.php</string>
[/code]
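If you would rather script the change than open an editor, the same two keys can be set with Python’s standard `plistlib` module. This is only a sketch: the helper name `set_clone_settings` is hypothetical, and you need to supply the path to the Info.plist inside your own copy of the app bundle.

```python
import plistlib


def set_clone_settings(plist_path, log_filename, mothership_url):
    """Rewrite the LogFilename and MothershipURL keys of a property list."""
    with open(plist_path, "rb") as fh:
        info = plistlib.load(fh)

    # Only these two keys are touched; everything else is preserved.
    info["LogFilename"] = log_filename
    info["MothershipURL"] = mothership_url

    with open(plist_path, "wb") as fh:
        plistlib.dump(info, fh)
```

Note that `plistlib.dump` writes XML format by default, so a plist that was stored as binary comes back as XML unless you pass `fmt=plistlib.FMT_BINARY`.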

If you do use the “backup” option, the current naming scheme is "backup-config.db" and it’s important to note that the program will not attempt to overwrite the file. I may change that behaviour in an upcoming release.

I tested the build on OS X 10.6.7 but the Xcode project is set to build for compatibility with 10.5.x or 10.6.x. Feedback on behaviour on other systems would be most welcome.

If you just want the executable, grab the zip’d app and give it a go.

Any and all feedback is welcome (via github or in the comments).