
Category Archives: Python

While you can (and should) view [all the presentations](https://speakerdeck.com/pyconslides) from #PyCon2013, here are my picks for the ones that interested me the most, as they focus on scaling, mapping, automation (both web & electronics) and data analysis:

– [Chef: Why you should automate your web infrastructure](https://speakerdeck.com/pyconslides/chef-why-you-should-automate-your-web-infrastructure-by-kate-heddleston) by Kate Heddleston
– [Messaging at Scale at Instagram](https://speakerdeck.com/pyconslides/messaging-at-scale-at-instagram-by-rick-branson) by Rick Branson
– [Python at Netflix](https://speakerdeck.com/pyconslides/python-at-netflix-by-jeremy-edberg-corey-bertram-and-roy-rapoport) by Jeremy Edberg, Corey Bertram, and Roy Rapoport
– [Real-time Tracking and Mapping of Geographic Objects](https://speakerdeck.com/pyconslides/real-time-tracking-and-mapping-of-geographic-objects-by-ragi-burhum) by Ragi Burhum
– [Scaling Realtime at DISQUS](https://speakerdeck.com/pyconslides/scaling-realtime-at-disqus-by-adam-hitchcock) by Adam Hitchcock
– [A Crash Course in MongoDB](https://speakerdeck.com/pyconslides/a-crash-course-in-mongodb)
– [Server Log Analysis with Pandas](https://speakerdeck.com/pyconslides/server-log-analysis-with-pandas-by-taavi-burns) by Taavi Burns
– [Who’s There – Home Automation with Arduino and RaspberryPi](https://speakerdeck.com/pyconslides/whos-there-home-automation-with-arduino-and-raspberrypi-by-rupa-dachere) by Rupa Dachere
– [Why you should use Python 3 for text processing](https://speakerdeck.com/pyconslides/why-you-should-use-python-3-for-text-processing-by-david-mertz) by David Mertz
– [Awesome Big Data Algorithms](https://speakerdeck.com/pyconslides/awesome-big-data-algorithms-by-titus-brown) by Titus Brown

A huge thanks to the speakers and conference organizers for making these resources freely available, especially to those of us who were not able to attend the conference.

I updated the code to use ggsave and tweaked some of the font & line size values for more consistent (and pretty) output. This also means that I really need to get this up on github.

If you even remotely follow this blog, you’ll see that I’m kinda obsessed with slopegraphs. While I’m pretty happy with my Python implementation, I do quite a bit of data processing & data visualization in R these days and had a few free hours on a recent trip to Seattle, so I whipped up some R code to do traditional and multi-column rank-order slopegraphs, mostly due to a post over at Microsoft’s security blog.

#
# multicolumn-rankorder-slopegraph.R
#
# 2013-01-12 - formatting tweaks
# 2013-01-10 - Initial version - boB Rudis - @hrbrmstr
#
# Pretty much explained by the script title. This is an R script which is designed to produce
# 2+ column rank-order slopegraphs with the ability to highlight meaningful patterns
#
 
library(ggplot2)
library(reshape2)
 
# transcription of table from:
# http://blogs.technet.com/b/security/archive/2013/01/07/operating-system-infection-rates-the-most-common-malware-families-on-each-platform.aspx
#
# You can download it from: 
# https://docs.google.com/spreadsheet/ccc?key=0AlCY1qfmPPZVdHpwYk0xYkh3d2xLN0lwTFJrWXppZ2c
 
df = read.csv("~/Desktop/malware.csv")
 
# For this slopegraph, we care that #1 is at the top and that higher value #'s are at the bottom, so we 
# negate the rank values in the table we just read in
 
df$Rank.Win7.SP1 = -df$Rank.Win7.SP1
df$Rank.Win7.RTM = -df$Rank.Win7.RTM
df$Rank.Vista = -df$Rank.Vista
df$Rank.XP = -df$Rank.XP
 
# Also, we are really comparing the end state (ultimately) so sort the list by the end state.
# In this case, it's the Windows 7 malware data.
 
df$Family = with(df, reorder(Family, Rank.Win7.SP1))
 
# We need to take the multi-columns and make it into 3 for line-graph processing 
 
dfm = melt(df)
 
# We define our color palette manually so we can highlight the lines that "matter".
# This means you'll need to generate the slopegraph at least one time prior to determine
# which lines need coloring. This should be something we pick up algorithmically, eventually
 
sgPalette = c("#990000", "#990000",  "#CCCCCC", "#CCCCCC", "#CCCCCC","#CCCCCC", "#990000", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC")
#sgPalette = c("#CCCCCC", "#CCCCCC",  "#CCCCCC", "#CCCCCC", "#CCCCCC","#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC", "#CCCCCC")
#sgPalette = c("#000000", "#000000",  "#000000", "#000000", "#000000","#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000", "#000000")
 
 
# start the plot
#
# we do a ton of customisations to the plain ggplot canvas, but it's not rocket science
 
sg = ggplot(dfm, aes(factor(variable), value, 
                     group = Family, 
                     colour = Family, 
                     label = Family)) +
  scale_colour_manual(values=sgPalette) +
  theme(legend.position = "none", 
        axis.text.x = element_text(size=5),
        axis.text.y=element_blank(), 
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        axis.ticks=element_blank(),
        axis.line=element_blank(),
        panel.grid.major = element_line("black", size = 0.1),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.background = element_blank())
 
# plot the right-most labels
 
sg1 = sg + geom_line(size=0.15) + 
  geom_text(data = subset(dfm, variable == "Rank.Win7.SP1"), 
            aes(x = factor(variable), label=sprintf(" %-2d %s",-(value),Family)), size = 1.75, hjust = 0) 
 
# plot the left-most labels
 
sg1 = sg1 + geom_text(data = subset(dfm, variable == "Rank.XP"), 
                     aes(x = factor(variable), label=sprintf("%s %2d ",Family,-(value))), size = 1.75, hjust = 1)
 
# this ratio seems to work well for png output
# you'll need to tweak font size for PDF output, but PDF will make post-processing in 
# Illustrator or Inkscape much easier.
 
ggsave("~/Desktop/malware.pdf", sg1, width = 8, height = 5, dpi = 150)

[Figure: malware slopegraph]

I really didn’t think the table told a story well and I truly believe slopegraphs are pretty good at telling stories.

This bit of R code is far from generic and requires the data investigator to do some work to make it an effective visualization, but (I think) it’s one of the better starts at a slopegraph library in R. It suffers from the same issues I’ve pointed out before, but it’s far fewer lines of code than my Python version and it handles multi-column slopegraphs quite nicely.

To be truly effective, you’ll need to plot the slopegraph first and then figure out which nodes to highlight and change the sgPalette accordingly to help the reader/viewer focus on what’s important.
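
Choosing the highlight colours by hand is the step most open to automation. As a rough sketch of one possible heuristic (not something the script above does), you could flag any family whose rank moves by at least some threshold between the first and last columns and grey out the rest. The sketch is in Python rather than R, and the family names, ranks, threshold and `pick_palette` helper are all made up for illustration.

```python
HIGHLIGHT = "#990000"
MUTED = "#CCCCCC"


def pick_palette(ranks, threshold=3):
    """Map each label to a colour based on how far its rank moved.

    ranks: dict of label -> list of ranks, one per column (left to right).
    """
    palette = {}
    for label, series in ranks.items():
        # compare the left-most and right-most columns only
        moved = abs(series[-1] - series[0])
        palette[label] = HIGHLIGHT if moved >= threshold else MUTED
    return palette


# made-up ranks for illustration (XP, Vista, Win7 RTM, Win7 SP1)
ranks = {
    "FamilyA": [1, 1, 2, 1],
    "FamilyB": [8, 6, 3, 2],
    "FamilyC": [4, 5, 5, 6],
}

for label, colour in pick_palette(ranks).items():
    print(label, colour)
```

The resulting colours would still need to be arranged in the order of the Family factor levels before being used as sgPalette.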

I’ll post everything on github once I recover from cross-country travel and, as always, welcome feedback and fixes.

We had our first real snowfall of the season in Maine today and that usually means school delays/closings. Our “local” station – @WCSH6 – has a Storm Center Closings page as well as an SMS notification service. I decided this morning that I needed a command line version (and, eventually, a version that sends me a Twitter DM), but I was also tight for time (a lunchtime meeting ending early is responsible for this blog post).

While I’ve consumed my share of Beautiful Soup and can throw down some mechanize with the best of them, it came to me that there may be an even easier way, and one that may also help sidestep the eventual blocking of such a scraping service.

I set up a Google Drive spreadsheet to use the importHTML formula to read in the closings table on the page:

=importHTML("http://www.wcsh6.com/weather/severe_weather/cancellations_closings/default.aspx","table",0)

Then I did a File→Publish to the web and set up Sheet 1 to “Automatically republish when changes are made” and also to have the link be to the CSV version of the data:

The raw output looks a bit like:

Name,Status,Last Updated
,,
Westbook Seniors,Luncheon PPD to January 7th,12/17/2012 5:22:51
,,
Allied Wheelchair Van Services,Closed,12/17/2012 6:49:47
,,
American Legion - Dixfield,Bingo cancelled,12/17/2012 11:44:12
,,
American Legion Post 155 - Naples,Closed,12/17/2012 12:49:00

The conversion has some “blank” lines but that’s easy enough to filter out with some quick bash:

curl --silent "https://docs.google.com/spreadsheet/pub?key=0AlCY1qfmPPZVdFBsX3kzLUVHZl9Mdmw3bS1POWNsWnc&single=true&gid=0&output=csv" | grep -v "^,,"

And, looking for the specific school(s) of our kids is an easy grep as well.
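
For anyone who would rather stay in Python than pipe through grep, here is a rough sketch of the same filtering. The `closings_for` function is hypothetical and the sample text simply mirrors the raw output shown above. Fetching the published CSV URL is left out so the sketch runs offline.

```python
import csv
import io


def closings_for(csv_text, *keywords):
    """Return non-blank closing rows whose name contains any keyword.

    With no keywords, every non-blank row is returned.
    """
    rows = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the "Name,Status,Last Updated" header line
    for row in reader:
        # importHTML leaves ",," filler rows; drop anything with no content
        if not any(cell.strip() for cell in row):
            continue
        name = row[0].lower()
        if not keywords or any(k.lower() in name for k in keywords):
            rows.append(row)
    return rows


sample = """Name,Status,Last Updated
,,
Westbook Seniors,Luncheon PPD to January 7th,12/17/2012 5:22:51
,,
Allied Wheelchair Van Services,Closed,12/17/2012 6:49:47
,,
American Legion - Dixfield,Bingo cancelled,12/17/2012 11:44:12
"""

for name, status, updated in closings_for(sample, "legion"):
    print(name, "|", status, "|", updated)
```

Passing several keywords (one per school, say) returns any row that matches at least one of them.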

The reason this is interesting is that the importHTML is dynamic and will re-convert the HTML table each time the code retrieves the CSV URL. Couple that with the fact that it’s far less likely that Google will be blocked than it is my IP address(es) and this seems to be a pretty nice alternative to traditional parsing.

If I get some time over the break, I’ll do a quick benchmark of using this method over some Python and Perl scraping/parsing methods.

I played around with OSE Firewall for WordPress for a couple days to see if it was worth switching to from the plugin I was previously using. It’s definitely not as full featured and I didn’t see any WP database extensions where it kept a log I could review/analyze, so I whipped up a little script to extract all the alert data from the Gmail account I set up for it to log to.

The script below – while focused on getting OSE Firewall alert data – can be easily modified to search for other types of automated/formatted e-mails and build a CSV file with the results. Remember, tho, that you’re going to be putting your e-mail credentials in this file (if you end up using it) so either use a mailbox you don’t care about or make sure you use sane permissions on the script and keep it somewhere safe.

I tested it on Linux boxes, but it should work anywhere you have Python and mailbox access.

I highly doubt there will be any updates to this version (I’m not using OSE Firewall anymore), but you can grab the source below or on github. There should be sufficient annotation in the comments, but if you have any questions, drop a note in the comments.

# oswfw.py - extract WordPress OSE Firewall mail alerts to CSV
# 
# Author: @hrbrmstr
#

import imaplib
import datetime
import re

# get yesterday's date, so the optional SENTSINCE search below covers
# roughly the last day of hits (in the event you are just reporting on today's hits)
date = (datetime.date.today() - datetime.timedelta(1)).strftime("%d-%b-%Y")

# setup IMAP connection

gmail = imaplib.IMAP4_SSL('imap.gmail.com',993) # use your IMAP server if not Gmail
gmail.login("YOUR_IMAP_USERNAME","YOUR_PASSWORD")
gmail.select('[Gmail]/All Mail') # Your IMAP's "all mail" if not using Gmail

# now search for all mails with "OSE Firewall" in the subject

# uncomment this line and comment out the next one to just get results from 'today'
#result, data = gmail.uid('search', None, '(SENTSINCE {date} HEADER Subject "OSE Firewall*")'.format(date=date))
result, data = gmail.uid('search', None, '(HEADER Subject "OSE Firewall*")')

# setup CSV file for output

f = open("osefw.csv", "w+")
f.write("Date,IP,URI,Method,UserAgent,Referer\n")

# cycle through result set from IMAP search query, extracting salient info
# from headers/body of each found message

for msg in data[0].split():

    # fetch the msg for the UID
    res, msg_txt = gmail.uid('fetch', msg, '(RFC822)')

    # decode the raw message (it is bytes under Python 3) and get rid of carriage returns
    body = msg_txt[0][1].decode('utf-8', 'replace').replace('\r', '')

    # extract salient fields from the message body/header
    DATE = re.findall(r'^Date: (.*?)$', body, re.M)
    IP = re.findall(r'^FROM IP: http://whois\.domaintools\.com/(.*?)$', body, re.M)
    URI = re.findall(r'^URI: (.*?)$', body, re.M)
    METHOD = re.findall(r'^METHOD: (.*?)$', body, re.M)
    USERAGENT = re.findall(r'^USERAGENT: (.*?)$', body, re.M)
    REFERER = re.findall(r'^REFERER: (.*?)$', body, re.M)

    # format for CSV output
    ose_log  = "%s,%s,%s,%s,%s,%s\n" % (DATE, IP, URI, METHOD, USERAGENT, REFERER)

    # quicker to replace array output brackets than to deal with non-array results checking
    f.write(re.sub(r"[\[\]]", "", ose_log))

    f.flush()

gmail.logout()
f.close()
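
One wrinkle with the string-formatting approach above is that values such as the Date header contain commas of their own. A minimal sketch of an alternative, assuming the same field labels the script searches for, is to let Python's csv module handle the quoting. The `extract_fields` helper and the sample message are hypothetical, made up purely for illustration.

```python
import csv
import io
import re

# label -> regex; these mirror the patterns used in the script above
FIELDS = [
    ("Date", r"^Date: (.*?)$"),
    ("IP", r"^FROM IP: http://whois\.domaintools\.com/(.*?)$"),
    ("URI", r"^URI: (.*?)$"),
    ("Method", r"^METHOD: (.*?)$"),
    ("UserAgent", r"^USERAGENT: (.*?)$"),
    ("Referer", r"^REFERER: (.*?)$"),
]


def extract_fields(body):
    """Return the first match for each field, or '' when a field is absent."""
    body = body.replace("\r", "")
    row = []
    for _, pattern in FIELDS:
        match = re.search(pattern, body, re.M)
        row.append(match.group(1) if match else "")
    return row


# made-up message in the general shape the script expects
sample = (
    "Date: Mon, 17 Dec 2012 13:16:00 -0500\r\n"
    "Subject: OSE Firewall alert\r\n"
    "\r\n"
    "FROM IP: http://whois.domaintools.com/192.0.2.10\r\n"
    "URI: /wp-login.php\r\n"
    "METHOD: POST\r\n"
    "USERAGENT: ExampleBot/1.0\r\n"
)

out = io.StringIO()  # stands in for open("osefw.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow([name for name, _ in FIELDS])
writer.writerow(extract_fields(sample))
print(out.getvalue())
```

The csv writer wraps the Date value in double quotes, so its embedded comma no longer shifts the remaining columns.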

The Fund For Peace (FFP) and Foreign Policy jointly released the 2012 version of the “failed states index” (FSI). From the FFP site, the FSI:

…focuses on the indicators of risk and is based on thousands of articles and reports that are processed by our CAST Software from electronically available sources.

I read it every year (mostly due to being an ardent reader of Foreign Policy magazine) and find the rankings, methodology & insights quite intriguing. With my recent work on slopegraphs, I thought this would be a good data set to play with to determine what – if any – features were necessary to support rank order (and to provide some impetus to finally refactor the code to support multi-column slopegraphs…more on that later).

However, I was not looking forward to transcribing the data from the Flash visualization on the Foreign Policy web site. There are HTML grids on the FFP site but I really just wanted the overall rankings (i.e. no sub-indices) and noticed this interesting scrollable mini-grid on one of the FFP FSI pages:

Thankfully[?] it’s an IFRAME and I was able to pull 2010, 2011 & 2012 data in a very usable format by manipulating this URL: http://www.fundforpeace.org/global/tables/fsiindex2010_sml.htm.

After some quick transformations, I had two CSV files for a 2010-2012 comparison and a 2011-2012 comparison.

(Before continuing, I feel the need to point out that the data, methodology, etc is 100% Copyright © 2012 The Fund for Peace as they overtly point out many times on their site.)

When I threw the data into the slopegraph tool, it was immediately obvious that I was missing something important: the ability to specify sort order for the data. For most slopegraphs, the code works well since our brains expect the larger values on the top. For a rank-order slopegraph, that sort order (for the most part) should be ascending vs descending to best represent changes in rank position. It does feel odd that being “#1” in the FSI actually means you’re really a loser, but I didn’t make the rules for their index.
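
To make the sort-order point concrete, here is a small sketch of the idea. It is not the actual PySlopegraph code; the function name, the `rank_order` flag and the numbers are invented for illustration.

```python
def top_to_bottom(column, rank_order=False):
    """Order labels from the top of a slopegraph column to the bottom.

    Regular slopegraph: largest value on top (descending sort).
    Rank-order slopegraph: rank #1 on top (ascending sort).
    """
    return sorted(column, key=column.get, reverse=not rank_order)


# made-up values, not real FSI data
scores = {"A": 98.5, "B": 112.1, "C": 104.3}
ranks = {"A": 3, "B": 1, "C": 2}

print(top_to_bottom(scores))                  # ['B', 'C', 'A']
print(top_to_bottom(ranks, rank_order=True))  # ['B', 'C', 'A']
```

Both calls put the same label on top: the highest score and rank #1 describe the same position, which is why the rank column needs the opposite sort direction.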

So, PySlopegraph now handles two column rank order slopegraphs and, as you’ll see in part two, also handles multi-column slopegraphs (but that bit needs some work). The code will be up on github in a couple days as I’ve also got some half-finished support for Processing.js and Paper.js that I want to finish before another push. If anyone needs it sooner, just @ or DM me.

Now, For The Data

The “Top 25” (that sounds way too positive for what it really means) slopegraph is the easiest to read (as it’s the smallest). It is also where Foreign Policy & FFP focus some dataviz effort as well (though they do have visualizations for all the data). Here’s the slopegraph showing the rank order change from 2010 to 2012:

The full slopegraphs are tall slopegraphs (I’ve been prototyping some ways to make tall ones more useful, but that’s nowhere near ready for public consumption). You may just want to grab the two PDFs and look there vs in this post:

Rank Order Comparison :: 2010/2012


Rank Order Comparison :: 2011/2012

While it requires scrolling, the changes in rank are immediately noticeable as is the fact that the FFP folk allow for ties that leave “holes” in the table. I think you really get a feel for which countries are stable, improving and declining very quickly with the slopegraph version, but I’d like to hear your thoughts if you have an opinion you’d like to share.

Stay tuned for part two!

UPDATE: It seems my use of <script async> optimization for Raphaël busted the inline slopegraph generation. Will work on tweaking the example posts to wait for Raphaël to load when I get some time.

So, I had to alter this to start after a user interaction. It loaded fine as a static, local page but seems to get a bit wonky embedded in a complex page. I also see some artifacts in Chrome but not in Safari. Still, not a bad foray into basic animation.


There were enough eye-catching glitches in the experimental javascript support and the ugly large-number display in the spam example post that I felt compelled to make a couple formatting tweaks in the code. I also didn’t have time to do “real” work on the codebase this weekend.

So, along with spacing adjustments, there’s now an “add_commas” non-mandatory option that will toss commas in large numbers so they’re easy to read. Here’s an example of the new output (both the Raphaël display and commas):
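
A minimal sketch of what such an option might do is below. It is not the actual PySlopegraph code; the `format_value` helper is hypothetical and relies on Python's standard `,` format specifier.

```python
def format_value(value, add_commas=False):
    """Format a numeric label, optionally with thousands separators."""
    if add_commas:
        return format(value, ",")
    return str(value)


print(format_value(1234567))                   # 1234567
print(format_value(1234567, add_commas=True))  # 1,234,567
```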


As usual, it’s up on github.

In preparation for the upcoming 1.0 release and with the hopes of laying a foundation for more interactive slopegraphs, I threw together some rudimentary output support over lunch today for Raphaël, which means that all you have to do is generate a new slopegraph with the “js” output type and include the salient portions of the generated html/css/javascript into a web page (along with including the Raphaël script code).

The next github push will have this update. Here’s an example of the output, using the classic Tufte example chart:


It’s definitely a bit rough around the edges (my eyes immediately fixate upon spacing discrepancies) and lacking any interactivity, but the basic building blocks are in place. It also does not render on my Android phone (HTC Incredible 2) but it does render in Chrome, Safari & on my iPad. Embedding a Raphaël graphic in a web page will definitely have advantages over a PNG or PDF in most situations even if it’s not interactive, so I’ll probably keep the support in regardless of whether I continue to improve upon it.

As I was playing with the code, I kept thinking how neat it would be if there was a “Raphaël Cairo surface” option. Perhaps that will be a side project if all goes well, since it would not be that much more complicated (in fact, it may be less complicated) than the Cairo SVG surface code.