
(NOTE: You can keep up with progress best at github, but can always search on “slopegraph” here or just hit the tag page: “slopegraph” regularly)

I’ve been a bit obsessed with slopegraphs (a.k.a. “Tufte table-charts”) of late and very dissatisfied with the lack of tools for making this particular visualization more prevalent. While my ultimate goal is a user-friendly modern web app or platform app that’s as easy as a “drag & drop” of a CSV file, this first foray will require a bit (not much, really!) of elbow grease to use.

For those who want to get right to the code, head on over to github and have a look (I’ll post all updates there). Setup, sample & source are also below.

First, you’ll need a modern Python install. I did all the development on Mac OS Mountain Lion (beta) with the stock Python 2.7 build. You’ll also need the Cairo 2D graphics library (the script reaches it through `import cairo`, i.e. the Python bindings), which built and installed perfectly from source, even on ML, so it should work fine for you. If you want something besides PDF rendering, you may need additional libraries, but PDF is decent for hi-res embedding, converting to jpg/png (see below) and tweaking in programs like Illustrator.

If you search for “Gender Comparisons” in the comments on this post at Tufte’s blog, you’ll see what I was trying to reproduce in this bit of skeleton code (below). By modifying the CSV file you’re using [line 21] and then which fields are relevant [lines 45-47] you should be able to make your own basic slopegraphs without much trouble.
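If you would rather not edit index numbers scattered through the script, the column choice can be pulled into one small function. This is only a sketch under assumptions: `read_pairs` and its parameters are names invented here for illustration, the sample rows are made up (they are not the real television.csv data), and it is written for Python 3 rather than the 2.7 build the script targets.

```python
import csv
import io

def read_pairs(csv_file, label_col=0, start_col=5, end_col=4):
    """Return (label, start, end) tuples from an open CSV file object.

    The column defaults mirror the indexes used in slopegraph.py; the
    function itself is hypothetical and not part of the script.
    """
    rows = []
    for row in csv.reader(csv_file, delimiter=',', quotechar='"'):
        rows.append((row[label_col], float(row[start_col]), float(row[end_col])))
    return rows

# Made-up rows purely for illustration
sample = io.StringIO(
    'Alpha,0,0,0,75.5,70.0\n'
    'Beta,0,0,0,80.0,74.5\n'
)
pairs = read_pairs(sample)
print(pairs)
```

Swapping `start_col` and `end_col` for other indexes is then the only change needed to plot a different pair of fields.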

If you catch any glitches, add some tweaks or have a slopegraph “wish list”, let me know here, on Twitter (@hrbrmstr) or over at github.

```
# slopegraph.py
#
# Author: Bob Rudis (@hrbrmstr)
#
# Basic Python skeleton to do simple two value slopegraphs
# with output to PDF (most useful form for me...Cairo has tons of options)
#
# Find out more about & download Cairo here:
# http://cairographics.org/
#
# 2012-05-28 - 0.5 - Initial github release. Still needs some polish
#

import csv
import cairo

# original data source: http://www.calvin.edu/~stob/data/television.csv

# get a CSV file to work with 

slopeReader = csv.reader(open('television.csv', 'rb'), delimiter=',', quotechar='"')

starts = {} # starting "points"
ends = {} # ending "points"

# Need to refactor label max width into font calculations
# as there's no guarantee the longest (character-wise)
# label is the widest one

startLabelMaxLen = 0
endLabelMaxLen = 0

# build a base pair array for the final plotting
# wastes memory, but simplifies plotting

pairs = []

for row in slopeReader:

    # add chosen values (need start/end for each CSV row)
    # to the final plotting array. Try this sample with
    # row[1] (average life span) instead of row[5] to see some
    # of the scaling in action

    lab = row[0] # label
    beg = row[5] # male life span
    end = row[4] # female life span

    pairs.append( (float(beg), float(end)) )

    # combine labels of common values into one string
    # also (as noted previously, inappropriately) find the
    # longest one

    if beg in starts:
        starts[beg] = starts[beg] + "; " + lab
    else:
        starts[beg] = lab

    if ((len(starts[beg]) + len(beg)) > startLabelMaxLen):
        startLabelMaxLen = len(starts[beg]) + len(beg)
        s1 = starts[beg]


    if end in ends:
        ends[end] = ends[end] + "; " + lab
    else:
        ends[end] = lab

    if ((len(ends[end]) + len(end)) > endLabelMaxLen):
        endLabelMaxLen = len(ends[end]) + len(end)
        e1 = ends[end]

# sort all the values (in the event the CSV wasn't) so
# we can determine the smallest increment we need to use
# when stacking the labels and plotting points

startSorted = [(k, starts[k]) for k in sorted(starts, key=float)]
endSorted = [(k, ends[k]) for k in sorted(ends, key=float)]

startKeys = sorted(starts.keys(), key=float) # keys are strings, so sort numerically
delta = float('inf') # smallest gap found so far
for i in range(len(startKeys)):
    if (i+1 <= len(startKeys)-1):
        currDelta = float(startKeys[i+1]) - float(startKeys[i])
        if (currDelta < delta):
            delta = currDelta

endKeys = sorted(ends.keys(), key=float)
for i in range(len(endKeys)):
    if (i+1 <= len(endKeys)-1):
        currDelta = float(endKeys[i+1]) - float(endKeys[i])
        if (currDelta < delta):
            delta = currDelta

# we also need to find the absolute min & max values
# so we know how to scale the plots

lowest = min(startKeys, key=float)
if (float(min(endKeys, key=float)) < float(lowest)) : lowest = min(endKeys, key=float)

highest = max(startKeys, key=float)
if (float(max(endKeys, key=float)) > float(highest)) : highest = max(endKeys, key=float)

# just making sure everything's a number
# probably should move some of this to the csv reader section

delta = float(delta)
lowest = float(lowest)
highest = float(highest)
startLabelMaxLen = float(startLabelMaxLen)
endLabelMaxLen = float(endLabelMaxLen)

# setup line width and font-size for the Cairo
# you can change these and the constants should
# scale the plots accordingly

FONT_SIZE = 9
LINE_WIDTH = 0.5

# there has to be a better way to get a base "surface"
# to do font calculations besides this. we're just making
# this Cairo surface so we know the max pixel width
# (font extents) of the labels in order to scale the graph
# accurately (since width/height are based, in part, on it)

filename = 'slopegraph.pdf'
surface = cairo.PDFSurface (filename, 8.5*72, 11*72)
cr = cairo.Context (surface)
cr.save()
cr.select_font_face("Sans", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
cr.set_font_size(FONT_SIZE)
cr.set_line_width(LINE_WIDTH)
xbearing, ybearing, sWidth, sHeight, xadvance, yadvance = (cr.text_extents(s1))
xbearing, ybearing, eWidth, eHeight, xadvance, yadvance = (cr.text_extents(e1))
xbearing, ybearing, spaceWidth, spaceHeight, xadvance, yadvance = (cr.text_extents(" "))
cr.restore()
cr.show_page()
surface.finish()

# setup some more constants for plotting
# all of these are malleable and should cascade nicely

X_MARGIN = 10
Y_MARGIN = 10
SLOPEGRAPH_CANVAS_SIZE = 200
spaceWidth = 5
LINE_HEIGHT = 15
PLOT_LINE_WIDTH = 0.5

width = (X_MARGIN * 2) + sWidth + spaceWidth + SLOPEGRAPH_CANVAS_SIZE + spaceWidth + eWidth
height = (Y_MARGIN * 2) + (((highest - lowest + 1) / delta) * LINE_HEIGHT)

# create the real Cairo surface/canvas

filename = 'slopegraph.pdf'
surface = cairo.PDFSurface (filename, width, height)
cr = cairo.Context (surface)

cr.save()

cr.select_font_face("Sans", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
cr.set_font_size(FONT_SIZE)

cr.set_line_width(LINE_WIDTH)
cr.set_source_rgba (0, 0, 0) # need to make this a constant

# draw start labels at the correct positions
# cheating a bit here as the code doesn't (yet) line up
# the actual data values

for k in sorted(startKeys):

    label = starts[k]
    xbearing, ybearing, lWidth, lHeight, xadvance, yadvance = (cr.text_extents(label))

    val = float(k)

    cr.move_to(X_MARGIN + (sWidth - lWidth), Y_MARGIN + (highest - val) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
    cr.show_text(label + " " + k)
    cr.stroke()

# draw end labels at the correct positions
# cheating a bit here as the code doesn't (yet) line up
# the actual data values

for k in sorted(endKeys):

    label = ends[k]
    xbearing, ybearing, lWidth, lHeight, xadvance, yadvance = (cr.text_extents(label))

    val = float(k)

    cr.move_to(width - X_MARGIN - eWidth - (4*spaceWidth), Y_MARGIN + (highest - val) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
    cr.show_text(k + " " + label)
    cr.stroke()

# do the actual plotting

cr.set_line_width(PLOT_LINE_WIDTH)
cr.set_source_rgba (0.75, 0.75, 0.75) # need to make this a constant

for s1,e1 in pairs:
    cr.move_to(X_MARGIN + sWidth + spaceWidth + 20, Y_MARGIN + (highest - s1) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
    cr.line_to(width - X_MARGIN - eWidth - spaceWidth - 20, Y_MARGIN + (highest - e1) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
    cr.stroke()

cr.restore()
cr.show_page()
surface.finish()
```
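The vertical scaling in the script boils down to two pieces of arithmetic: find the smallest gap between neighbouring values (so the two closest labels still get a full line of height), then map each value to a y coordinate measured down from the highest value. Here is a rough, Cairo-free sketch of just that math. The function names `smallest_gap` and `y_position` are mine, the data is made up, and the fallback of 1.0 for a single value is an assumption rather than something the script does.

```python
def smallest_gap(values):
    """Smallest difference between adjacent distinct values, sorted numerically."""
    ordered = sorted(set(values))
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return min(gaps) if gaps else 1.0  # single-value fallback is an assumption

def y_position(val, highest, delta, line_height=15, y_margin=10):
    """Map a data value to a vertical coordinate; larger values sit nearer the top."""
    return y_margin + (highest - val) / delta * line_height + line_height / 2.0

starts = [70.0, 72.5, 74.0]  # made-up starting values
ends = [75.0, 78.0, 74.5]    # made-up ending values

delta = min(smallest_gap(starts), smallest_gap(ends))
highest = max(starts + ends)

for v in starts:
    print(v, y_position(v, highest, delta))
```

One small difference from the script: under Python 2.7, `LINE_HEIGHT/2` is integer division, so the script's half-line offset is 7 where this sketch uses 7.5.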

As promised, this post is a bit more graphical, but I feel the need to stress the importance of the first few points in chapter 2 of the book (i.e. the difference between mean and average and why variance is meaningful). These are fundamental concepts for future work.

The “pumpkin” example (2.1) gives us an opportunity to do some very basic R:

```
pumpkins <- c(1,1,1,3,3,591) #build an array
mean(pumpkins) #mean (average)
var(pumpkins) #variance
sd(pumpkins) #standard deviation
```

(as you can see, I’m still trying to find the best way to embed R source code)
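Since the first half of this page is Python, here is the same pumpkin arithmetic as a small Python sketch, assuming Python 3 and its standard `statistics` module. Like R’s `var` and `sd`, `statistics.variance` and `statistics.stdev` use the sample (n - 1) denominator, so the numbers should line up with the R output.

```python
import statistics

pumpkins = [1, 1, 1, 3, 3, 591]

mean = statistics.mean(pumpkins)          # arithmetic mean
variance = statistics.variance(pumpkins)  # sample variance (n - 1), like R's var()
std_dev = statistics.stdev(pumpkins)      # sample standard deviation, like R's sd()

print(mean, variance, std_dev)
```

The mean comes out at 100 even though five of the six pumpkins weigh 3 or less, which is exactly why the variance is worth looking at alongside it.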

We move from pumpkins to babies for Example 2.2 (you’ll need the whole bit of source from previous examples, which includes all the solutions in this example, to make the rest of the code snippets work). Here, we can quickly compute and compare the standard deviations (with difference) and the means (with difference) to help us analyze the statistical significance questions in the chapter:

```
sd(firstbabies$prglength)
sd(notfirstbabies$prglength)
sd(firstbabies$prglength) - sd(notfirstbabies$prglength)

mean(firstbabies$prglength)
mean(notfirstbabies$prglength)
mean(firstbabies$prglength) - mean(notfirstbabies$prglength)
```

You’ll see the power of R’s hist function in a moment, but you should be a bit surprised by the output if you enter the following to solve Example 2.3:

```
mode(firstbabies$prglength)
```

That’s right, R does not have a built-in function for the statistical mode (its mode function reports how an object is stored, e.g. “numeric”). It’s pretty straightforward to compute, tho:

```
names(sort(-table(firstbabies$prglength))[1])
```

(notice how “straightforward” != “simple”)

We have to use the table function to generate a table of value frequencies. It’s a named structure: each count is stored alongside the value it belongs to, with that value represented as a string name at the same position. Using “-” negates all the counts (but keeps the names attached), so sort, which orders ascending, puts the most frequent value first and we can use index “[1]” to get to the entry we’re looking for. By using the names function, we get the string representing the value with the highest frequency. You can see this iteratively by breaking out the code:

```
table(firstbabies$prglength)
str(table(firstbabies$prglength))
sort(table(firstbabies$prglength))
sort(table(firstbabies$prglength))[1] #without the "-"
sort(-table(firstbabies$prglength))[1]
names(sort(-table(firstbabies$prglength))[1])
```
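For comparison, the same count, rank and pick-the-first idea can be sketched in Python with `collections.Counter`. The data here is made up (it is not the pregnancy-length data set), and ties may not be broken the same way as in the R version, so treat it as an illustration of the steps rather than a drop-in equivalent.

```python
from collections import Counter

weeks = [39, 40, 39, 38, 39, 41, 40]  # made-up pregnancy lengths

counts = Counter(weeks)             # value -> frequency, like table()
ranked = counts.most_common()       # descending frequency, like sort(-table(...))
mode_value, mode_count = ranked[0]  # first entry is the most frequent value

print(mode_value, mode_count)
```

Unlike `names()` in R, this keeps the mode as a number rather than a string.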

There are a plethora of other ways to compute the mode, but this one seems to work well for my needs.

Pictures Or It Didn’t Happen

I did debate putting the rest of this post into a separate example, but if you’ve stuck through this far, you deserve some stats candy. It’s actually pretty tricky to do what the book does here:

So, we’ll start off with simple histogram plots of each set being compared:

```
hist(firstbabies$prglength)
```

```
hist(notfirstbabies$prglength)
```

I separated those out since hist by default displays the histogram and if you just paste the lines consecutively, you’ll only see the last histogram. What does display is, well, ugly and charts should be beautiful. It will take a bit to explain the details (in another post) but this should get you started:

```
par(mfrow=c(1,2))
hist(firstbabies$prglength, cex.lab=0.8, cex.axis=0.6, cex.main=0.8, las=1, col="white", ylim=c(0,3000),xlim=c(17,max(firstbabies$prglength)), breaks="Scott", main="Histogram of first babies", xlab="Weeks")
hist(notfirstbabies$prglength, cex.lab=0.8, cex.axis=0.6, cex.main=0.8, las=1, col="blue", ylim=c(0,3000),xlim=c(17,max(notfirstbabies$prglength)), breaks="Scott", main="Histogram of other babies", xlab="Weeks")
par(mfrow=c(1,1))
```

In the above code, we’re telling R to setup a canvas that will have one row and two plot areas. This makes it very easy to have many graphs on one canvas.

Next, the first hist sets up some label proportions (the cex parameters), tells R to make Y labels horizontal (las=1), makes the bars white, sets up sane values for the X & Y axes, instructs R to use the “Scott” algorithm for calculating sane bins (we’ll cover this in more detail next post) and sets up sane titles and X axis labels. Finally, we reset the canvas for the next plot.

There’s quite a bit to play with there and you can use the “help()” command to get information on the hist function and plot function. You can setup your own bin size by substituting an array for “Scott”. If you have specific questions, shoot a note in the comments, but I’ll explain more about what’s going on in the next post as we add in probability histograms and start looking at the data in more detail.
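If you are curious what “Scott” is doing before the next post gets to it, here is a rough Python sketch of Scott’s rule as I understand it: the bin width is 3.5 times the sample standard deviation times n^(-1/3), and the bin count is the data range divided by that width, rounded up. The function name `scott_bin_count` and the data are made up for illustration, and R treats the resulting count only as a suggestion when it picks “pretty” break points, so the bars you actually see can differ.

```python
import math
import statistics

def scott_bin_count(values):
    """Suggested number of bins from Scott's rule: h = 3.5 * sd * n^(-1/3)."""
    n = len(values)
    h = 3.5 * statistics.stdev(values) * n ** (-1.0 / 3.0)
    if h <= 0:
        return 1  # all values identical, one bin is enough
    return max(1, math.ceil((max(values) - min(values)) / h))

weeks = [30, 35, 37, 38, 39, 39, 39, 40, 40, 41, 42, 44]  # made-up data
print(scott_bin_count(weeks))
```

The takeaway is that the bin width grows with the spread of the data and shrinks slowly as you collect more observations.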