Category Archives: Open Source

While you can (and should) view [all the presentations](https://speakerdeck.com/pyconslides) from #PyCon2013, here are my picks for the ones that interested me the most, as they focus on scaling, mapping, automation (both web & electronics) and data analysis:

– [Chef: Why you should automate your web infrastructure](https://speakerdeck.com/pyconslides/chef-why-you-should-automate-your-web-infrastructure-by-kate-heddleston) by Kate Heddleston
– [Messaging at Scale at Instagram](https://speakerdeck.com/pyconslides/messaging-at-scale-at-instagram-by-rick-branson) by Rick Branson
– [Python at Netflix](https://speakerdeck.com/pyconslides/python-at-netflix-by-jeremy-edberg-corey-bertram-and-roy-rapoport) by Jeremy Edberg, Corey Bertram, and Roy Rapoport
– [Real-time Tracking and Mapping of Geographic Objects](https://speakerdeck.com/pyconslides/real-time-tracking-and-mapping-of-geographic-objects-by-ragi-burhum) by Ragi Burhum
– [Scaling Realtime at DISQUS](https://speakerdeck.com/pyconslides/scaling-realtime-at-disqus-by-adam-hitchcock) by Adam Hitchcock
– [A Crash Course in MongoDB](https://speakerdeck.com/pyconslides/a-crash-course-in-mongodb)
– [Server Log Analysis with Pandas](https://speakerdeck.com/pyconslides/server-log-analysis-with-pandas-by-taavi-burns) by Taavi Burns
– [Who’s There – Home Automation with Arduino and RaspberryPi](https://speakerdeck.com/pyconslides/whos-there-home-automation-with-arduino-and-raspberrypi-by-rupa-dachere) by Rupa Dachere
– [Why you should use Python 3 for text processing](https://speakerdeck.com/pyconslides/why-you-should-use-python-3-for-text-processing-by-david-mertz) by David Mertz
– [Awesome Big Data Algorithms](https://speakerdeck.com/pyconslides/awesome-big-data-algorithms-by-titus-brown) by Titus Brown

A huge thanks to the speakers and conference organizers for making these resources freely available, especially to those of us who were not able to attend the conference.

Many thanks to all who attended the talk @jayjacobs & I gave at RSA on Tuesday, February 26, 2013. It was really great to be able to talk to so many of you afterwards as well.

We mentioned quite a bit of information in the presentation that wasn’t on the slides, and we wanted to aggregate it into a blog post so you can viz along at home. If you need more of a guided path, I strongly encourage you to take a look at some of the free courses over at [Coursera](https://www.coursera.org/).

For starters, here’s a bit.ly bundle of data analysis & visualization bookmarks that @dseverski & I maintain. We’ve been doing (IMO) a pretty good job adding new resources as they come up, and it may have some duplicates of the ones below.

People Mentioned

– [Stephen Few’s Perceptual Edge blog](http://www.perceptualedge.com/) : Start from the beginning to learn from a giant in information visualization
– [Andy Kirk’s Visualising Data blog](http://www.visualisingdata.com/) (@visualisingdata) : Perhaps the quintessential leader in the modern visualization movement.
– [Mike Bostock’s blog](http://bost.ocks.org/mike/) (@mbostock) : Creator of D3 and producer of amazing, interactive graphics for the @NYTimes
– [Edward Tufte’s blog](http://www.edwardtufte.com/tufte/) : The father of what we would now identify as our core visualization principles & practices

Tools Mentioned

– [R](http://www.r-project.org/) : Jay & I probably use this a bit too much as a hammer (i.e. treat every data project as a nail) but it’s just far too flexible and powerful not to use as a go-to resource
– [RStudio](http://www.rstudio.com/) : An *amazing* IDE for R. I, personally, usually despise IDEs (yes, I even dislike Xcode), but RStudio truly improves workflow by several orders of magnitude. There are both desktop and server versions of it; the latter gives you the ability to set up a multi-user environment and use the IDE from practically anywhere you are. RStudio also makes generating [reproducible research](http://cran.r-project.org/web/views/ReproducibleResearch.html) a joy with built-in easy access to tools like [knitr](http://yihui.name/knitr/).
– [IPython](http://ipython.org/) : This interactive environment takes an already amazing language and kicks it up a few notches. It brings Python up to the level of R+RStudio, especially with its knitr-like [IPython Notebooks](http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html) for–again–reproducible research.
– [Mondrian](http://www.theusrus.de/Mondrian/) : This tool needs far more visibility. It enables extremely quick visualization of even very large data sets. The interface takes a bit of getting used to, but it’s faster than typing R commands or fumbling in Excel.
– [Tableau](http://www.tableausoftware.com/) : This tool may be one of the most accessible, fast & flexible ways to explore data sets to get an idea of where you need to/can do further analysis.
– [Processing](http://processing.org/) : A tool that was designed from the ground up to help journalists create powerful, interactive data visualizations that you can slipstream directly onto the web via the [Processing.js](http://processingjs.org/) library.
– [D3](http://d3js.org/) : The foundation of modern, data-driven visualization on the web.
– [Gephi](https://gephi.org/) : A very powerful tool when you need to explore networks & create beautiful, publication-worthy visualizations.
– [MongoDB](http://www.mongodb.org/) : NoSQL database that’s highly & easily scalable without a steep learning curve.
– [CRUSH Tools by Google](https://code.google.com/p/crush-tools/) : Kicks up your command-line data munging.

(NOTE: The best place to keep up with progress is github, but you can always search for “slopegraph” here or check the “slopegraph” tag page regularly.)

I’ve been a bit obsessed with slopegraphs (a.k.a. “Tufte table-charts”) of late and very dissatisfied with the lack of tools that would make this particular visualization type more prevalent. While my ultimate goal is to have a user-friendly modern web app or platform app that’s as easy as a “drag & drop” of a CSV file, this first foray requires a bit (not much, really!) of elbow grease to use.

For those who want to get right to the code, head on over to github and have a look (I’ll post all updates there). Setup, sample & source are also below.

First, you’ll need a modern Python install. I did all the development on OS X Mountain Lion (beta) with the stock Python 2.7 build. You’ll also need the Cairo 2D graphics library (plus its Python bindings, since the script imports cairo), which built and installed perfectly from source, even on ML, so it should work fine for you. If you want something besides PDF rendering, you may need additional libraries, but PDF is decent for hi-res embedding, converting to jpg/png (see below) and tweaking in programs like Illustrator.

If you search for “Gender Comparisons” in the comments on this post at Tufte’s blog, you’ll see what I was trying to reproduce in this bit of skeleton code (below). By modifying which CSV file you’re using [line 21] and which fields are relevant [lines 45-47], you should be able to make your own basic slopegraphs without much trouble.
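
For example (a purely hypothetical file and column layout, just for illustration), if your CSV were named mydata.csv with a label in the first column and the two values you want to compare in the second and third columns, those three lines would become something like:

    # hypothetical CSV: label,start_value,end_value
    slopeReader = csv.reader(open('mydata.csv', 'rb'), delimiter=',', quotechar='"')  # line 21

    lab = row[0] # label (left-hand text)      -- line 45
    beg = row[1] # value for the left column   -- line 46
    end = row[2] # value for the right column  -- line 47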

If you catch any glitches, add a tweak or have a slopegraph “wish list”, let me know here, on twitter (@hrbrmstr) or over at github.

  1. # slopegraph.py
  2. #
  3. # Author: Bob Rudis (@hrbrmstr)
  4. #
  5. # Basic Python skeleton to do simple two value slopegraphs
  6. # with output to PDF (most useful form for me...Cairo has tons of options)
  7. #
  8. # Find out more about & download Cairo here:
  9. # http://cairographics.org/
  10. #
  11. # 2012-05-28 - 0.5 - Initial github release. Still needs some polish
  12. #
  13.  
  14. import csv
  15. import cairo
  16.  
  17. # original data source: http://www.calvin.edu/~stob/data/television.csv
  18.  
  19. # get a CSV file to work with 
  20.  
  21. slopeReader = csv.reader(open('television.csv', 'rb'), delimiter=',', quotechar='"')
  22.  
  23. starts = {} # starting "points"
  24. ends = {} # ending "points"
  25.  
  26. # Need to refactor label max width into font calculations
  27. # as there's no guarantee the longest (character-wise)
  28. # label is the widest one
  29.  
  30. startLabelMaxLen = 0
  31. endLabelMaxLen = 0
  32.  
  33. # build a base pair array for the final plotting
  34. # wastes memory, but simplifies plotting
  35.  
  36. pairs = []
  37.  
  38. for row in slopeReader:
  39.  
  40. 	# add chosen values (need start/end for each CSV row)
  41. 	# to the final plotting array. Try this sample with 
  42. 	# row[1] (average life span) instead of row[5] to see some
  43. 	# of the scaling in action
  44.  
  45. 	lab = row[0] # label
  46. 	beg = row[5] # male life span
  47. 	end = row[4] # female life span
  48.  
  49. 	pairs.append( (float(beg), float(end)) )
  50.  
  51. 	# combine labels of common values into one string
  52. 	# also (as noted previously, inappropriately) find the
  53. 	# longest one
  54.  
  55. 	if beg in starts:
  56. 		starts[beg] = starts[beg] + "; " + lab
  57. 	else:
  58. 		starts[beg] = lab
  59.  
  60. 	if ((len(starts[beg]) + len(beg)) > startLabelMaxLen):
  61. 		startLabelMaxLen = len(starts[beg]) + len(beg)
  62. 		s1 = starts[beg]
  63.  
  64.  
  65. 	if end in ends:
  66. 		ends[end] = ends[end] + "; " + lab
  67. 	else:
  68. 		ends[end] = lab
  69.  
  70. 	if ((len(ends[end]) + len(end)) > endLabelMaxLen):
  71. 		endLabelMaxLen = len(ends[end]) + len(end)
  72. 		e1 = ends[end]
  73.  
  74. # sort all the values (in the event the CSV wasn't) so
  75. # we can determine the smallest increment we need to use
  76. # when stacking the labels and plotting points
  77.  
  78. startSorted = [(k, starts[k]) for k in sorted(starts)]
  79. endSorted = [(k, ends[k]) for k in sorted(ends)]
  80.  
  81. startKeys = sorted(starts.keys(), key=float)
  82. delta = float('inf') # start impossibly large; any real gap will be smaller
  83. for i in range(len(startKeys)):
  84. 	if (i+1 <= len(startKeys)-1):
  85. 		currDelta = float(startKeys[i+1]) - float(startKeys[i])
  86. 		if (currDelta < delta):
  87. 			delta = currDelta
  88.  
  89. endKeys = sorted(ends.keys(), key=float)
  90. for i in range(len(endKeys)):
  91. 	if (i+1 <= len(endKeys)-1):
  92. 		currDelta = float(endKeys[i+1]) - float(endKeys[i])
  93. 		if (currDelta < delta):
  94. 			delta = currDelta
  95.  
  96. # we also need to find the absolute min & max values
  97. # so we know how to scale the plots
  98.  
  99. lowest = min(float(k) for k in startKeys)
  100. if (min(float(k) for k in endKeys) < lowest) : lowest = min(float(k) for k in endKeys)
  101.  
  102. highest = max(float(k) for k in startKeys)
  103. if (max(float(k) for k in endKeys) > highest) : highest = max(float(k) for k in endKeys)
  104.  
  105. # just making sure everything's a number
  106. # probably should move some of this to the csv reader section
  107.  
  108. delta = float(delta)
  109. lowest = float(lowest)
  110. highest = float(highest)
  111. startLabelMaxLen = float(startLabelMaxLen)
  112. endLabelMaxLen = float(endLabelMaxLen)
  113.  
  114. # setup line width and font-size for the Cairo
  115. # you can change these and the constants should
  116. # scale the plots accordingly
  117.  
  118. FONT_SIZE = 9
  119. LINE_WIDTH = 0.5
  120.  
  121. # there has to be a better way to get a base "surface"
  122. # to do font calculations besides this. we're just making
  123. # this Cairo surface so we know the max pixel width 
  124. # (font extents) of the labels in order to scale the graph
  125. # accurately (since width/height are based, in part, on it)
  126.  
  127. filename = 'slopegraph.pdf'
  128. surface = cairo.PDFSurface (filename, 8.5*72, 11*72)
  129. cr = cairo.Context (surface)
  130. cr.save()
  131. cr.select_font_face("Sans", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
  132. cr.set_font_size(FONT_SIZE)
  133. cr.set_line_width(LINE_WIDTH)
  134. xbearing, ybearing, sWidth, sHeight, xadvance, yadvance = (cr.text_extents(s1))
  135. xbearing, ybearing, eWidth, eHeight, xadvance, yadvance = (cr.text_extents(e1))
  136. xbearing, ybearing, spaceWidth, spaceHeight, xadvance, yadvance = (cr.text_extents(" "))
  137. cr.restore()
  138. cr.show_page()
  139. surface.finish()
  140.  
  141. # setup some more constants for plotting
  142. # all of these are malleable and should cascade nicely
  143.  
  144. X_MARGIN = 10
  145. Y_MARGIN = 10
  146. SLOPEGRAPH_CANVAS_SIZE = 200
  147. spaceWidth = 5
  148. LINE_HEIGHT = 15
  149. PLOT_LINE_WIDTH = 0.5
  150.  
  151. width = (X_MARGIN * 2) + sWidth + spaceWidth + SLOPEGRAPH_CANVAS_SIZE + spaceWidth + eWidth
  152. height = (Y_MARGIN * 2) + (((highest - lowest + 1) / delta) * LINE_HEIGHT)
  153.  
  154. # create the real Cairo surface/canvas
  155.  
  156. filename = 'slopegraph.pdf'
  157. surface = cairo.PDFSurface (filename, width, height)
  158. cr = cairo.Context (surface)
  159.  
  160. cr.save()
  161.  
  162. cr.select_font_face("Sans", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
  163. cr.set_font_size(FONT_SIZE)
  164.  
  165. cr.set_line_width(LINE_WIDTH)
  166. cr.set_source_rgba (0, 0, 0) # need to make this a constant
  167.  
  168. # draw start labels at the correct positions
  169. # cheating a bit here as the code doesn't (yet) line up 
  170. # the actual data values
  171.  
  172. for k in sorted(startKeys):
  173.  
  174. 	label = starts[k]
  175. 	xbearing, ybearing, lWidth, lHeight, xadvance, yadvance = (cr.text_extents(label))
  176.  
  177. 	val = float(k)
  178.  
  179. 	cr.move_to(X_MARGIN + (sWidth - lWidth), Y_MARGIN + (highest - val) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
  180. 	cr.show_text(label + " " + k)
  181. 	cr.stroke()
  182.  
  183. # draw end labels at the correct positions
  184. # cheating a bit here as the code doesn't (yet) line up 
  185. # the actual data values
  186.  
  187. for k in sorted(endKeys):
  188.  
  189. 	label = ends[k]
  190. 	xbearing, ybearing, lWidth, lHeight, xadvance, yadvance = (cr.text_extents(label))
  191.  
  192. 	val = float(k)
  193.  
  194. 	cr.move_to(width - X_MARGIN - eWidth - (4*spaceWidth), Y_MARGIN + (highest - val) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
  195. 	cr.show_text(k + " " + label)
  196. 	cr.stroke()
  197.  
  198. # do the actual plotting
  199.  
  200. cr.set_line_width(PLOT_LINE_WIDTH)
  201. cr.set_source_rgba (0.75, 0.75, 0.75) # need to make this a constant
  202.  
  203. for s1,e1 in pairs:
  204. 	cr.move_to(X_MARGIN + sWidth + spaceWidth + 20, Y_MARGIN + (highest - s1) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
  205. 	cr.line_to(width - X_MARGIN - eWidth - spaceWidth - 20, Y_MARGIN + (highest - e1) * LINE_HEIGHT * (1/delta) + LINE_HEIGHT/2)
  206. 	cr.stroke()
  207.  
  208. cr.restore()
  209. cr.show_page()
  210. surface.finish()
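
Running the script (python slopegraph.py) drops slopegraph.pdf next to it. For the jpg/png conversion mentioned above, one quick option on OS X (just one way to do it; ImageMagick or Ghostscript work equally well) is to rasterize the PDF with the built-in sips tool, which you can even do straight from Python:

    # optional: rasterize the generated PDF to a PNG using OS X's sips utility
    import subprocess

    subprocess.call(["sips", "-s", "format", "png",
                     "slopegraph.pdf", "--out", "slopegraph.png"])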

Starting sometime mid-year in 2011, I began having more ‘stuff’ to do than even my eidetic memory could help with. It’s not that I forgot things, per se, but the ability to mentally recall and prioritize work, family, personal and other tasks finally required some external assistance and I resolved to find a GTD system by the end of January.

Being an OS X user, I have some great choices out there (and the major ones have iOS sister-apps, too). However, I’m not just an OS X user. As I was saying to @myrcurial (and even @reillyusa) the other day, I dislike being locked in to proprietary solutions. Plus, the $120 price tag for OmniFocus (OS X + iPad) seemed like a king’s ransom, especially since I am also an Android user (OmniFocus only has an iOS app) and pay for both Dropbox and various virtual hosts. Believing that I still have some usable skills left, I decided to — as @hatlessec characterized my solution — cobble something together on my own.

Once upon a time, I did maintain a .plan file (back when I had sysadmin duties), but I really doubted the efficacy of it (and finger) in the age of the modern web. The thought of wrangling SQLite databases, parsing XML files or even digesting bits of JSON seemed like overkill for my purposes. Searching through my Evernote clippings, my memory was drawn back to one of my favorite sites, Lifehacker, which has regular GTD coverage. After re-poking around a bit, I decided to settle on @ginatrapani’s @todotxtapps, since it met the following requirements (in order):

  • It uses a plain text file with a simple structure – (no exposition necessary…the link is a quick read, the format will become second nature after a glance, and there’s a short sample sketched after this list)
  • It is Free (mostly) – mobile apps are ~$2.00USD each and if you need more than free Dropbox hosting and want a web interface, there are potential hosting costs. If you count your setup time as money, then add that in, too.
  • It runs on OS X, BSD, Windows & Linux – no platform lock-in
  • It has a thriving community – without being backed by a vendor (like the really #spiffy @omnigroup), a strong developer & user community is extremely important to ensure the longevity of the codebase. Todo.txt has very passionate developers and users who are very active on all fronts.
  • It is very extensible & integrable – I used @alfredapps to give me a quick OS X “GUI CLI” to the todo.sh commands. I built an Alfred keyword for my most used Todo.txt functions along with a generic one to bring up vim in a Terminal.app window for a free-form edit. Alfred’s shell-commands also give me @growlmac integration (so I get some feedback after working with tasks).

    I also integrated it with @geektool. I won’t steal the thunder from other GeekTool/Todo.txt integration posts (like this one). The GeekTool integration puts my todos right in front of me all the time on all my desktops.

    By storing my todo directory in @dropbox, it also makes syncing to my web site and mobile devices a snap.

    On my server, I have a simple cron job set up to e-mail me my todos at the beginning of the day (again, so they’re in front of me wherever I look); there’s a sample crontab entry after this list as well.

  • It runs on iOS AND Android – again, no platform lock-in
  • There’s an optional web interface – the one I linked to (there are others) is far from ideal, but it was quick to set up and has no overt security issues. Properly protected behind nginx or apache, you should have no issues if you need to have a web version handy.
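
To give a sense of the “simple structure”, here are a few (entirely made-up) lines from a todo.txt file: a leading “(A)” is a priority, “+” tags a project, “@” tags a context, and completed items start with “x” and the completion date:

    (A) 2013-03-01 Finish the RSA follow-up post +blog @computer
    (B) Order a new relay board for the Arduino +homeautomation @online
    Call the vet @phone
    x 2013-02-27 Renew domain registrations +hosting @online

The morning e-mail cron job is nothing fancy either; something along these lines (the path and address are placeholders, not my actual setup) mails the file out at 6:00 every day:

    0 6 * * * cat /path/to/Dropbox/todo/todo.txt | mail -s "Today's todos" you@example.com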

So, while the setup is a bit more than just downloading two commercial apps, it has many other benefits and isn’t too much more work if you already have some of the other pieces in place. If you want more info on the Alfred scripts or any other setup component, drop me a note in the comments.

While I’ve read about many GTD solutions and seen many user-stories of how they met their GTD needs, I’d be interested in what tools you use to ‘get things done’…