
Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

With 1.2 under our belts, we now move on to the example in section 1.3, which was designed to show us how to partition a larger set of data into subsets for analysis. In this case, we're going to jump to example 1.3.2 to determine the number of live births.

While the Python loop is easy to write, the R code is even easier:

  livebirths <- subset(pregnancies, outcome==1)

First: don’t let the <- throw you off. It's just a more mathematical presentation of "=" (the assignment operator). While later versions of R support using = for assignment operations, it's considered good form to continue to use the left arrow.
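
For example, these two statements do the same thing (a trivial illustration, not from the book):

  x <- 5   # conventional R assignment
  x = 5    # also legal at the top level, but <- is the preferred style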

The subset function will traverse pregnancies, keeping the records (rows) for which the boolean expression outcome == 1 is true and placing all of those records into livebirths.
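
Since the goal of example 1.3.2 is to count the live births, nrow() gives the answer directly from that subset (a quick aside; the full script below leans on str() for its output instead):

  nrow(livebirths)   # number of rows in the subset = number of live births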

You can combine as much field logic as you need in the subset call, as asked for by example 1.3.3:

  firstbabies <- subset(pregnancies, birthord==1 & outcome==1)

Since R was built for statistical computing, it's no surprise that to solve example 1.3.4 all we have to do is ask R to return the mean of that portion of the data frame:

  mean(firstbabies$prglength)
  mean(notfirstbabies$prglength)

(Here's a refresher on the basics of R data frame usage in case you skipped over that URL in the first post.)

To get the ~13 hour difference the text states, it's simple math: just subtract the two values, multiply by 7 (days in a week) and then again by 24 (hours in a day).
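
Using made-up round numbers purely to show the unit conversion (your actual means will come from the data frame):

  first_mean    <- 38.60   # hypothetical mean pregnancy length (weeks) for first babies
  notfirst_mean <- 38.52   # hypothetical mean for the others
  (first_mean - notfirst_mean) * 7 * 24   # difference in hours, roughly 13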

In the next post, we'll begin to tap into the more visual side of R, but for now, play around with the following source code as you finish working through chapter one of Think Stats (you can also download the book for free from Green Tea Press).

  # ThinkStats in R by @hrbrmstr
  # Example 1.3
  # File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm

  # setup a data frame that has the field start/end info

  pFields <- data.frame(name  = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb', 'birthwgt_oz', 'prglength', 'outcome', 'birthord', 'agepreg', 'finalwgt'),
                        begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
                        end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
  )

  # calculate widths so we can pass them to read.fwf()

  pFields$width <- pFields$end - pFields$begin + 1

  # we aren't reading every field (for the book exercises)

  pFields$skip <- (-c(pFields$begin[-1] - pFields$end[-nrow(pFields)] - 1, 0))

  widths <- c(t(pFields[,4:5]))
  widths <- widths[widths != 0]

  # read in the file

  pregnancies <- read.fwf("2002FemPreg.dat", widths)

  # assign column names

  names(pregnancies) <- pFields$name

  # divide mother's age by 100

  pregnancies$agepreg <- pregnancies$agepreg / 100

  # convert weight at birth from lbs/oz to total ounces

  pregnancies$totalwgt_oz <- pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz

  rFields <- data.frame(name  = c('caseid'),
                        begin = c(1),
                        end   = c(12)
  )

  rFields$width <- rFields$end - rFields$begin + 1
  rFields$skip <- (-c(rFields$begin[-1] - rFields$end[-nrow(rFields)] - 1, 0))

  widths <- c(t(rFields[,4:5]))
  widths <- widths[widths != 0]

  respondents <- read.fwf("2002FemResp.dat", widths)
  names(respondents) <- rFields$name

  # exercise 1
  # not exactly the same, but even more info is provided in the summary from str()

  str(respondents)
  str(pregnancies)

  # for exercise 2
  # use subset() on the data frames
  # again, lazy use of str() for output

  livebirths <- subset(pregnancies, outcome==1)
  str(livebirths)

  # exercise 3

  firstbabies <- subset(pregnancies, birthord==1 & outcome==1)
  notfirstbabies <- subset(pregnancies, birthord > 1 & outcome==1)

  str(firstbabies)
  str(notfirstbabies)

  # exercise 4

  mean(firstbabies$prglength)
  mean(notfirstbabies$prglength)

  hours <- (mean(firstbabies$prglength) - mean(notfirstbabies$prglength)) * 7 * 24
  hours

Think Stats (by Allen B. Downey) is a good book to get you familiar with statistics (and even Python, if you've done some scripting in other languages).

I thought it would be interesting to present some of the examples & exercises in the book in R. Why? Well, once you’ve gone through the material in a particular chapter the “hard way”, seeing how you’d do the same thing in a language specifically designed for statistical computing should show when it’s best to use such a domain specific language and when you might want to consider a hybrid approach. I am also hoping it helps make R a bit more accessible to folks.

You’ll still need the book and should work through the Python examples to get the most out of these posts.

I’ll try to get at least one example/exercise section up a week.

Please submit all errors, omissions or optimizations in the comments section.

The star of the show in most of the examples (including this one) is going to be the “data frame”. Unlike the Python code in the book, most of the hard work here is figuring out how to use the data frame file reader to parse the ugly fields in the CDC data file. By using some tricks, we can approximate the “field start:length” style of the Python code but still keep the automatic reading/parsing of the R code (including implicit handling of “NA” values).
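
The trick hinges on read.fwf() treating a negative width as “skip this many characters.” Interleaving each field's width with the (negative) gap to the next field produces a single widths vector that reads only the columns we care about. A toy illustration (not the actual NSFG layout):

  # two fields we want: characters 1-12 and a single character at position 22
  fields <- data.frame(begin = c(1, 22), end = c(12, 22))
  fields$width <- fields$end - fields$begin + 1                             # 12, 1
  fields$skip  <- -c(fields$begin[-1] - fields$end[-nrow(fields)] - 1, 0)   # -9, 0
  widths <- c(t(fields[, c("width", "skip")]))   # 12 -9  1  0
  widths <- widths[widths != 0]                  # 12 -9  1 : read 12, skip 9, read 1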

The power & simplicity of using R’s inherent ability to apply a calculation across a whole column (pregnancies$agepreg <- pregnancies$agepreg / 100) should jump out. Unfortunately, not all elements of the examples in R will be as nice or straightforward.

You'll also notice that I cheat and use str() for displaying summary data.

Enough explanation! Here's the code:

  # ThinkStats in R by @hrbrmstr
  # Example 1.2
  # File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm

  # setup a data frame that has the field start/end info

  pFields <- data.frame(name  = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb', 'birthwgt_oz', 'prglength', 'outcome', 'birthord', 'agepreg', 'finalwgt'),
                        begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
                        end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
  )

  # calculate widths so we can pass them to read.fwf()

  pFields$width <- pFields$end - pFields$begin + 1

  # we aren't reading every field (for the book exercises)

  pFields$skip <- (-c(pFields$begin[-1] - pFields$end[-nrow(pFields)] - 1, 0))

  widths <- c(t(pFields[,4:5]))
  widths <- widths[widths != 0]

  # read in the file

  pregnancies <- read.fwf("2002FemPreg.dat", widths)

  # assign column names

  names(pregnancies) <- pFields$name

  # divide mother's age by 100

  pregnancies$agepreg <- pregnancies$agepreg / 100

  # convert weight at birth from lbs/oz to total ounces

  pregnancies$totalwgt_oz <- pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz

  rFields <- data.frame(name  = c('caseid'),
                        begin = c(1),
                        end   = c(12)
  )

  rFields$width <- rFields$end - rFields$begin + 1
  rFields$skip <- (-c(rFields$begin[-1] - rFields$end[-nrow(rFields)] - 1, 0))

  widths <- c(t(rFields[,4:5]))
  widths <- widths[widths != 0]

  respondents <- read.fwf("2002FemResp.dat", widths)
  names(respondents) <- rFields$name

  str(respondents)
  str(pregnancies)

I’ve been an unapologetic Alfred user since @hatlessec recommended it and have recently been cobbling together quick shell scripts that make life a bit easier.

The following ones – lip & rip – copy your local & remote IP addresses (respectively) to your clipboard and also display a Growl message (if you’re a Growl user).

Nothing really special about them. They are each one-liners and are easily customizable once you install them.

Download: liprip.zip

I swear to fulfill, to the best of my ability and judgment, this covenant:

I will respect the hard-fought empirical gains of those practitioners in whose steps I walk, and gladly share such knowledge as is mine with those who are to follow.

I will apply, for the benefit of those who need it, all measures [that] are required, avoiding those twin traps of FUD and solutions that are unnecessary.

I will remember that there is art to security as well as science, and that respect, sympathy, and understanding may outweigh the metasploit or other blunt instruments.

I will not be ashamed to say “I don’t know”, nor will I fail to call in my colleagues when the skills of another are needed to solve a problem.

I will respect the privacy of those I serve, for their problems are not disclosed to me that the world may know. Most especially must I tread with care in matters of NPPI, PCI & HIPAA. If it is given to me to solve a problem, all thanks. But it may also be within my power to identify problems; this awesome responsibility must be faced with great humbleness and awareness of my own frailty. Above all, I must not play at God.

I will remember that I do not treat a server, a router, an application, but a fragile system, whose problems may affect a whole company and general economic stability. My responsibility includes these related problems, if I am to provide adequately for those that need help.

I will prevent issues from occurring whenever I can, for prevention is preferable to remediation.

I will remember that I remain a member of society with special obligations to all my fellow human beings, those sound of mind and body as well as those who also need assistance.

If I do not violate this oath, may I enjoy life and art, respected while I live and remembered with affection thereafter. May I always act so as to preserve the finest traditions of my calling and may I long experience the joy of aiding those who seek my help.

I usually take a peek at the Internet Traffic Report (ITR) a couple times a day as part of my routine and was a bit troubled by all of the red today:

I wanted to do some crunching on the data, and I deliberately do not have Word or Excel on my new MacBook Pro (for reasons I can detail if asked). A SELECT / CUT / PASTE into TextWrangler did not really thrill me and I knew there had to be a way to get non-marked-up, columnar data into a format I could mangle and share easily.

Enter Google Spreadsheets' importHTML function.

If you don't have the formula bar enabled in Google Spreadsheets, just go to View->Formula Bar to enable it. Once there, enter the following in the formula bar to get the data from the ITR into a set of columns that will auto-update every time you reference the sheet.

=importHTML("http://www.internettrafficreport.com/namerica.htm","table",0)

(as you can see, it’s not case sensitive, either)

Yes, I know Excel can do this. I could have done a quick script to whack the pasted data in TextWrangler. You can do something similar in R with htmlTreeParse + xpathApply, and Perl has HTML::TableContentParser (and other handy modules), but this was a fast, easy way to get me to a point where I could do the basic analytics I wanted to perform (and, sometimes, all you need is quick & easy).
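
For the curious, here's roughly what the R route looks like with the XML package (readHTMLTable() rides on the same parsing machinery as htmlTreeParse()/xpathApply(); which table index you actually want is a guess that depends on the page's markup):

  library(XML)

  # pull the traffic table out of the ITR page as a data frame
  # (the 'which' index is an assumption -- inspect the page to pick the right table)
  itr <- readHTMLTable("http://www.internettrafficreport.com/namerica.htm", which = 1)
  head(itr)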

Official Google Help page on importHTML.

It’s rare that two of my passions—food and information security—intersect, but thanks to the USDA’s announcement of their Blueprint For Stronger Service, I can touch on both in one post.

In 2011, the Obama administration challenged all departments to reduce costs in an effort dubbed the “Campaign to Cut Waste”. In response, the USDA has managed to trim annual expenses by $150 million through a number of efforts. One such effort is to close 259 domestic USDA offices (you can see which states are impacted below).

I’m going to expand on why this is a bad idea over at #nom later this week, but 2011 was not a good year in terms of controlling food poisoning in the United States and I don’t think closing offices will make for better oversight.

Other efforts focus on the elimination of redundancies and inefficiencies. The Blueprint has 27 initial (or to-be-implemented immediately) improvements that include the following:

  • Consolidate more than 700 cell phone plans into about 10
  • Standardize civil rights training and purchases of cyber security products
  • Centralize civil rights, human resource, procurement, and property management functions

So, they were either getting gouged by suppliers (unlikely since there is negotiated pricing for the government) or the USDA’s “cyber-security” strategy was severely fragmented (and, thus, broken) enough that even finance folks could see the problem. Regardless of the source, it had to be pretty bad to make it to the top three of 27 immediate items (and called out in every sub-department press release) and even more so amongst over 160 initiatives that are being or have been put in place.

I still cannot find the details of the plan or budget analysis that went into the focus on cyber security products (links appreciated if you have them), but as private organizations continue their efforts to defend against existing and emerging threats, it might be worth taking a closer look at your own strategy and spend. Would your infosec department be included in a similar list if your organization went through such a sweeping cost-cutting analysis program? Is your portfolio of security products as optimized as it can be? Could you use a budget sweep as an opportunity to leapfrog your security capabilities (e.g. move to whitelisting vs signature-based anti-malware) vs just pressuring your existing vendors and re-negotiating contracts?

Unfortunately, the government being the government, I’m now even more concerned that the USDA may need to worry about increased infections on both the food-level and the “cyber” level.

Starting sometime mid-year in 2011, I began having more ‘stuff’ to do than even my eidetic memory could help with. It’s not that I forgot things, per se, but the ability to mentally recall and prioritize work, family, personal and other tasks finally required some external assistance and I resolved to find a GTD system by the end of January.

Being an OS X user, I have great choices out there (both of those have iOS sister-apps, too). However, I'm not just an OS X user. As I was saying to @myrcurial (and even @reillyusa) the other day, I dislike being locked in to proprietary solutions. Plus, the $120 price tag for OmniFocus (OS X + iPad) seemed like a king's ransom, especially since I am also an Android user (OmniFocus only has an iOS app) and pay for both Dropbox and various virtual hosts. Believing that I still have some usable skills left, I decided to — as @hatlessec characterized my solution — cobble something together on my own.

Once upon a time, I did maintain a .plan file (when I had sysadmin duties), but really doubted the efficacy of it and finger in the age of the modern web. The thought of machinating SQLite databases, parsing XML files or even digesting bits of JSON seemed overkill for my purposes. Searching through my Evernote clippings, my memory was drawn back to one of my favorite sites, Lifehacker, which has regular GTD coverage. After re-poking around a bit, I decided to settle on @ginatrapani’s @todotxtapps for meeting the following requirements (in order):

  • It uses a plain text file with a simple structure – (no exposition necessary…the link is a quick read and the format will become second nature after a glance)
  • It is Free (mostly) – mobile apps are ~$2.00USD each and if you need more than free Dropbox hosting and want a web interface, there are potential hosting costs. If you count your setup time as money, then add that in, too.
  • It runs on OS X, BSD, Windows & Linux – no platform lock-in
  • It has a thriving community – without being backed by a vendor (like the really #spiffy @omnigroup), a strong developer & user community is extremely important to ensure the longevity of the codebase. Todo.txt has very passionate developers and users who are very active on all fronts.
  • It is very extensible & integrable – I used @alfredapps to give me a quick OS X “GUI CLI” to the todo.sh commands. I built an Alfred keyword for my most used Todo.txt functions along with a generic one to bring up vim in a Terminal.app window for a free-form edit. Alfred’s shell-commands also give me @growlmac integration (so I get some feedback after working with tasks).

    I also integrated it with @geektool. I won’t steal the thunder from other GeekTool/Todo.txt integration posts (like this one). The GeekTool integration puts my todo’s right in front of me all the time on all my desktops.

    By storing my todo directory in @dropbox, it also makes syncing to my web site and mobile devices a snap.

    On my server, I have a simple cron job set up to e-mail me my todo's at the beginning of the day (again, so it's in front of me wherever I look).

  • It runs on iOS AND Android – again, no platform lock-in
  • There's an optional web interface – the one I linked to (there are others) is far from ideal, but it was quick to set up and has no overt security issues. Properly protected behind nginx or apache, you should have no issues if you need to have a web version handy.

So, while the setup is a bit more than just downloading two commercial apps, it has many other benefits and isn’t too much more work if you already have some of the other pieces in place. If you want more info on the Alfred scripts or any other setup component, drop me a note in the comments.

While I’ve read about many GTD solutions and seen many user-stories of how they met their GTD needs, I’d be interested in what tools you use to ‘get things done’…

Feedburner has borked the old RSS feed for the site and has completely disassociated me from it (meaning it’s no longer in my Google Feedburner admin options and they won’t let me re-claim it).

So… the new feed link is http://rud.is/b/feed/atom/.

Apologies for any inconvenience.