Skip navigation

Category Archives: R

With 1.2 under our belts, we go now to the example in section 1.3 which was designed to show us how to partition a larger set of data into subsets for analysis. In this case, we’re going to jump to example 1.3.2 to determine the number of live births.

While the Python loop is easy to write, the R code is even easier:

  1. livebirths <- subset(pregnancies,outcome==1)

First: don’t let the <- throw you off. It's just a more mathematical presentation of "=" (the assignment operator). While later versions of R support using = for assignment operations, it's considered good form to continue to use the left arrow.

The subset function will traverse pregnancies, looking for fields (variables) that meet the boolean expression outcome == 1 and place all those records into livebirths.

You can apply any amount of field logic to the subset function, as asked for by example 1.3.3:

  1. firstbabies <- subset(pregnancies,birthord==1 & outcome==1)

Since R was built for statistical computing, it's no surprise that to solve example 1.3.4 all we have to do is ask R to return the mean of that portion of the data frame:

  1. mean(firstbabies$prglength)
  2. mean(notfirstbabies$prglength)

(Here's a refresher on the basics of R data frame usage in case you skipped over that URL in the first post.)

To get the ~13hrs difference the text states, it's simple math. Just subtract the two values, multiply by 7 (days in a week) and then again by 24 (hours in a day).

In the next post, we'll begin to tap into the more visual side of R, but for now, play around with the following source code as you finish working through chapter one of Think Stats (you can also download the book for free from Green Tea Press).

  1. # ThinkStats in R by @hrbrmstr
  2. # Example 1.3
  3. # File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm
  4.  
  5. # setup a data frame that has the field start/end info
  6.  
  7. pFields <- data.frame(name  = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb','birthwgt_oz','prglength', 'outcome', 'birthord',  'agepreg',  'finalwgt'), 
  8.                       begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423), 
  9.                       end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440) 
  10. ) 
  11.  
  12. # calculate widtds so we can pass them to read.fwf()
  13.  
  14. pFields$width <- pFields$end - pFields$begin + 1 
  15.  
  16. # we aren't reading every field (for the book exercises)
  17.  
  18. pFields$skip <-  (-c(pFields$begin[-1]-pFields$end[-nrow(pFields)]-1,0)) 
  19.  
  20. widths <- c(t(pFields[,4:5])) 
  21. widths <- widths[widths!=0] 
  22.  
  23. # read in the file
  24.  
  25. pregnancies <- read.fwf("2002FemPreg.dat", widths) 
  26.  
  27. # assign column names
  28.  
  29. names(pregnancies) <- pFields$name 
  30.  
  31. # divide mother's age by 100
  32.  
  33. pregnancies$agepreg <- pregnancies$agepreg / 100
  34.  
  35. # convert weight at birth from lbs/oz to total ounces
  36.  
  37. pregnancies$totalwgt_oz = pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz
  38.  
  39. rFields <- data.frame(name  = c('caseid'), 
  40.                       begin = c(1), 
  41.                       end   = c(12) 
  42. ) 
  43.  
  44. rFields$width <- rFields$end - rFields$begin + 1 
  45. rFields$skip <-  (-c(rFields$begin[-1]-rFields$end[-nrow(rFields)]-1,0)) 
  46.  
  47. widths <- c(t(rFields[,4:5])) 
  48. widths <- widths[widths!=0] 
  49.  
  50. respondents <- read.fwf("2002FemResp.dat", widths) 
  51. names(respondents) <- rFields$name
  52.  
  53. # exercise 1
  54. # not exactly the same, but even more info is provided in the summary from str()
  55.  
  56. str(respondents)
  57. str(pregnancies)
  58.  
  59. # for exercise 2
  60. # use subset() on the data frames
  61. # again, lazy use of str() for output
  62.  
  63. str(livebirths)
  64.  
  65. livebirths <- subset(pregnancies,outcome==1)
  66.  
  67. # exercise 3
  68.  
  69. firstbabies <- subset(pregnancies,birthord==1 & outcome==1)
  70. notfirstbabies <- subset(pregnancies,birthord > 1 & outcome==1)
  71.  
  72. str(firstbabies)
  73. str(notfirstbabies)
  74.  
  75. # exercise 4
  76.  
  77. mean(firstbabies$prglength)
  78. mean(notfirstbabies$prglength)
  79.  
  80.  
  81. hours = (mean(firstbabies$prglength) - mean(notfirstbabies$prglength)) * 7 * 24 
  82. hours

ThinkStats (by Allen B. Downey) is a good book to get you familiar with statistics (and even Python, if you’ve done some scripting in other languages).

I thought it would be interesting to present some of the examples & exercises in the book in R. Why? Well, once you’ve gone through the material in a particular chapter the “hard way”, seeing how you’d do the same thing in a language specifically designed for statistical computing should show when it’s best to use such a domain specific language and when you might want to consider a hybrid approach. I am also hoping it helps make R a bit more accessible to folks.

You’ll still need the book and should work through the Python examples to get the most out of these posts.

I’ll try to get at least one example/exercise section up a week.

Please submit all errors, omissions or optimizations in the comments section.

The star of the show is going to be the “data frame” in most of the examples (and is in this one). Unlike the Python code in the book, most of the hard work here is figuring out how to use the data frame file reader to parse the ugly fields in the CDC data file. By using some tricks, we can approximate the “field start:length” style of the Python code but still keep the automatic reading/parsing of the R code (including implicit handling of “NA” values).

The power & simplicity of using R’s inherent ability to apply a calculation across a whole column (pregnancies$agepreg <- pregnancies$agepreg / 100) should jump out. Unfortunately, not all elements of the examples in R will be as nice or straightforward.

You'll also notice that I cheat and use str() for displaying summary data.

Enough explanation! Here's the code:

  1. # ThinkStats in R by @hrbrmstr
  2. # Example 1.2
  3. # File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm
  4.  
  5. # setup a data frame that has the field start/end info
  6.  
  7. pFields <- data.frame(name  = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb','birthwgt_oz','prglength', 'outcome', 'birthord',  'agepreg',  'finalwgt'), 
  8.                       begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423), 
  9.                       end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440) 
  10. ) 
  11.  
  12. # calculate widths so we can pass them to read.fwf()
  13.  
  14. pFields$width <- pFields$end - pFields$begin + 1 
  15.  
  16. # we aren't reading every field (for the book exercises)
  17.  
  18. pFields$skip <-  (-c(pFields$begin[-1]-pFields$end[-nrow(pFields)]-1,0)) 
  19.  
  20. widths <- c(t(pFields[,4:5])) 
  21. widths <- widths[widths!=0] 
  22.  
  23. # read in the file
  24.  
  25. pregnancies <- read.fwf("2002FemPreg.dat", widths) 
  26.  
  27. # assign column names
  28.  
  29. names(pregnancies) <- pFields$name 
  30.  
  31. # divide mother's age by 100
  32.  
  33. pregnancies$agepreg <- pregnancies$agepreg / 100
  34.  
  35. # convert weight at birth from lbs/oz to total ounces
  36.  
  37. pregnancies$totalwgt_oz = pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz
  38.  
  39. rFields <- data.frame(name  = c('caseid'), 
  40.                       begin = c(1), 
  41.                       end   = c(12) 
  42. ) 
  43.  
  44. rFields$width <- rFields$end - rFields$begin + 1 
  45. rFields$skip <-  (-c(rFields$begin[-1]-rFields$end[-nrow(rFields)]-1,0)) 
  46.  
  47. widths <- c(t(rFields[,4:5])) 
  48. widths <- widths[widths!=0] 
  49.  
  50. respondents <- read.fwf("2002FemResp.dat", widths) 
  51. names(respondents) <- rFields$name
  52.  
  53. str(respondents)
  54. str(pregnancies)