Scripting languages Archives

ThinkStats (by Allen B. Downey) is a good book to get you familiar with statistics (and even Python, if you’ve done some scripting in other languages).

I thought it would be interesting to present some of the examples & exercises in the book in R. Why? Well, once you’ve gone through the material in a particular chapter the “hard way”, seeing how you’d do the same thing in a language specifically designed for statistical computing should show when it’s best to use such a domain-specific language and when you might want to consider a hybrid approach. I am also hoping it helps make R a bit more accessible to folks.

You’ll still need the book and should work through the Python examples to get the most out of these posts.

I’ll try to get at least one example/exercise section up a week.

Please submit all errors, omissions or optimizations in the comments section.

The star of the show is going to be the “data frame” in most of the examples (and is in this one). Unlike the Python code in the book, most of the hard work here is figuring out how to use the data frame file reader to parse the ugly fields in the CDC data file. By using some tricks, we can approximate the “field start:length” style of the Python code but still keep the automatic reading/parsing of the R code (including implicit handling of “NA” values).

The power & simplicity of using R’s inherent ability to apply a calculation across a whole column (pregnancies$agepreg <- pregnancies$agepreg / 100) should jump out. Unfortunately, not all elements of the examples in R will be as nice or straightforward.
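As a small illustration of that column-wise arithmetic, here is a minimal sketch on a made-up three-row data frame. The values are invented for the example (they are not taken from the CDC file); only the column names mirror the ones used in this post. Note how the missing value simply stays `NA` without any special handling.

```r
# Hypothetical toy data: ages stored as hundredths of a year,
# mirroring the divide-by-100 step applied to the real data
toy <- data.frame(caseid = 1:3, agepreg = c(3316, NA, 1433))

# One expression rescales the whole column; no explicit loop is needed
toy$agepreg <- toy$agepreg / 100

# The same idea combines two columns element by element
toy$birthwgt_lb <- c(8, 7, NA)
toy$birthwgt_oz <- c(13, 14, 2)
toy$totalwgt_oz <- toy$birthwgt_lb * 16 + toy$birthwgt_oz
```

Any row with a missing input ends up with a missing result, which is usually what you want before computing summary statistics.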

You'll also notice that I cheat and use str() for displaying summary data.
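If you haven't met `str()` before, this sketch shows what it reports on a tiny, invented data frame, next to `summary()`, which is the more conventional way to get per-column statistics. The data and the `structure_lines` variable are purely illustrative.

```r
# Hypothetical toy data frame (values invented for the example)
toy <- data.frame(caseid  = c(1L, 2L, 3L),
                  agepreg = c(33.16, NA, 14.33))

# str() prints the dimensions plus one line per column
# (type and the first few values)
str(toy)

# summary() prints per-column quartiles, the mean and a count of NAs
summary(toy)

# capture.output() grabs the printed lines if you want to inspect them
structure_lines <- capture.output(str(toy))
```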

Enough explanation! Here's the code:

```
# ThinkStats in R by @hrbrmstr
# Example 1.2
# File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm

# setup a data frame that has the field start/end info

pFields <- data.frame(name  = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb', 'birthwgt_oz', 'prglength', 'outcome', 'birthord', 'agepreg', 'finalwgt'),
                      begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
                      end   = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
)

# calculate widths so we can pass them to read.fwf()

pFields$width <- pFields$end - pFields$begin + 1

# we aren't reading every field (for the book exercises)

pFields$skip <- (-c(pFields$begin[-1] - pFields$end[-nrow(pFields)] - 1, 0))

widths <- c(t(pFields[, 4:5]))
widths <- widths[widths != 0]

# read in the file

pregnancies <- read.fwf("2002FemPreg.dat", widths)

# assign column names

names(pregnancies) <- pFields$name

# divide mother's age by 100

pregnancies$agepreg <- pregnancies$agepreg / 100

# convert weight at birth from lbs/oz to total ounces

pregnancies$totalwgt_oz <- pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz

rFields <- data.frame(name  = c('caseid'),
                      begin = c(1),
                      end   = c(12)
)

rFields$width <- rFields$end - rFields$begin + 1
rFields$skip <- (-c(rFields$begin[-1] - rFields$end[-nrow(rFields)] - 1, 0))

widths <- c(t(rFields[, 4:5]))
widths <- widths[widths != 0]

respondents <- read.fwf("2002FemResp.dat", widths)
names(respondents) <- rFields$name

str(respondents)
str(pregnancies)
```
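If you don't have the CDC files handy, the width/skip trick can be tried on something much smaller. The sketch below is only an illustration: the three-line file, the `fields` layout and the column names `id` and `weight` are all made up. It relies on `read.fwf()` treating negative widths as columns to skip, and on blank numeric fields being read as `NA`.

```r
# Hypothetical fixed-width data: columns 1-3 = id, 4-5 = filler,
# columns 6-7 = weight (blank means missing)
lines <- c("001xx12",
           "002xx  ",
           "003xx07")
path <- tempfile(fileext = ".dat")
writeLines(lines, path)

# Field layout expressed as start/end positions, as in the post
fields <- data.frame(name  = c("id", "weight"),
                     begin = c(1, 6),
                     end   = c(3, 7),
                     stringsAsFactors = FALSE)

# Width of each field, and the (negated) gap that follows it
fields$width <- fields$end - fields$begin + 1
fields$skip  <- -c(fields$begin[-1] - fields$end[-nrow(fields)] - 1, 0)

# Interleave width/skip pairs and drop the zero-length gaps
widths <- c(t(fields[, c("width", "skip")]))
widths <- widths[widths != 0]

# Negative entries in widths tell read.fwf() to skip those columns
toy <- read.fwf(path, widths, col.names = fields$name)
unlink(path)
```

Here `widths` works out to 3, -2, 2: read three characters, skip two, read two. Selecting the columns by name (`c("width", "skip")`) rather than by position is just a design choice that keeps the sketch working if the layout data frame gains extra columns.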

I wanted to play with the AwesomeChartJS library and figured an interesting way to do that was to use it to track Microsoft Security Bulletins this year. While I was drawn in by just how simple it is to craft basic charts, that simplicity really only makes it useful for simple data sets. So, while I’ve produced three different views of Microsoft Security Bulletins for 2011 (to-date, and in advance of February’s Patch Tuesday), it would not be a good choice for a running per-month comparison between past years and 2011. The authors themselves admit that there are [deliberate] limitations and point folks to the most excellent flot library for more sophisticated analytics (which I may feature in March).

The library itself only works within an HTML5 environment (one of the reasons I chose it) and uses a separate <canvas> element to house each chart. After loading up the library itself in a script tag:

<script src="/b/js/AwesomeChartJS/awesomechart.js" type="application/javascript"></script>

(which is ~32K un-minified) you then declare a canvas element:


<canvas id="canvas1" width="400" height="300"></canvas>

and use some pretty straightforward JavaScript (no dependency on jQuery or other large frameworks) to do the drawing:

var mychart = new AwesomeChart('canvas1');

mychart.title = "Microsoft Security Bulletins Raw Count By Month - 2011";

mychart.data = [2, 12];

mychart.colors = ["#0000FF","#0000FF"];

mychart.labels = ["January", "February"];

mychart.draw();

It’s definitely worth a look if you have simple charting needs.

Regrettably, it looks like February is going to be a busy month for Windows administrators.

rud.is


ThinkStats … in R (including Example 1.2)

AwesomeChartJS Meets Microsoft Security Bulletins