Quick hit :: Over at the #tri blog, I’ve got a review of a relatively new bone-conducting headset by @AfterShokz.
Work & home chaos has me a bit behind on the “ThinkStats…in R” posts, but I “needed” to get some of the homebrew
kit working in Mountain Lion Developer Preview 2 (to run some network discovery tools while waiting for #4’s surgery to be done at the hospital).
Keying off the great outline by @myobie (read that first), I managed to get everything I needed cranking for homebrew
with the Xcode 4.4 Developer Preview 2 for Mountain Lion.
Move the Xcode 4.4 preview into /Applications and install the command line tools via Xcode→Preferences…→Downloads→Components. Then use xcode-select to tell the system which Xcode to use:
xcode-select -switch /Applications/Xcode.app/Contents/Developer
That should have you brewing.
After performing those steps, I was able to force an update install of nmap
that worked perfectly. As @myobie points out, it’s important to add the --use-gcc
option to your brew installs
if you experience anything behaving weirdly without it.
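As a concrete sketch, the nmap reinstall described above might look like the following (the exact brew invocation is my assumption; --use-gcc is the flag from @myobie’s outline):

```shell
# refresh formulae first, then force a reinstall of nmap;
# --use-gcc builds with gcc instead of the preview's clang
brew update
brew install --force nmap --use-gcc
```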
Drop a note below if you discover any other necessary tweaks for certain homebrew
operations in Mountain Lion Developer Preview 2.
As promised, this post is a bit more graphical, but I feel the need to stress the importance of the first few points in chapter 2 of the book (i.e., the difference between mean and average, and why variance is meaningful). These are fundamental concepts for future work.
The “pumpkin” example (2.1) gives us an opportunity to do some very basic R:
pumpkins <- c(1,1,1,3,3,591) # build a vector
mean(pumpkins) # mean (average)
var(pumpkins) # variance
sd(pumpkins) # standard deviation
(as you can see, I’m still trying to find the best way to embed R source code)
We move from pumpkins to babies for Example 2.2. (You’ll need the full source from the previous examples, which includes all the solutions in this example, to make the rest of the code snippets work.) Here, we can quickly compute and compare the standard deviations (with difference) and the means (with difference) to help us analyze the statistical significance questions in the chapter:
sd(firstbabies$prglength)
sd(notfirstbabies$prglength)
sd(firstbabies$prglength) - sd(notfirstbabies$prglength)
mean(firstbabies$prglength)
mean(notfirstbabies$prglength)
mean(firstbabies$prglength) - mean(notfirstbabies$prglength)
You’ll see the power of R’s hist function in a moment, but you should be a bit surprised by the output when you enter the following to solve Example 2.3:
mode(firstbabies$prglength)
That’s right, R does not have a built-in mode function (mode() reports the storage mode of an object). It’s pretty straightforward to compute, though:
names(sort(-table(firstbabies$prglength))[1])
(notice how “straightforward” != “simple”)
We have to use the table function to generate a table of value frequencies. The result is a one-dimensional array of frequencies, each indexed by the string representation of the value it counts. Prefixing the table with “-” negates all the frequencies (but keeps each name attached to its count), so sort orders the structure with the most frequent value first, and index “[1]” gets us the entry we’re looking for. The names function then returns the string representing the value with the highest frequency. You can see this iteratively by breaking out the code:
table(firstbabies$prglength)
str(table(firstbabies$prglength))
sort(table(firstbabies$prglength))
sort(table(firstbabies$prglength))[1] #without the "-"
sort(-table(firstbabies$prglength))[1]
names(sort(-table(firstbabies$prglength))[1])
There are a plethora of other ways to compute the mode, but this one seems to work well for my needs.
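If the negate-and-sort idiom feels opaque, here’s an equivalent sketch (with toy data standing in for firstbabies$prglength) that uses which.max to pick the first maximum of the frequency table directly:

```r
# toy stand-in for firstbabies$prglength
x <- c(39, 39, 40, 38, 39, 40, 41)
# which.max() finds the first maximum of the frequency table,
# so no negation or sorting is needed
names(which.max(table(x)))  # "39"
```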
I did debate putting the rest of this post into a separate example, but if you’ve stuck with it this far, you deserve some stats candy. It’s actually pretty tricky to do what the book does here:
So, we’ll start off with simple histogram plots of each set being compared:
hist(firstbabies$prglength)
hist(notfirstbabies$prglength)
I separated those out since hist displays the histogram immediately by default; if you just paste the lines consecutively, you’ll only see the last histogram. What does display is, well, ugly, and charts should be beautiful. It will take a bit to explain the details (in another post), but this should get you started:
par(mfrow=c(1,2))
hist(firstbabies$prglength, cex.lab=0.8, cex.axis=0.6, cex.main=0.8, las=1, col="white", ylim=c(0,3000), xlim=c(17,max(firstbabies$prglength)), breaks="Scott", main="Histogram of first babies", xlab="Weeks")
hist(notfirstbabies$prglength, cex.lab=0.8, cex.axis=0.6, cex.main=0.8, las=1, col="blue", ylim=c(0,3000), xlim=c(17,max(notfirstbabies$prglength)), breaks="Scott", main="Histogram of other babies", xlab="Weeks")
par(mfrow=c(1,1))
In the above code, we’re telling R to set up a canvas that will have one row and two plot areas. This makes it very easy to have many graphs on one canvas.
Next, the first hist call sets up some label proportions (the cex parameters), tells R to make the Y-axis labels horizontal (las=1), makes the bars white, sets sane ranges for the X & Y axes, instructs R to use the “Scott” algorithm for calculating sane bins (we’ll cover this in more detail next post) and sets sane titles and X-axis labels. Finally, we reset the canvas for the next plot.
There’s quite a bit to play with there, and you can use the “help()” command to get information on the hist and plot functions. You can set your own bin size by substituting a vector of breakpoints for “Scott”. If you have specific questions, shoot a note in the comments, but I’ll explain more about what’s going on in the next post as we add in probability histograms and start looking at the data in more detail.
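To see how a custom breaks vector changes the binning, here’s a minimal sketch with toy data (not the NSFG values); plot=FALSE lets us inspect the bins without drawing anything:

```r
# toy stand-in for prglength values (weeks)
x <- c(38, 39, 39, 40, 40, 40, 41, 42)
# an explicit breaks vector replaces breaks="Scott";
# plot=FALSE returns the binning instead of drawing it
h <- hist(x, breaks = seq(37, 43, by = 1), plot = FALSE)
h$counts  # one count per 1-week bin
```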
With 1.2 under our belts, we now move on to the example in section 1.3, which was designed to show us how to partition a larger set of data into subsets for analysis. In this case, we’re going to jump to example 1.3.2 to determine the number of live births.
While the Python loop is easy to write, the R code is even easier:
livebirths <- subset(pregnancies,outcome==1)
First: don’t let the <- throw you off. It’s just a more mathematical presentation of “=” (the assignment operator). While later versions of R support using = for assignment, it’s considered good form to continue to use the left arrow.
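A quick sketch of that equivalence:

```r
x <- 5   # idiomatic R assignment
y = 5    # also legal at the top level in modern R
x == y   # TRUE; both names hold the same value
```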
The subset function will traverse pregnancies, keeping the records (rows) that satisfy the boolean expression outcome == 1 and placing them into livebirths.
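Here’s a minimal sketch of the same filtering on a toy data frame (hypothetical values, not the NSFG data):

```r
# toy stand-in for the pregnancies data frame
df <- data.frame(outcome = c(1, 2, 1, 1), birthord = c(1, 1, 2, 1))
# keep only the rows where outcome equals 1
live <- subset(df, outcome == 1)
nrow(live)  # 3 records meet the condition
```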
You can apply any amount of field logic to the subset function, as asked for by example 1.3.3:
firstbabies <- subset(pregnancies,birthord==1 & outcome==1)
Since R was built for statistical computing, it's no surprise that to solve example 1.3.4 all we have to do is ask R to return the mean of that portion of the data frame:
mean(firstbabies$prglength)
mean(notfirstbabies$prglength)
(Here's a refresher on the basics of R data frame usage in case you skipped over that URL in the first post.)
To get the ~13hrs difference the text states, it's simple math. Just subtract the two values, multiply by 7 (days in a week) and then again by 24 (hours in a day).
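As a sketch with illustrative means (the exact values come from the mean() calls above, not from these hard-coded numbers):

```r
first <- 38.601  # illustrative mean pregnancy length (weeks)
other <- 38.523  # illustrative mean for non-first babies
# weeks -> days -> hours
(first - other) * 7 * 24  # roughly a 13-hour difference
```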
In the next post, we'll begin to tap into the more visual side of R, but for now, play around with the following source code as you finish working through chapter one of Think Stats (you can also download the book for free from Green Tea Press).
# ThinkStats in R by @hrbrmstr
# Example 1.3
# File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm
# setup a data frame that has the field start/end info
pFields <- data.frame(name = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb','birthwgt_oz','prglength', 'outcome', 'birthord', 'agepreg', 'finalwgt'),
begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
end = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
)
# calculate widths so we can pass them to read.fwf()
pFields$width <- pFields$end - pFields$begin + 1
# we aren't reading every field (for the book exercises)
pFields$skip <- (-c(pFields$begin[-1]-pFields$end[-nrow(pFields)]-1,0))
widths <- c(t(pFields[,4:5]))
widths <- widths[widths!=0]
# read in the file
pregnancies <- read.fwf("2002FemPreg.dat", widths)
# assign column names
names(pregnancies) <- pFields$name
# divide mother's age by 100
pregnancies$agepreg <- pregnancies$agepreg / 100
# convert weight at birth from lbs/oz to total ounces
pregnancies$totalwgt_oz <- pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz
rFields <- data.frame(name = c('caseid'),
begin = c(1),
end = c(12)
)
rFields$width <- rFields$end - rFields$begin + 1
rFields$skip <- (-c(rFields$begin[-1]-rFields$end[-nrow(rFields)]-1,0))
widths <- c(t(rFields[,4:5]))
widths <- widths[widths!=0]
respondents <- read.fwf("2002FemResp.dat", widths)
names(respondents) <- rFields$name
# exercise 1
# not exactly the same, but even more info is provided in the summary from str()
str(respondents)
str(pregnancies)
# for exercise 2
# use subset() on the data frames
# again, lazy use of str() for output
livebirths <- subset(pregnancies,outcome==1)
str(livebirths)
# exercise 3
firstbabies <- subset(pregnancies,birthord==1 & outcome==1)
notfirstbabies <- subset(pregnancies,birthord > 1 & outcome==1)
str(firstbabies)
str(notfirstbabies)
# exercise 4
mean(firstbabies$prglength)
mean(notfirstbabies$prglength)
hours <- (mean(firstbabies$prglength) - mean(notfirstbabies$prglength)) * 7 * 24
hours
ThinkStats (by Allen B. Downey) is a good book to get you familiar with statistics (and even Python, if you’ve done some scripting in other languages).
I thought it would be interesting to present some of the examples & exercises in the book in R. Why? Well, once you’ve gone through the material in a particular chapter the “hard way”, seeing how you’d do the same thing in a language specifically designed for statistical computing should show when it’s best to use such a domain specific language and when you might want to consider a hybrid approach. I am also hoping it helps make R a bit more accessible to folks.
You’ll still need the book and should work through the Python examples to get the most out of these posts.
I’ll try to get at least one example/exercise section up a week.
Please submit all errors, omissions or optimizations in the comments section.
The star of the show is going to be the “data frame” in most of the examples (and is in this one). Unlike the Python code in the book, most of the hard work here is figuring out how to use the data frame file reader to parse the ugly fields in the CDC data file. By using some tricks, we can approximate the “field start:length” style of the Python code but still keep the automatic reading/parsing of the R code (including implicit handling of “NA” values).
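Here’s a minimal sketch of that trick with a hypothetical two-field layout; read.fwf() treats negative widths as “skip this many characters”:

```r
# toy field spec: two fields with a 2-character gap between them
f <- data.frame(name = c("a", "b"), begin = c(1, 5), end = c(2, 7))
f$width <- f$end - f$begin + 1
# gap before each following field, negated so read.fwf() skips it
f$skip <- -c(f$begin[-1] - f$end[-nrow(f)] - 1, 0)
# interleave width/skip pairs and drop the zero-length entries
widths <- c(t(f[, c("width", "skip")]))
widths[widths != 0]  # 2 -2 3 : read 2 chars, skip 2, read 3
```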
The power & simplicity of using R’s inherent ability to apply a calculation across a whole column (pregnancies$agepreg <- pregnancies$agepreg / 100) should jump out. Unfortunately, not all elements of the examples in R will be as nice or straightforward.
You’ll also notice that I cheat and use str() for displaying summary data.
Enough explanation! Here's the code:
# ThinkStats in R by @hrbrmstr
# Example 1.2
# File format info: http://www.cdc.gov/nchs/nsfg/nsfg_cycle6.htm
# setup a data frame that has the field start/end info
pFields <- data.frame(name = c('caseid', 'nbrnaliv', 'babysex', 'birthwgt_lb','birthwgt_oz','prglength', 'outcome', 'birthord', 'agepreg', 'finalwgt'),
begin = c(1, 22, 56, 57, 59, 275, 277, 278, 284, 423),
end = c(12, 22, 56, 58, 60, 276, 277, 279, 287, 440)
)
# calculate widths so we can pass them to read.fwf()
pFields$width <- pFields$end - pFields$begin + 1
# we aren't reading every field (for the book exercises)
pFields$skip <- (-c(pFields$begin[-1]-pFields$end[-nrow(pFields)]-1,0))
widths <- c(t(pFields[,4:5]))
widths <- widths[widths!=0]
# read in the file
pregnancies <- read.fwf("2002FemPreg.dat", widths)
# assign column names
names(pregnancies) <- pFields$name
# divide mother's age by 100
pregnancies$agepreg <- pregnancies$agepreg / 100
# convert weight at birth from lbs/oz to total ounces
pregnancies$totalwgt_oz <- pregnancies$birthwgt_lb * 16 + pregnancies$birthwgt_oz
rFields <- data.frame(name = c('caseid'),
begin = c(1),
end = c(12)
)
rFields$width <- rFields$end - rFields$begin + 1
rFields$skip <- (-c(rFields$begin[-1]-rFields$end[-nrow(rFields)]-1,0))
widths <- c(t(rFields[,4:5]))
widths <- widths[widths!=0]
respondents <- read.fwf("2002FemResp.dat", widths)
names(respondents) <- rFields$name
str(respondents)
str(pregnancies)
I’ve been an unapologetic Alfred user since @hatlessec recommended it and have recently been cobbling together quick shell scripts that make life a bit easier.
The following ones – lip & rip – copy your local & remote IP addresses (respectively) to your clipboard and also display a Growl message (if you’re a Growl user).
Nothing really special about them. They are each one-liners and are easily customizable once you install them.
Download: liprip.zip
I swear to fulfill, to the best of my ability and judgment, this covenant:
I will respect the hard-fought empirical gains of those practitioners in whose steps I walk, and gladly share such knowledge as is mine with those who are to follow.
I will apply, for the benefit of those who need it, all measures [that] are required, avoiding those twin traps of FUD and solutions that are unnecessary.
I will remember that there is art to security as well as science, and that respect, sympathy, and understanding may outweigh the metasploit or other blunt instruments.
I will not be ashamed to say “I don’t know”, nor will I fail to call in my colleagues when the skills of another are needed to solve a problem.
I will respect the privacy of those I serve, for their problems are not disclosed to me that the world may know. Most especially must I tread with care in matters of NPPI, PCI & HIPAA. If it is given to me to solve a problem, all thanks. But it may also be within my power to identify problems; this awesome responsibility must be faced with great humbleness and awareness of my own frailty. Above all, I must not play at God.
I will remember that I do not treat a server, a router, an application, but a fragile system, whose problems may affect a whole company and general economic stability. My responsibility includes these related problems, if I am to provide adequately for those who need help.
I will prevent issues from occurring whenever I can, for prevention is preferable to remediation.
I will remember that I remain a member of society with special obligations to all my fellow human beings, those sound of mind and body as well as those who also need assistance.
If I do not violate this oath, may I enjoy life and art, respected while I live and remembered with affection thereafter. May I always act so as to preserve the finest traditions of my calling and may I long experience the joy of aiding those who seek my help.