I saw a fly-by `#rstats` mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. It seems there’s a [web site](http://www.planecrashinfo.com/) (run by what seems to be a single human) that tracks plane crashes. Here’s a tweet from @philjette announcing it:
The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.
Just to get the setup bits out of the way, here’s all the packages I’ll be using:
library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)
Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.
The basic setup is to:
– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site separates the two values with a `<br>` and includes spurious newlines within a single `<td>` element
– extract aircraft type and registration (same issues as previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)
Some things worth mentioning:
– `data_frame` is super-helpful in not creating `factor`s from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifelse` (yes, I guess I could submit a PR). The `.` still feels wrong to me in pipes
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary
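On that pipe-friendliness point, `stringr`'s `str_replace_all` and `str_trim` already slot into pipes without needing the `.` placeholder. A standalone sketch of the same location-cleanup idiom used in the scraper (the sample strings are made up for illustration):

```r
library(stringr)
library(magrittr)

# strip newlines, trim whitespace, then drop "Near "/"Off " prefixes
cleaned <- c(" Near Moscow, Russia\n", "Off the coast of Brazil") %>%
  str_replace_all("\\n", "") %>%
  str_trim() %>%
  str_replace_all("^(Near|Off) ", "")
cleaned
```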
Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):
#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {
crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"
if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
stop("year must be >=1920 and <=current year", call.=FALSE)
}
# get crash date
pg <- html(sprintf(crash_base, year, year))
pg %>%
html_nodes("table > tr > td:nth-child(1)") %>%
html_text() %>%
extract(-1) %>%
dmy() %>%
data_frame(date=.) -> date
# get location and operator
loc_op <- bind_rows(lapply(1:length(date), function(i) {
pg %>%
html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
html_text() %>%
str_trim() %>%
str_replace_all("^(Near|Off) ", "") -> loc
pg %>%
html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
html_text() %>%
str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op
data_frame(location=loc, operator=op)
}))
# get type & registration
type_reg <- bind_rows(lapply(1:length(date), function(i) {
pg %>%
html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
html_text() %>%
str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
ifelse(.=="?", NA, .) -> typ
pg %>% html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
html_text() %>%
str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
ifelse(.=="?", NA, .) -> reg
data_frame(type=typ, registration=reg)
}))
# get fatalities
pg %>% html_nodes("table > tr > td:nth-child(4)") %>%
html_text() %>%
str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
lapply(function(x) {
data_frame(aboard=as.numeric(x[2]), fatalities=as.numeric(x[3]), ground=as.numeric(x[4]))
}) %>%
bind_rows %>% tail(-1) -> afg
bind_cols(date, loc_op, type_reg, afg)
}
While that gets one year, it’s super-simple to get all crashes since 1950:
crashes <- bind_rows(pblapply(1950:2015, get_data))
Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.
For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.
If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.
UPDATE: You can now run this as a local Shiny app by entering `shiny::runGist("95ec24c1b0cb433a76a5", launch.browser=TRUE)` at an R prompt (provided all the dependent libraries (below) are installed) or use it interactively over at Shiny Apps.
The impending arrival of the first real snowfall of the year in my part of Maine got me curious about what the most likely “first snow” dates are for my region. The U.S. Historical Climatology Network (USHCN) maintains [historical daily climate records](http://cdiac.ornl.gov/epubs/ndp/ushcn/daily_doc.html) for each station in each state and has data (for some stations) going back as far as the 1800’s. A quick look at their [data files](http://cdiac.ornl.gov/ftp/ushcn_daily/) indicated that they would definitely help satiate my curiosity (and make for a late night of cranking out some R code and ggplot visualizations).
To start, we’ll need a bit more than base R to get the job done:
library(pbapply)
library(data.table)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(stringi)
In all honesty, `pbapply`, `dplyr` and `stringi` are not necessary, but they definitely make life easier by (respectively) giving us:
– free progress bars for `*apply` operations,
– high efficacy data manipulation idioms, and
– a handy utility for converting strings to title case.
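That last one matters because the station names in the USHCN database are ALL CAPS; `stri_trans_totitle` fixes that in one call:

```r
library(stringi)

nm <- stri_trans_totitle("PORTLAND JETPORT")
nm
# "Portland Jetport"
```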
With setup out of the way, the first real task is to see which observer station is closest to my area. To figure that out we need to read in the station data file which is, sadly, in fixed-width format. Some stations have `#` characters in their titles, so we have to account for that when we call `read.fwf`. After reading in the station database we use a naive-but-usable distance calculation to find the closest station:
stations <- read.fwf("data/ushcn-stations.txt",
widths=c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
col.names=c("coop_id", "latitude", "longitude", "elevation",
"state", "name", "component_1", "component_2",
"component_3", "utc_offset"),
colClasses=c("character", "numeric", "numeric", "numeric",
"character", "character", "character", "character",
"character", "character"),
comment.char="", strip.white=TRUE)
# not a great circle, but it gets the job done here
closestStation <- function(stations, lat, lon) {
index <- which.min(sqrt((stations$latitude-lat)^2 +
(stations$longitude-lon)^2))
stations[index,]
}
# what's the closest station?
closestStation(stations, 43.2672, -70.8617)
## coop_id latitude longitude elevation state name component_1 component_2 component_3 utc_offset
## 633 272174 43.15 -70.95 24.4 NH DURHAM ------ ------ ------ +5
As a Mainer, I’m not thrilled that this is the actual, closest station, so we’ll also see what the closest one is in Maine:
closestStation(stations %>% filter(state=="ME"), 43.2672, -70.8617)
## coop_id latitude longitude elevation state name component_1 component_2 component_3 utc_offset
## 10 176905 43.6497 -70.3003 13.7 ME PORTLAND JETPORT ------ ------ ------ +5
The analysis is easy enough to do for both, so we’ll first take a look at Durham, New Hampshire then do the exact same evaluation for Portland, Maine.
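The comment on `closestStation` admits it isn’t computing great-circle distance; if precision matters, a haversine version is easy to drop in. This is a base R sketch (the helper name and the usual 6371 km Earth radius are my own choices, not part of the original post):

```r
# great-circle distance in km via the haversine formula (base R only)
haversine_km <- function(lat1, lon1, lat2, lon2, r=6371) {
  to_rad <- pi/180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat/2)^2 + cos(lat1*to_rad) * cos(lat2*to_rad) * sin(dlon/2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}

# distance from the query point to the Durham, NH station (~15 km)
haversine_km(43.2672, -70.8617, 43.15, -70.95)
```

Swapping this in for the Euclidean term in `closestStation` changes nothing at this scale, but it matters for larger search radii.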
Despite being fixed-width, the station database was not too difficult to wrangle. The state-level files that contain the readings are another matter:
| Variable | Columns | Type |
|----------|---------|-----------|
| COOP ID | 1-6 | Character |
| YEAR | 7-10 | Integer |
| MONTH | 11-12 | Integer |
| ELEMENT | 13-16 | Character |
| VALUE1 | 17-21 | Integer |
| MFLAG1 | 22 | Character |
| QFLAG1 | 23 | Character |
| SFLAG1 | 24 | Character |
| VALUE2 | 25-29 | Integer |
| MFLAG2 | 30 | Character |
| QFLAG2 | 31 | Character |
| SFLAG2 | 32 | Character |
| … | … | … |
| VALUE31 | 257-261 | Integer |
| MFLAG31 | 262 | Character |
| QFLAG31 | 263 | Character |
| SFLAG31 | 264 | Character |
We have fixed-width, wide-format records with 31 days for each month, which proves the existence of truly insidious people in the world. Rather than use `read.fwf` again, we’ll take a different approach (since we ultimately need the data in long format) and use `readLines` to read in all the records from the NH data file, then filter out everything but snowfall entries from the station we’re interested in.
Next, we setup nested `lapply` calls to build a long data frame from each month then combine them all together into a single data frame:
snow <- readLines("data/state27_NH.txt")
snow <- grep("SNOW", snow, value=TRUE)
snow <- grep("^272174", snow, value=TRUE)
snow_dat <- rbindlist(pblapply(snow, function(x) {
rbindlist(lapply(1:31, function(i) {
# record format described here:
# http://cdiac.ornl.gov/ftp/ushcn_daily/data_format.txt
start <- 17 + (i-1)*8
list(coop_id=substr(x, 1, 6),
date=sprintf("%s-%02d-%02d", substr(x, 7, 10), as.numeric(substr(x, 11, 12)), i),
element=substr(x, 13, 16),
value=as.numeric(substr(x, start, start+4)),
mflag=substr(x, start+5, start+5),
qflag=substr(x, start+6, start+6),
sflag=substr(x, start+7, start+7))
}))
}))
Now, we’ll clean up the records even further by removing invalid entries (those with a `value` == `-9999`) and convert record dates to actual `Date` objects and filter out invalid dates:
snow_dat <- snow_dat %>% filter(value != -9999)
# since the data file has 31 days for each record regardless of whether
# that's valid or not, we do a shortcut to remove invalid dates by doing a
# vectorized Date conversion, then removing records with NA dates
snow_dat$date <- as.Date(snow_dat$date)
snow_dat <- snow_dat %>% filter(!is.na(date))
# having the year extracted is handy for filtering
snow_dat$year <- format(snow_dat$date, "%Y")
Given that Winter in the U.S. spans across two calendar years, we need a way to keep dates in January-May associated with the previous year (yes, that adds an inherent assumption that no first snow is in June, which might not hold true for Alaska). To facilitate this, we’ll convert each date to its corresponding day of year value then add the number of total days in the start year to those values for all months <= May. We really do need to do this, too, since there are many cases where the first snowfall will be in January-March for many states.
snow_dat$doy <- as.numeric(format(snow_dat$date, "%j"))
snow_dat$doy <- ifelse(snow_dat$doy<=180,
snow_dat$doy + as.numeric(format(as.Date(sprintf("%s-12-31", snow_dat$year)), "%j")),
snow_dat$doy)
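Concretely, a mid-January snowfall gets pushed past all of December's values so it sorts with the correct winter (a quick base R check, using 2015's non-leap 365-day year):

```r
# a 2015-01-15 snow: raw day-of-year 15, wrapped to 15 + 365 = 380,
# so it lands after any December 2014 value (max doy 365)
doy <- as.numeric(format(as.Date("2015-01-15"), "%j"))
wrapped <- doy + as.numeric(format(as.Date("2015-12-31"), "%j"))
wrapped
# 380
```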
Now, the fun begins. We use (mostly) `dplyr` to extract the first snowfall day from each year, then make a dot-line plot from the data:
first <- snow_dat %>%
filter(value>0) %>% # ignore 0 values
filter(date>=as.Date("1950-01-01")) %>% # start at 1950 (arbitrary)
merge(stations, by="coop_id", all.x=TRUE) %>% # merge station details
group_by(coop_id, year) %>% # group by station and year
arrange(doy) %>% # sort by our munged day of year
filter(row_number(doy) == 1) %>% # grab the first entry by group
select(name, state, date, value, doy) # we only need some variables
title_1 <- sprintf("First observed snowfall (historical) at %s, %s", stri_trans_totitle(unique(first$name)), unique(first$state))
gg <- ggplot(first, aes(y=year, x=doy))
gg <- gg + geom_segment(aes(xend=min(first$doy)-20, yend=year), color="#9ecae1", size=0.25)
gg <- gg + geom_point(aes(color=coop_id), shape=8, size=3, color="#3182bd")
gg <- gg + geom_text(aes(label=format(date, "%b-%d")), size=3, hjust=-0.2)
gg <- gg + scale_x_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + labs(x=NULL, y=NULL, title=title_1)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(color="#08306b"))
by_year <- gg
While that will help us see the diversity across years, we have to do quite a bit of eye tracking to get the most likely date range for the first snowfall, so we’ll add a boxplot into the mix and use `summary` to figure out the quartiles (for labeling the chart) for the actual date values:
wx_range <- summary(as.Date(format(first$date, "2013-%m-%d")))
names(wx_range) <- NULL
min_wx <- gsub("2013-", "", wx_range[2])
max_wx <- gsub("2013-", "", wx_range[5])
title_2 <- sprintf("Most likely first snowfall will be between %s & %s", min_wx, max_wx)
gg <- ggplot(first %>% mutate(name="0000"), aes(name, doy))
gg <- gg + geom_boxplot(fill="#3182bd", color="#08306b", outlier.colour="#08306b")
gg <- gg + scale_y_continuous(expand=c(0, 0),
limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title=title_2)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_line(color="white"))
gg <- gg + theme(axis.text.y=element_text(color="white"))
gg <- gg + theme(plot.title=element_text(size=11))
box_wx <- gg
Finally, we’ll combine them together to get the finished product:
grid.arrange(by_year, box_wx, nrow=2, heights=unit(c(0.9, 0.1), "npc"))
And, do the same for Portland:
There are many more analyses and visualizations that can be performed on these data sets, but be wary when creating them as I’ve seen a few files with fixed-width formatting errors and have also noticed missing records for some observer stations.
You can find the complete, commented code up on [github](https://github.com/hrbrmstr/snowfirst).
The Washington Post did another great story+vis, this time on states [spending seized assets](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/).
According to their sub-head:
>_Since 2008, about 5,400 police agencies have spent $2.5 billion in proceeds from cash and property seized under federal civil forfeiture laws. Police suspected the assets were linked to crime, although in 81 percent of cases no one was indicted._
Their interactive visualization lets you drill down into each state to examine the spending in each category. Since the WaPo team made the [data available](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/data/all.json) [JSON] I thought it might be interesting to take a look at a comparison across states (i.e. who are the “big spenders” of this seized hoard). Here’s a snippet of the JSON:
{"states": [
{
"st": "AK",
"stn": "Alaska",
"total": 8470032,
"cats":
[{ "weapons": 1649832,
"electronicSurv": 402490,
"infoRewards": 760730,
"travTrain": 848128,
"commPrograms": 121664,
"salaryOvertime": 776766,
"other": 1487613,
"commComp": 1288439,
"buildImprov": 1134370 }],
"agencies": [
{
"aid": "AK0012700",
"aname": "Airport Police & Fire Ted Stevens Anch Int'L Arpt",
"total": 611553,
"cats":
[{ "weapons": 214296, "travTrain": 44467, "other": 215464, "commComp": 127308, "buildImprov": 10019 }]
},
{
"aid": "AK0010100",
"aname": "Anchorage Police Department",
"total": 3961497,
"cats":
[{ "weapons": 1104777, "electronicSurv": 94741, "infoRewards": 743230, "travTrain": 409474, "salaryOvertime": 770709, "other": 395317, "commComp": 249220, "buildImprov": 194029 }]
},
Getting the data was easy (in R, of course!). Let’s setup the packages we’ll need:
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(grid)
library(statebins)
library(gridExtra)
We also need `jsonlite`, but only to parse the data (which I’ve downloaded locally), so we’ll just do that in one standalone line:
data <- jsonlite::fromJSON("all.json", simplifyVector=FALSE)
It’s not fair (or valid) to just compare totals since some states have a larger population than others, so we’ll show the data twice, once in raw totals and once with a per-capita lens. For that, we’ll need population data:
pop <- read.csv("http://www.census.gov/popest/data/state/asrh/2013/files/SCPRC-EST2013-18+POP-RES.csv", stringsAsFactors=FALSE)
colnames(pop) <- c("sumlev", "region", "division", "state", "stn", "pop2013", "pop18p2013", "pcntest18p")
pop$stn <- gsub(" of ", " Of ", pop$stn)
We have to fix the `District of Columbia` since the WaPo data capitalizes the `Of`.
Now we need to extract the agency data. This is really straightforward with some help from the `data.table` package:
agencies <- rbindlist(lapply(data$states, function(x) {
rbindlist(lapply(x$agencies, function(y) {
data.table(st=x$st, stn=x$stn, aid=y$aid, aname=y$aname, rbindlist(y$cats))
}), fill=TRUE)
}), fill=TRUE)
The `rbindlist` `fill` option is super-handy in the event we have varying columns (and, we do in this case). It’s also wicked-fast.
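A quick illustration of what `fill=TRUE` buys us (the `a`/`b`/`c` columns here are made up; any column missing from a piece comes back as `NA`):

```r
library(data.table)

# second piece lacks "b", first lacks "c"; fill=TRUE pads the gaps with NA
dt <- rbindlist(list(data.table(a=1, b=2),
                     data.table(a=3, c=4)), fill=TRUE)
dt
```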
Now, we use some `dplyr` and `tidyr` to integrate the population information and summarize our data (OK, we cheat and use `melt`, but some habits are hard to break):
c_st <- agencies %>%
merge(pop[,5:6], all.x=TRUE, by="stn") %>%
gather(category, value, -st, -stn, -pop2013, -aid, -aname) %>%
group_by(st, category, pop2013) %>%
summarise(total=sum(value, na.rm=TRUE), per_capita=sum(value, na.rm=TRUE)/pop2013) %>%
select(st, category, total, per_capita)
Let’s use a series of bar charts to compare state-against state. We’ll do the initial view with just raw totals. There are 9 charts, so this graphic scrolls a bit and you can select it to make it larger:
# hack to order the bars, by kohske : http://stackoverflow.com/a/5414445/1457051 #####
c_st <- transform(c_st, category2=factor(paste(st, category)))
c_st <- transform(c_st, category2=reorder(category2, rank(-total)))
# pretty names #####
levels(c_st$category) <- c("Weapons", "Travel, training", "Other",
"Communications, computers", "Building improvements",
"Electronic surveillance", "Information, rewards",
"Salary, overtime", "Community programs")
gg <- ggplot(c_st, aes(x=category2, y=total))
gg <- gg + geom_bar(stat="identity", aes(fill=category))
gg <- gg + scale_y_continuous(labels=dollar)
gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2)
gg <- gg + facet_wrap(~category, scales = "free", ncol=1)
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(size=15, face="bold"))
gg <- gg + theme(panel.margin=unit(2, "lines"))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg
Comparison of Spending Category by State (raw totals)

There are definitely a few, repeating “big spenders” in that view, but is that the _real_ story? Let’s take another look, but factoring in state population:
# change bar order to match per-capita calculation #####
c_st <- transform(c_st, category2=reorder(category2, rank(-per_capita)))
# per-capita bar plot #####
gg <- ggplot(c_st, aes(x=category2, y=per_capita))
gg <- gg + geom_bar(stat="identity", aes(fill=category))
gg <- gg + scale_y_continuous(labels=dollar)
gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2)
gg <- gg + facet_wrap(~category, scales = "free", ncol=1)
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(size=15, face="bold"))
gg <- gg + theme(panel.margin=unit(2, "lines"))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg
Comparison of Spending Category by State (per-capita)

That certainly changes things! Alaska, West Virginia, and D.C. definitely stand out for “Weapons”, “Other” & “Information”, respectively (what’s Rhode Island hiding in “Other”?!) and the “top 10” in each category are very different from the raw totals view. We can look at this per-capita view with the `statebins` package as well:
st_pl <- vector("list", 1+length(unique(c_st$category)))
j <- 0
for (i in unique(c_st$category)) {
j <- j + 1
st_pl[[j]] <- statebins_continuous(c_st[category==i,], state_col="st", value_col="per_capita") +
scale_fill_gradientn(labels=dollar, colours=RColorBrewer::brewer.pal(6, "PuBu"), name=i) +
theme(legend.key.width=unit(2, "cm"))
}
st_pl[[1+length(unique(c_st$category))]] <- list(ncol=1)
grid.arrange(st_pl[[1]], st_pl[[2]], st_pl[[3]],
st_pl[[4]], st_pl[[5]], st_pl[[6]],
st_pl[[7]], st_pl[[8]], st_pl[[9]], ncol=3)
Per-capita “Statebins” view of WaPo Seizure Data
(Doing this exercise also showed me I need to add some flexibility to the `statebins` package).
The [code gist](https://gist.github.com/hrbrmstr/27b8f44f573539dc2971) shows how to build a top-level category data table (along with the rest of the code in this post). I may spin this data up into an interactive D3 visualization in the next week or two (as I think it might work better than large faceted bar charts), so stay tuned!
A huge thank you to the WaPo team for making data available to others. Go forth and poke at it with your own questions and see what you can come up with (perhaps comparing by area of state)!
Rick Wicklin (@[RickWicklin](https://twitter.com/RickWicklin)) made a
recent post to the SAS blog on
[An exploratory technique for visualizing the distributions of 100
variables](http://blogs.sas.com/content/iml/). It’s a very succinct tutorial on both the power of
boxplots and how to make them in SAS (of course). I’m not one to let R
be “out-boxed”, so I threw together a quick re-creation of his example,
mostly as tutorial for any nascent R folks that come across it. (As an
aside, I catch Rick’s and other cool, non-R stuff via the [Stats
Blogs](http://www.statsblogs.com/) blog aggregator.)
The R implementation (syntax notwithstanding) is extremely similar.
First, we’ll need some packages to assist with data reshaping and pretty
plotting:
library(reshape2)
library(ggplot2)
Then, we setup a list so we can pick from the same four distributions
and set the random seed to make this example reproducible:
dists <- c(rnorm, rexp, rlnorm, runif)
set.seed(1492)
Now, we generate a data frame of the `100` variables with `1,000`
observations, normalized from `0`-`1`:
many_vars <- data.frame(sapply(1:100, function(x) {
# generate 1,000 random samples
tmp <- sample(dists, 1)[[1]](1000)
# normalize them to be between 0 & 1
(tmp - min(tmp)) / (max(tmp) - min(tmp))
}))
The `sapply` iterates over the numbers `1` through `100`, passing each
number into a function. Each iteration samples an object from the
`dists` list (which are actual R functions) and then calls the function,
telling it to generate `1,000` samples and normalize the result to be
values between `0` & `1`. By default, R will generate column names that
begin with `X`:
str(many_vars[1:5]) # show the structure of the first 5 cols
## 'data.frame': 1000 obs. of 5 variables:
## $ X1: num 0.1768 0.4173 0.5111 0.0319 0.0644 ...
## $ X2: num 0.217 0.275 0.596 0.785 0.825 ...
## $ X3: num 0.458 0.637 0.115 0.468 0.469 ...
## $ X4: num 0.5186 0.0358 0.5927 0.1138 0.1514 ...
## $ X5: num 0.2855 0.0786 0.2193 0.433 0.9634 ...
We’re going to plot the boxplots, sorted by the third quantile (just
like in Rick’s example), so we’ll calculate their rank and then use
those ranks (shortly) to order a factor variable:
ranks <- names(sort(rank(sapply(colnames(many_vars), function(x) {
as.numeric(quantile(many_vars[,x], 0.75))
}))))
There’s a lot going on in there. We pass the column names from the
`many_vars` data frame to a function that will return the quantile we
want. Since `sapply` preserves the names we passed in as well as the
values, we extract them (via `names`) after we rank and sort the named
vector, giving us a character vector in the order we’ll need:
str(ranks)
## chr [1:100] "X29" "X8" "X92" "X43" "X11" "X52" "X34" ...
Just like in the SAS post, we’ll need to reshape the data into [long
format from wide
format](http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/),
which we can do with `melt`:
many_vars_m <- melt(as.matrix(many_vars))
str(many_vars_m)
## 'data.frame': 100000 obs. of 3 variables:
## $ Var1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2 : Factor w/ 100 levels "X1","X2","X3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ value: num 0.1768 0.4173 0.5111 0.0319 0.0644 ...
And, now we’ll use our ordered column names to ensure that our boxplots
will be presented in the right order (it would be in alpha order if
not). Factor variables in R are space-efficient and allow for handy
manipulations like this (amongst other things). By default,
`many_vars_m$Var2` was in alpha order and this call just re-orders that
factor.
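In miniature, that re-ordering looks like this (the values don't move; only the level order, which plotting and sorting respect, changes):

```r
x <- factor(c("b", "a", "c"))
levels(x)   # alphabetical by default: "a" "b" "c"

x2 <- factor(x, levels=c("c", "b", "a"))
levels(x2)  # explicit order: "c" "b" "a"

# underlying values are untouched
identical(as.character(x), as.character(x2))  # TRUE
```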
many_vars_m$Var2 <- factor(many_vars_m$Var2, ranks)
str(many_vars_m)
## 'data.frame': 100000 obs. of 3 variables:
## $ Var1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2 : Factor w/ 100 levels "X29","X8","X92",..: 24 24 24 24 24 24 24 24 24 24 ...
## $ value: num 0.1768 0.4173 0.5111 0.0319 0.0644 ...
Lastly, we plot all our hard work (click/touch for larger version):
gg <- ggplot(many_vars_m, aes(x=Var2, y=value))
gg <- gg + geom_boxplot(fill="#BDD7E7", notch=TRUE, outlier.size=1)
gg <- gg + labs(x="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle=-45, hjust=0.001, size=5))
gg

Here’s the program in its entirety:
library(reshape2)
library(ggplot2)
dists <- c(rnorm, rexp, rlnorm, runif)
set.seed(1492)
many_vars <- data.frame(sapply(1:100, function(x) {
tmp <- sample(dists, 1)[[1]](1000)
(tmp - min(tmp)) / (max(tmp) - min(tmp))
}))
ranks <- names(sort(rank(sapply(colnames(many_vars), function(x) {
as.numeric(quantile(many_vars[,x], 0.75))
}))))
many_vars_m <- melt(as.matrix(many_vars))
many_vars_m$Var2 <- factor(many_vars_m$Var2, ranks)
gg <- ggplot(many_vars_m, aes(x=Var2, y=value))
gg <- gg + geom_boxplot(fill="steelblue", notch=TRUE, outlier.size=1)
gg <- gg + labs(x="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle=-45, hjust=0.001))
gg
I tweaked the boxplot, using a
[notch](https://sites.google.com/site/davidsstatistics/home/notched-box-plots)
and making the outliers take up fewer pixels.
I’m definitely in agreement with Rick that this is an excellent way to
compare many distributions.
Bonus points for the commenter who shows code to color the bars by which distribution generated them!
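In the spirit of that bonus: here's one way to do it (a sketch, not the canonical answer, and the `picks`/`dist` names are my own). The trick is to sample the distribution *names* rather than the functions themselves, so each column remembers its generator, which we can then map onto `fill`:

```r
library(reshape2)
library(ggplot2)

set.seed(1492)
dists <- list(norm=rnorm, exp=rexp, lnorm=rlnorm, unif=runif)
picks <- sample(names(dists), 100, replace=TRUE)  # which dist made each column

many_vars <- data.frame(sapply(seq_along(picks), function(i) {
  tmp <- dists[[picks[i]]](1000)
  (tmp - min(tmp)) / (max(tmp) - min(tmp))
}))

many_vars_m <- melt(as.matrix(many_vars))
# Var2's levels follow column order, so the level index recovers the column number
many_vars_m$dist <- picks[as.numeric(many_vars_m$Var2)]

gg <- ggplot(many_vars_m, aes(x=Var2, y=value, fill=dist))
gg <- gg + geom_boxplot(notch=TRUE, outlier.size=1)
gg <- gg + labs(x="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle=-45, hjust=0.001, size=5))
gg
```

Sorting by the third quantile first (as above) still works; just reorder the factor before adding the `dist` column.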
I shot a quick post over at the [Data Driven Security blog](http://bit.ly/1hyqJiT) explaining how to separate Twitter data gathering from R code via the Ruby `t` ([github repo](https://github.com/sferik/t)) command. Using `t` frees R code from having to be a Twitter processor and lets the analyst focus on analysis and visualization, plus you can use `t` as a substitute for Twitter GUIs if you’d rather play at the command-line:
$ t timeline ddsecblog
@DDSecBlog
Monitoring Credential Dumps Plus Using Twitter As a Data Source http://t.co/ThYbjRI9Za
@DDSecBlog
Nice intro to R + stats // Data Analysis and Statistical Inference free @datacamp_com course
http://t.co/FC44FF9DSp
@DDSecBlog
Very accessible paper & cool approach to detection // Nazca: Detecting Malware Distribution in
Large-Scale Networks http://t.co/fqrSaFvUK2
@DDSecBlog
Start of a new series by new contributing blogger @spttnnh! // @AlienVault rep db Longitudinal
Study Part 1 : http://t.co/XM7m4zP0tr
...
The DDSec post shows how to mine the well-formatted output from the @dumpmon Twitter bot to visualize dump trends over time:
and has the code in-line and over at the [DDSec github repo](https://github.com/ddsbook/blog/blob/master/extra/src/R/dumpmon.R) [R].