
Category Archives: Data Analysis

Junk Charts [adeptly noted and fixed](http://junkcharts.typepad.com/junk_charts/2015/10/is-it-worth-the-drama.html) this excessively stylized chart from the WSJ this week:

[Figure: the original WSJ avian flu chart]

Their take on it does reduce the ZOMGOSH WE ARE DOOMED! look and feel of the WSJ chart:

[Figure: Junk Charts' remake of the WSJ chart]

But, we can further reduce the drama by using a more neutral color encoding _and_ encoding both the # of outbreaks and the total size of the impacted flock populations _per week_ with a lollipop chart (and, thankfully, the USDA makes this data readily available):

library(xml2)
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(viridis)
library(ggthemes)
 
pg <- read_html("https://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/sa_detections_by_states/ct_ai_pacific_flyway/!ut/p/a1/lVNNb-IwEP0tPewx2OSbI_QDwkdBRbuQXKyJ4yTWJnYUG1D-fZ10D7RqadcHS555M_PmPRkl6IgSAWdegOZSQNW_E58stwt7PMN2NN9PHnH0_OdpE64DZ7twDSA2APzFmeL39dtV5Pf1L3i2eBjjvYMOKEEJFbrRJYqhKbkiVArNhCYVT1tou19YAZGnluSSntTwAsFrqEjJoNLldSTjioFihItctvWwxFv6zEFc4zOmGe3TiqQdURo0M62pJsBJA5TnnJK86i7Q9fwayjMU57ZPAezccnwvtdwM21aah9hcGHtuCP6Y5v_0uLHwj_S8n08XbrA2CrqhjaMHUxxMNhhH_nf1g57fdBgAtyz7MGT-ODNDnta7YLW08cpDsSEZfMli4qL9f6q2_IEsdru53xSmLejS6g1Gx5vGv6WvjB8CnxmPjp8af5ihxJNBpIqeX1HJdPgQ8VSkTmiItCxnLWtHpVQaHS-Xy-ikMhgV8oya-ncdOh23_r6E2PGqYrerD9O7u1eBlNG5/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_full_list")
 
dat <- html_table(html_nodes(pg, "table"))[[1]]
 
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         week = format(`Confirmation date`, "%Y-%U"),
         week_start = as.Date(sprintf("%s-1", week), "%Y-%U-%u") ,
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  select(week, week_start, `Flock size`) %>% 
  filter(!is.na(`Flock size`)) %>% 
  group_by(week_start) %>% 
  summarize(outbreaks=n(), 
            flock_total=sum(`Flock size`)) -> dat
 
first <- dat[2,]
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=week_start, y=outbreaks))
gg <- gg + geom_vline(xintercept=as.numeric(first$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=week_start, y=25), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=week_start, y=25), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=week_start, xend=week_start, y=0, yend=outbreaks, color=flock_total), size=0.5)
gg <- gg + geom_point(aes(size=flock_total, fill=flock_total), shape=21)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="# Outbreaks", title="Avian Flu Impact by Week (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg

[Figure: lollipop chart of weekly avian flu outbreaks & flock impact]

If we really want to see the discrete events, we can do that with our less-ZOMGOSH color scheme, too:

dat <- html_table(html_nodes(pg, "table"))[[1]]
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  filter(!is.na(`Flock size`)) %>% 
  rename(date=`Confirmation date`) %>% 
  arrange(date) -> dat
 
first <- dat[2,]
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=date, y=`Flock size`))
gg <- gg + geom_vline(xintercept=as.numeric(first$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=date, y=3000000), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=date, y=3000000), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=date, xend=date, y=0, yend=`Flock size`, color=`Flock size`), size=0.5, alpha=1)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + scale_y_continuous(label=comma)
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="Flock size", title="Avian Flu Impact (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg

[Figure: per-detection flock size chart]

Neither of those is ever going to sell any ads, tho.

I like to turn coincidence into convergence whenever possible. This weekend, a user of [cdcfluview](http://github.com/hrbrmstr/cdcfluview) had a question that caused me to notice a change in how the package was interacting with the CDC FluView API, so I updated the package to accommodate both the change and the user.

Around the same time, @recology_ tweeted:

Finally, the [2015-2016 flu season](http://www.cdc.gov/flu/about/season/flu-season-2015-2016.htm) is also fast approaching (so a three-fer!), making a CRAN leap for `cdcfluview` quite timely.

Since I can’t let @quonimous have all the `#rstats` fun & glory, I added a function and two data sets to `cdcfluview`, did the CRAN dance and it’s now [on CRAN](https://cran.rstudio.com/web/packages/cdcfluview/). I also did the github dance to have its entry in the [Web Technologies Task View](https://cran.rstudio.com/web/views/WebTechnologies.html) updated.

The new function lets you grab the XML behind the high-level [Weekly US Map: Influenza Summary Update](http://www.cdc.gov/flu/weekly/usmap.htm) (I’ll be adding a function to make a similar plot that won’t require gosh-awful Flash) and the data sets provide metadata about the composition of HHS regions and Census regions, making it easier to compose factors, add labels to maps or even segment maps/combine polygons. The existing two grab current & historical detailed national & state influenza data.
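
Here’s a minimal sketch of the new pieces in action (function and data set names per the CRAN release; check the package help index if your version differs):

library(cdcfluview)
 
# grab the data behind the weekly US map summary (the new function)
wk <- get_weekly_flu_report()
str(wk)
 
# bundled metadata for HHS & Census regions, handy for building factors,
# labeling maps or combining polygons
data(hhs_regions, package="cdcfluview")
data(census_regions, package="cdcfluview")
head(hhs_regions)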

There’s an example on github and in the [previous post introducing the package](https://rud.is/b/2015/01/10/new-r-package-cdcfluview-retrieve-flu-data-from-cdcs-fluview-portal/).

If you have any other data you need freed from the confines of the CDC FluView Flash portal, please file an issue & paste a screen shot (if you are comfortable with most browser Developer Tools views, even a dump of the request or “as cURL” URL would be awesome).

School of Data had a recent post on how to copy "every item" from a multi-page list. While their post did provide a neat hack, their "words of warning" are definitely missing some items and the overall methodology can be improved upon with some basic R scripting.

First, the technique they outlined relies heavily on how parameters are passed to and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms, not even those with a "count" popup. I can see this potentially frustrating many budding data janitors.

Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:

[Screenshot: "List of Netflix Movies and TV Shows" DataTable on AllFlicks]

(i.e. by the controls & style of the controls available) and in the source:

[Screenshot: www.allflicks.net page source showing the DataTables includes]

The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.

[Screenshot: Chrome Developer Tools showing the DataTables XHR request]

Here, I just used Chrome’s Developer Tools to look at the responses coming back from the server. That’s a pretty ugly GET request, but we can see the query parameters a bit better if we scroll down:
[Screenshot: Chrome Developer Tools showing the query string parameters]

These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.

Getting the Data With R

The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length query parameter. After that, it’s just a few lines of code:

library(httr)
library(jsonlite)
library(dplyr) # for glimpse
 
url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056"
 
resp <- GET(url)

Normally we would be able to do:

content(resp, as="parsed")

but this server did not set the Content-Type of the response well, so we have to do it by hand with the jsonlite package:

recs <- fromJSON(content(resp, as="text"))
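
If you’re curious why the parsed version balks, a quick peek at the response headers shows what the server actually claimed to be sending:

# inspect the content-type header the server returned
headers(resp)[["content-type"]]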

The recs variable is now an R list with a structure that (thankfully) fully represents the expected server response:

## List of 4
##  $ draw           : int 1
##  $ recordsTotal   : int 7448
##  $ recordsFiltered: int 7448
##  $ data           :'data.frame':  7448 obs. of  9 variables:
##   ..$ box_art  : chr [1:7448] "<img src=\"http://cdn1.nflximg.net/images/9159/12119159.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/6195/20866195.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn1.nflximg.net/images/3735/2243735.jpg\" width=\"55\" alt=\"Thumbnail\">" "<img src=\"http://cdn0.nflximg.net/images/2668/21112668.jpg\" width=\"55\" alt=\"Thumbnail\">" ...
##   ..$ title    : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ...
##   ..$ year     : chr [1:7448] "2001" "2013" "2006" "2015" ...
##   ..$ rating   : chr [1:7448] "3.3" "2.5" "3.6" "4" ...
##   ..$ category : chr [1:7448] "<a href=\"http://www.allflicks.net/category/thrillers/\">Thrillers</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" "<a href=\"http://www.allflicks.net/category/children-and-family-movies/\">Children & Family Movies</a>" ...
##   ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ...
##   ..$ cast     : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ...
##   ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ...
##   ..$ id       : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ...

We see there is a data.frame in there with the expected # of records. We can also use glimpse from dplyr to see the data table a bit better:

glimpse(recs$data)
 
## Observations: 7448
## Variables:
## $ box_art   (chr) "<img src=\"http://cdn1.nflximg.net/images/9159/12...
## $ title     (chr) "In the Bedroom", "Wolfy: The Incredible Secret", ...
## $ year      (chr) "2001", "2013", "2006", "2015", "1993", "2013", "2...
## $ rating    (chr) "3.3", "2.5", "3.6", "4", "3.5", "3.1", "3.3", "4....
## $ category  (chr) "<a href=\"http://www.allflicks.net/category/thril...
## $ available (chr) "17 May 2015", "17 May 2015", "17 May 2015", "17 M...
## $ cast      (chr) "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T...
## $ director  (chr) "Todd Field", "Éric Omond", "Mucci Fassett, Nico R...
## $ id        (chr) "60022258", "70302834", "70053695", "80028529", "8...

Now, we can use that in any R workflow or write it out as a CSV (or other format) for other workflows to use. No browsers were crashed and we have code we can run again to scrape the site (i.e. when they add more movies to the database) vs a manual cut & paste workflow.
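
For example, a minimal "save it for later" step might look like this (just a sketch: it drops the thumbnail markup and does a naive tag-strip on the category column before writing the CSV):

recs$data %>%
  select(-box_art) %>%                                  # the <img> markup isn't useful in a CSV
  mutate(category=gsub("<[^>]+>", "", category)) %>%    # strip the <a> wrapper, keep the label
  write.csv("allflicks.csv", row.names=FALSE)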

Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.

I saw a fly-by `#rstats` mention of more airplane accident data on — of all places — LinkedIn (email) today which took me to a [GitHub repo](https://github.com/philjette/CrashData) by @philjette. There’s a [web site](http://www.planecrashinfo.com/) (run, it seems, by a single dedicated person) that tracks plane crashes. Here’s a tweet from @philjette announcing it:

The repo contains the R code that scrapes the site and it’s (mostly) in old-school R and works really well. I’m collecting and conjuring many bits of R for the classes I’m teaching in the fall and thought that it would be useful to replicate @philjette’s example in modern Hadleyverse style (i.e. `dplyr`, `rvest`, etc). I even submitted a [pull request](https://github.com/philjette/CrashData/pull/1) to him with the additional version. I’ve replicated it below with some additional comments for those wanting to jump into the Hadleyverse. No shiny `ggplot2` graphs this time, I’m afraid. This is all raw code, but will hopefully be useful to those learning the modern ropes.

Just to get the setup bits out of the way, here’s all the packages I’ll be using:

library(dplyr)
library(rvest)
library(magrittr)
library(stringr)
library(lubridate)
library(pbapply)

Phil made a function to grab data for a whole year, so I did the same and gave it a default parameter of the current year (programmatically). I also tossed in some parameter checking for good measure.

The basic setup is to:

– grab the HTML for the page of a given year
– extract and format the crash dates
– extract location & operator information, which is made slightly annoying since the site separates the two with a `<br/>` tag and includes spurious newlines within a single `<td>` element
– extract aircraft type and registration (same issues as previous column)
– extract accident details, which are embedded in a highly formatted column that requires `str_match_all` to handle (well)

Some things worth mentioning:

– `data_frame` is super-helpful in not creating `factors` from the character vectors
– `bind_rows` and `bind_cols` are a nice alternative to using `data.table` functions
– I think `stringr` needs a more pipe-friendly replacement for `gsub` and, perhaps, even `ifelse` (yes, I guess I could submit a PR). The `.` just feels wrong in pipes to me, still (there’s a tiny illustrative sketch after this list)
– if you’re not using `pbapply` functions (free progress bars for everyone!) you _should_ be, especially for long scraping operations
– sometimes XPath entries can be less verbose than CSS (and easier to craft) and I have no issue mixing them in scraping code when necessary
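
Here’s the tiny sketch mentioned above. These helpers are purely illustrative (they are not part of `stringr`), but they show the argument order that makes `%>%` pipelines read more naturally:

str_gsub <- function(string, pattern, replacement) gsub(pattern, replacement, string)
na_if_eq <- function(x, value) ifelse(x == value, NA, x)
 
c(" Near Boston ", "?") %>%
  str_gsub("^\\s+|\\s+$", "") %>%
  na_if_eq("?")
## [1] "Near Boston" NA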

Here’s the new `get_data` function (_updated per comment and to also add some more hadleyverse goodness_):

#' retrieve crash data for a given year
#' defaults to current year
#' earliest year in the database is 1920
get_data <- function(year=as.numeric(format(Sys.Date(), "%Y"))) {
 
  crash_base <- "http://www.planecrashinfo.com/%d/%s.htm"
 
  if (year < 1920 | year > as.numeric(format(Sys.Date(), "%Y"))) {
    stop("year must be >=1920 and <=current year", call.=FALSE)
  }
 
  # get crash date
 
  pg <- html(sprintf(crash_base, year, year))
  pg %>%
    html_nodes("table > tr > td:nth-child(1)") %>%
    html_text() %>%
    extract(-1) %>%
    dmy() %>%
    data_frame(date=.) -> date
 
  # get location and operator
 
  loc_op <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_trim() %>%
      str_replace_all("^(Near|Off) ", "") -> loc
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[2]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") -> op
 
    data_frame(location=loc, operator=op)
 
  }))
 
  # get type & registration
 
  type_reg <- bind_rows(lapply(1:length(date), function(i) {
 
    pg %>%
      html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/preceding-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> typ
 
    pg %>% html_nodes(xpath=sprintf("//table/tr/td[3]/*/br[%d]/following-sibling::text()", i)) %>%
      html_text() %>%
      str_replace_all("(^[[:space:]]*|[[:space:]]*$|\\n)", "") %>%
      ifelse(.=="?", NA, .) -> reg
 
    data_frame(type=typ, registration=reg)
 
  }))
 
  # get fatalities
 
  pg %>% html_nodes("table > tr > td:nth-child(4)") %>%
    html_text() %>%
    str_match_all("([[:digit:]]+)/([[:digit:]]+)\\(([[:digit:]]+)\\)") %>%
    lapply(function(x) {
      data_frame(aboard=as.numeric(x[2]), fatalities=as.numeric(x[3]), ground=as.numeric(x[4]))
    }) %>%
    bind_rows %>% tail(-1) -> afg
 
  bind_cols(date, loc_op, type_reg, afg)
 
}

While that gets one year, it’s super-simple to get all crashes since 1950:

crashes <- bind_rows(pblapply(1950:2015, get_data))

Yep. That’s it. Now `crashes` contains a `data.frame` (well, `tbl_df`) of all the crashes since 1950, ready for further analysis.
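
As one quick example of "further analysis", here’s a simple roll-up of accidents & fatalities per year (a sketch using the columns built above and `lubridate`’s `year()`):

crashes %>%
  mutate(year=year(date)) %>%
  group_by(year) %>%
  summarise(accidents=n(),
            fatalities=sum(fatalities, na.rm=TRUE)) %>%
  arrange(desc(fatalities))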

For the class I’m teaching, I’ll be extending this to grab the extra details for each crash link and then performing more data science-y operations.

If you’ve got any streamlining tips or alternate ways to handle the scraping Hadleyverse-style please drop a note in the comments. Also, definitely check out Phil’s great solution, especially to compare it to this new version.

UPDATE: You can now run this as a local Shiny app by entering shiny::runGist("95ec24c1b0cb433a76a5", launch.browser=TRUE) at an R prompt (provided all the dependent libraries (below) are installed) or use it interactively over at Shiny Apps.

The impending arrival of the first real snowfall of the year in my part of Maine got me curious about what the most likely “first snow” dates are for my region. The U.S. Historical Climatology Network (USHCN) maintains [historical daily climate records](http://cdiac.ornl.gov/epubs/ndp/ushcn/daily_doc.html) for each station in each state and has data (for some stations) going back as far as the 1800’s. A quick look at their [data files](http://cdiac.ornl.gov/ftp/ushcn_daily/) indicated that they would definitely help satiate my curiosity (and make for a late night of cranking out some R code and ggplot visualizations).

To start, we’ll need a bit more than base R to get the job done:

library(pbapply)
library(data.table)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(stringi)

In all honesty, `pbapply`, `dplyr` and `stringi` are not necessary, but they definitely make life easier by (respectively) giving us:

– free progress bars for `*apply` operations,
– high efficacy data manipulation idioms, and
– a handy utility for converting strings to title case.

With setup out of the way, the first real task is to see which observer station is closest to my area. To figure that out we need to read in the station data file which is, sadly, in fixed-width format. Some stations have `#` characters in their titles, so we have to account for that when we call `read.fwf`. After reading in the station database we use a naive-but-usable distance calculation to find the closest station:

stations <- read.fwf("data/ushcn-stations.txt",
                     widths=c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                     col.names=c("coop_id", "latitude", "longitude", "elevation",
                                 "state", "name", "component_1", "component_2",
                                 "component_3", "utc_offset"),
                     colClasses=c("character", "numeric", "numeric", "numeric",
                                  "character", "character", "character", "character",
                                  "character", "character"),
                     comment.char="", strip.white=TRUE)
 
# not a great circle, but it gets the job done here
closestStation <- function(stations, lat, lon) {
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                          (stations$longitude-lon)^2))
  stations[index,]
}
 
# what's the closest station?
closestStation(stations, 43.2672, -70.8617)
 
##     coop_id latitude longitude elevation state   name component_1 component_2 component_3 utc_offset
## 633  272174    43.15    -70.95      24.4    NH DURHAM      ------      ------      ------         +5

As a Mainer, I’m not thrilled that this is the actual closest station, so we’ll also see what the closest one is in Maine:

closestStation(stations %>% filter(state=="ME"), 43.2672, -70.8617)
##    coop_id latitude longitude elevation state             name component_1 component_2 component_3 utc_offset
## 10  176905  43.6497  -70.3003      13.7    ME PORTLAND JETPORT      ------      ------      ------         +5

The analysis is easy enough to do for both, so we’ll first take a look at Durham, New Hampshire and then run the exact same process for Portland, Maine.

Despite being fixed-width, the station database was not too difficult to wrangle. The state-level files that contain the readings are another matter:

Variable | Columns | Type
-------- | ------- | ---------
COOP ID  | 1-6     | Character
YEAR     | 7-10    | Integer
MONTH    | 11-12   | Integer
ELEMENT  | 13-16   | Character
VALUE1   | 17-21   | Integer
MFLAG1   | 22      | Character
QFLAG1   | 23      | Character
SFLAG1   | 24      | Character
VALUE2   | 25-29   | Integer
MFLAG2   | 30      | Character
QFLAG2   | 31      | Character
SFLAG2   | 32      | Character
…        | …       | …
VALUE31  | 257-261 | Integer
MFLAG31  | 262     | Character
QFLAG31  | 263     | Character
SFLAG31  | 264     | Character

We have fixed-width, wide-format records with 31 days for each month, which proves the existence of truly insidious people in the world. Rather than use `read.fwf` again, we’ll take a different approach (since we ultimately need the data in long format) and use `readLines` to read in all the records from the NH data file, then filter out everything but snowfall entries from the station we’re interested in.

Next, we setup nested `lapply` calls to build a long data frame from each month then combine them all together into a single data frame:

snow <- readLines("data/state27_NH.txt")
 
snow <- grep("SNOW", snow, value=TRUE)
snow <- grep("^272174", snow, value=TRUE)
 
snow_dat <- rbindlist(pblapply(snow, function(x) {
 
  rbindlist(lapply(1:31, function(i) {
 
    # record format described here:
    # http://cdiac.ornl.gov/ftp/ushcn_daily/data_format.txt
 
    start <- 17 + (i-1)*8
 
    list(coop_id=substr(x, 1, 6),
         date=sprintf("%s-%02d-%02d", substr(x, 7, 10), as.numeric(substr(x, 11, 12)), i),
         element=substr(x, 13, 16),
         value=as.numeric(substr(x, start, start+4)),
         mflag=substr(x, start+5, start+5),
         qflag=substr(x, start+6, start+6),
         sflag=substr(x, start+7, start+7))
 
  }))
 
}))

Now, we’ll clean up the records even further by removing invalid entries (those with a `value` == `-9999`) and convert record dates to actual `Date` objects and filter out invalid dates:

snow_dat <- snow_dat %>% filter(value != -9999)
 
# since the data file has 31 days for each record regardless of whether
# that's valid or not, we take a shortcut to remove invalid dates by doing
# a vectorized Date conversion, then removing records with NA dates
 
snow_dat$date <- as.Date(snow_dat$date)
snow_dat <- snow_dat %>% filter(!is.na(date))
 
# having the year extracted is handy for filtering
snow_dat$year <- format(snow_dat$date, "%Y")

Given that Winter in the U.S. spans across two calendar years, we need a way to keep dates in January-May associated with the previous year (yes, that adds an inherent assumption that no first snow is in June, which might not hold true for Alaska). To facilitate this, we’ll convert each date to its corresponding day-of-year value, then add the number of days in the start year to those values for all dates in the first half of the year (day of year <= 180 in the code below). We really do need to do this, too, since the first snowfall lands in January-March in many states.

snow_dat$doy <- as.numeric(format(snow_dat$date, "%j"))
snow_dat$doy <- ifelse(snow_dat$doy<=180,
                       snow_dat$doy + as.numeric(format(as.Date(sprintf("%s-12-31", snow_dat$year)), "%j")),
                       snow_dat$doy)
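
A quick sanity check of that wrap-around, using a mid-January date:

# Jan 15 is day 15 of its year; adding that year's 365 days pushes the value to 380,
# so it sorts after December dates from the same winter season
as.numeric(format(as.Date("2015-01-15"), "%j")) +
  as.numeric(format(as.Date("2015-12-31"), "%j"))
## [1] 380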

Now, the fun begins. We use (mostly) `dplyr` to extract the first snowfall day from each year, then make a dot-line plot from the data:

first <- snow_dat %>%
  filter(value>0) %>%                           # ignore 0 values
  filter(date>=as.Date("1950-01-01")) %>%       # start at 1950 (arbitrary)
  merge(stations, by="coop_id", all.x=TRUE) %>% # merge station details
  group_by(coop_id, year) %>%                   # group by station and year
  arrange(doy) %>%                              # sort by our munged day of year
  filter(row_number(doy) == 1) %>%              # grab the first entry by group
  select(name, state, date, value, doy)         # we only need some variables
 
title_1 <- sprintf("First observed snowfall (historical) at %s, %s", stri_trans_totitle(unique(first$name)), unique(first$state))
 
gg <- ggplot(first, aes(y=year, x=doy))
gg <- gg + geom_segment(aes(xend=min(first$doy)-20, yend=year), color="#9ecae1", size=0.25)
gg <- gg + geom_point(aes(color=coop_id), shape=8, size=3, color="#3182bd")
gg <- gg + geom_text(aes(label=format(date, "%b-%d")), size=3, hjust=-0.2)
gg <- gg + scale_x_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + labs(x=NULL, y=NULL, title=title_1)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(color="#08306b"))
by_year <- gg

While that will help us see the diversity across years, we have to do quite a bit of eye tracking to get the most likely date range for the first snowfall, so we’ll add a boxplot into the mix and use `summary` to figure out the quartiles (for labeling the chart) for the actual date values:

wx_range <- summary(as.Date(format(first$date, "2013-%m-%d")))
names(wx_range) <- NULL
min_wx <- gsub("2013-", "", wx_range[2])
max_wx <- gsub("2013-", "", wx_range[5])
 
title_2 <- sprintf("Most likely first snowfall will be between %s & %s", min_wx, max_wx)
 
gg <- ggplot(first %>% mutate(name="0000"), aes(name, doy))
gg <- gg + geom_boxplot(fill="#3182bd", color="#08306b", outlier.colour="#08306b")
gg <- gg + scale_y_continuous(expand=c(0, 0),
                              limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title=title_2)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_line(color="white"))
gg <- gg + theme(axis.text.y=element_text(color="white"))
gg <- gg + theme(plot.title=element_text(size=11))
box_wx <- gg

Finally, we’ll combine them together to get the finished product:

grid.arrange(by_year, box_wx, nrow=2, heights=unit(c(0.9, 0.1), "npc"))

[Figure: first-snowfall charts (by year + boxplot) for Durham, NH]

And, do the same for Portland:

[Figure: first-snowfall charts (by year + boxplot) for Portland, ME]

There are many more analyses and visualizations that can be performed on these data sets, but be wary when creating them as I’ve seen a few files with fixed-width formatting errors and have also noticed missing records for some observer stations.

You can find the complete, commented code up on [github](https://github.com/hrbrmstr/snowfirst).

The Washington Post did another great story+vis, this time on how states are [spending seized assets](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/).

According to their sub-head:

>_Since 2008, about 5,400 police agencies have spent $2.5 billion in proceeds from cash and property seized under federal civil forfeiture laws. Police suspected the assets were linked to crime, although in 81 percent of cases no one was indicted._

Their interactive visualization lets you drill down into each state to examine the spending in each category. Since the WaPo team made the [data available](http://www.washingtonpost.com/wp-srv/special/investigative/asset-seizures/data/all.json) [JSON], I thought it might be interesting to take a look at a comparison across states (i.e. who are the “big spenders” of this seized hoard). Here’s a snippet of the JSON:

{"states": [
  {
  "st": "AK",
  "stn": "Alaska",
  "total": 8470032,
  "cats":
     [{ "weapons": 1649832, 
     "electronicSurv": 402490, 
     "infoRewards": 760730, 
     "travTrain": 848128, 
     "commPrograms": 121664, 
     "salaryOvertime": 776766, 
     "other": 1487613, 
     "commComp": 1288439, 
     "buildImprov": 1134370 }],
  "agencies": [
     {
     "aid": "AK0012700",
     "aname": "Airport Police & Fire Ted Stevens Anch Int'L Arpt",
     "total": 611553,
     "cats":
        [{ "weapons": 214296, "travTrain": 44467, "other": 215464, "commComp": 127308, "buildImprov": 10019 }]
     },
     {
     "aid": "AK0010100",
     "aname": "Anchorage Police Department",
     "total": 3961497,
     "cats":
        [{ "weapons": 1104777, "electronicSurv": 94741, "infoRewards": 743230, "travTrain": 409474, "salaryOvertime": 770709, "other": 395317, "commComp": 249220, "buildImprov": 194029 }]
     },

Getting the data was easy (in R, of course!). Let’s setup the packages we’ll need:

library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(grid)
library(statebins)
library(gridExtra)
library(RColorBrewer) # brewer.pal() is used in the statebins section below

We also need `jsonlite`, but only to parse the data (which I’ve downloaded locally), so we’ll just do that in one standalone line:

data <- jsonlite::fromJSON("all.json", simplifyVector=FALSE)

It’s not fair (or valid) to just compare totals since some states have a larger population than others, so we’ll show the data twice, once in raw totals and once with a per-capita lens. For that, we’ll need population data:

pop <- read.csv("http://www.census.gov/popest/data/state/asrh/2013/files/SCPRC-EST2013-18+POP-RES.csv", stringsAsFactors=FALSE)
colnames(pop) <- c("sumlev", "region", "divison", "state", "stn", "pop2013", "pop18p2013", "pcntest18p")
pop$stn <- gsub(" of ", " Of ", pop$stn)

We have to fix the `District of Columbia` since the WaPo data capitalizes the `Of`.

Now we need to extract the agency data. This is really straightforward with some help from the `data.table` package:

agencies <- rbindlist(lapply(data$states, function(x) {
  rbindlist(lapply(x$agencies, function(y) {
    data.table(st=x$st, stn=x$stn, aid=y$aid, aname=y$aname, rbindlist(y$cats))
  }), fill=TRUE)
}), fill=TRUE)

The `rbindlist` `fill` option is super-handy in the event we have varying columns (and, we do in this case). It’s also wicked-fast.
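
A tiny, self-contained illustration of what `fill=TRUE` buys us when the column sets don’t line up:

rbindlist(list(data.table(a=1, b=2),
               data.table(a=3, c=4)), fill=TRUE)
##    a  b  c
## 1: 1  2 NA
## 2: 3 NA  4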

Now, we use some `dplyr` and `tidyr` to integrate the population information and summarize our data:

c_st <- agencies %>%
  merge(pop[,5:6], all.x=TRUE, by="stn") %>%
  gather(category, value, -st, -stn, -pop2013, -aid, -aname) %>%
  group_by(st, category, pop2013) %>%
  summarise(total=sum(value, na.rm=TRUE), per_capita=sum(value, na.rm=TRUE)/pop2013) %>%
  select(st, category, total, per_capita)

Let’s use a series of bar charts to compare state against state. We’ll do the initial view with just raw totals (there are 9 charts, so the graphic scrolls a bit):

# hack to order the bars, via kohske : http://stackoverflow.com/a/5414445/1457051 #####
 
c_st <- transform(c_st, category2=factor(paste(st, category)))
c_st <- transform(c_st, category2=reorder(category2, rank(-total)))
 
# pretty names #####
 
levels(c_st$category) <- c("Weapons", "Travel, training", "Other",
                           "Communications, computers", "Building improvements",
                           "Electronic surveillance", "Information, rewards",
                           "Salary, overtime", "Community programs")
gg <- ggplot(c_st, aes(x=category2, y=total))
gg <- gg + geom_bar(stat="identity", aes(fill=category))
gg <- gg + scale_y_continuous(labels=dollar)
gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2)
gg <- gg + facet_wrap(~category, scales = "free", ncol=1)
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(size=15, face="bold"))
gg <- gg + theme(panel.margin=unit(2, "lines"))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg

Comparison of Spending Category by State (raw totals)


There are definitely a few, repeating “big spenders” in that view, but is that the _real_ story? Let’s take another look, but factoring in state population:

# change bar order to match the per-capita calculation #####
 
c_st <- transform(c_st, category2=reorder(category2, rank(-per_capita)))
 
# per-capita bar plot #####
 
gg <- ggplot(c_st, aes(x=category2, y=per_capita))
gg <- gg + geom_bar(stat="identity", aes(fill=category))
gg <- gg + scale_y_continuous(labels=dollar)
gg <- gg + scale_x_discrete(labels=c_st$st, breaks=c_st$category2)
gg <- gg + facet_wrap(~category, scales = "free", ncol=1)
gg <- gg + labs(x="", y="")
gg <- gg + theme_bw()
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(size=15, face="bold"))
gg <- gg + theme(panel.margin=unit(2, "lines"))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position="none")
gg

Comparison of Spending Category by State (per-capita)


That certainly changes things! Alaska, West Virginia, and D.C. definitely stand out for “Weapons”, “Other” & “Information”, respectively (what’s Rhode Island hiding in “Other”?!) and the “top 10” in each category are very different from the raw-totals view. We can look at this per-capita view with the `statebins` package as well:

st_pl <- vector("list", 1+length(unique(c_st$category)))
 
j <- 0
for (i in unique(c_st$category)) {
  j <- j + 1
  st_pl[[j]] <- statebins_continuous(c_st[c_st$category == i, ], state_col="st", value_col="per_capita") +
    scale_fill_gradientn(labels=dollar, colours=brewer.pal(6, "PuBu"), name=i) +
    theme(legend.key.width=unit(2, "cm"))
}
st_pl[[1+length(unique(c_st$category))]] <- list(ncol=1)
 
grid.arrange(st_pl[[1]], st_pl[[2]], st_pl[[3]],
             st_pl[[4]], st_pl[[5]], st_pl[[6]],
             st_pl[[7]], st_pl[[8]], st_pl[[9]], ncol=3)

Per-capita “Statebins” view of WaPo Seizure Data

(Doing this exercise also showed me I need to add some flexibility to the `statebins` package).

The [full code in this gist](https://gist.github.com/hrbrmstr/27b8f44f573539dc2971) shows how to build a top-level category data table (along with the rest of the code in this post). I may spin this data up into an interactive D3 visualization in the next week or two (as I think it might work better than large faceted bar charts), so stay tuned!

A huge thank you to the WaPo team for making data available to others. Go forth and poke at it with your own questions and see what you can come up with (perhaps comparing by area of state)!

Rick Wicklin (@[RickWicklin](https://twitter.com/RickWicklin)) made a
recent post to the SAS blog on
[An exploratory technique for visualizing the distributions of 100
variables](http://blogs.sas.com/content/iml/). It’s a very succinct tutorial on both the power of
boxplots and how to make them in SAS (of course). I’m not one to let R
be “out-boxed”, so I threw together a quick re-creation of his example,
mostly as tutorial for any nascent R folks that come across it. (As an
aside, I catch Rick’s and other cool, non-R stuff via the [Stats
Blogs](http://www.statsblogs.com/) blog aggregator.)

The R implementation (syntax notwithstanding) is extremely similar.
First, we’ll need some packages to assist with data reshaping and pretty
plotting:

library(reshape2)
library(ggplot2)

Then, we setup a list so we can pick from the same four distributions
and set the random seed to make this example reproducible:

dists <- c(rnorm, rexp, rlnorm, runif)
 
set.seed(1492)

Now, we generate a data frame of the `100` variables with `1,000`
observations, normalized from `0`-`1`:

many_vars <- data.frame(sapply(1:100, function(x) {
 
  # generate 1,000 random samples
  tmp <- sample(dists, 1)[[1]](1000)
 
  # normalize them to be between 0 & 1
  (tmp - min(tmp)) / (max(tmp) - min(tmp))
 
}))

The `sapply` iterates over the numbers `1` through `100`, passing each
number into a function. Each iteration samples an object from the
`dists` list (which are actual R functions) and then calls the function,
telling it to generate `1,000` samples and normalize the result to be
values between `0` & `1`. By default, R will generate column names that
begin with `X`:

str(many_vars[1:5]) # show the structure of the first 5 cols
 
## 'data.frame':    1000 obs. of  5 variables:
##  $ X1: num  0.1768 0.4173 0.5111 0.0319 0.0644 ...
##  $ X2: num  0.217 0.275 0.596 0.785 0.825 ...
##  $ X3: num  0.458 0.637 0.115 0.468 0.469 ...
##  $ X4: num  0.5186 0.0358 0.5927 0.1138 0.1514 ...
##  $ X5: num  0.2855 0.0786 0.2193 0.433 0.9634 ...

We’re going to plot the boxplots, sorted by the third quartile (just
like in Rick’s example), so we’ll calculate their rank and then use
those ranks (shortly) to order a factor variable:

ranks <- names(sort(rank(sapply(colnames(many_vars), function(x) {
  as.numeric(quantile(many_vars[,x], 0.75))
}))))

There’s a lot going on in there. We pass the column names from the
`many_vars` data frame to a function that will return the quantile we
want. Since `sapply` preserves the names we passed in as well as the
values, we extract them (via `names`) after we rank and sort the named
vector, giving us a character vector in the order we’ll need:

str(ranks)
 
##  chr [1:100] "X29" "X8" "X92" "X43" "X11" "X52" "X34" ...

Just like in the SAS post, we’ll need to reshape the data into [long
format from wide
format](http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/),
which we can do with `melt`:

many_vars_m <- melt(as.matrix(many_vars))
 
str(many_vars_m)
 
## 'data.frame':    100000 obs. of  3 variables:
##  $ Var1 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2 : Factor w/ 100 levels "X1","X2","X3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value: num  0.1768 0.4173 0.5111 0.0319 0.0644 ...

And, now we’ll use our ordered column names to ensure that our boxplots
will be presented in the right order (it would be in alpha order if
not). Factor variables in R are space-efficient and allow for handy
manipulations like this (amongst other things). By default,
`many_vars_m$Var2` was in alpha order and this call just re-orders that
factor.

many_vars_m$Var2 <- factor(many_vars_m$Var2, ranks)
 
str(many_vars_m)
 
## 'data.frame':    100000 obs. of  3 variables:
##  $ Var1 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2 : Factor w/ 100 levels "X29","X8","X92",..: 24 24 24 24 24 24 24 24 24 24 ...
##  $ value: num  0.1768 0.4173 0.5111 0.0319 0.0644 ...

Lastly, we plot all our hard work (click/touch for larger version):

gg <- ggplot(many_vars_m, aes(x=Var2, y=value))
gg <- gg + geom_boxplot(fill="#BDD7E7", notch=TRUE, outlier.size=1)
gg <- gg + labs(x="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle=-45, hjust=0.001, size=5))
gg

[Figure: boxplots of all 100 variables, sorted by third quartile]

Here’s the program in its entirety:

library(reshape2)
library(ggplot2)
 
dists <- c(rnorm, rexp, rlnorm, runif)
 
set.seed(1)
many_vars <- data.frame(sapply(1:100, function(x) {
  tmp <- sample(dists, 1)[[1]](1000)
  (tmp - min(tmp)) / (max(tmp) - min(tmp))
}))
 
ranks <- names(sort(rank(sapply(colnames(many_vars), function(x) {
  as.numeric(quantile(many_vars[,x], 0.75))
}))))
 
many_vars_m <- melt(as.matrix(many_vars))
 
many_vars_m$Var2 <- factor(many_vars_m$Var2, ranks)
 
gg <- ggplot(many_vars_m, aes(x=Var2, y=value))
gg <- gg + geom_boxplot(fill="steelblue", notch=TRUE, outlier.size=1)
gg <- gg + labs(x="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle=-45, hjust=0.001))
gg

I tweaked the boxplot, using a
[notch](https://sites.google.com/site/davidsstatistics/home/notched-box-plots)
and making the outliers take up fewer pixels.

I’m definitely in agreement with Rick that this is an excellent way to
compare many distributions.

Bonus points for the commenter who shows code to color the bars by which distribution generated them!
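
One possible sketch (definitely not the only answer): keep track of which distribution generated each column while sampling, then map that to the boxplot fill.

library(reshape2)
library(ggplot2)
 
dists <- c(rnorm=rnorm, rexp=rexp, rlnorm=rlnorm, runif=runif)
 
set.seed(1492)
which_dist <- sample(names(dists), 100, replace=TRUE)
 
# same normalization as before, but we remember the generating distribution
many_vars <- data.frame(sapply(which_dist, function(d) {
  tmp <- dists[[d]](1000)
  (tmp - min(tmp)) / (max(tmp) - min(tmp))
}))
colnames(many_vars) <- paste0("X", 1:100)
 
ranks <- names(sort(rank(sapply(colnames(many_vars), function(x) {
  as.numeric(quantile(many_vars[,x], 0.75))
}))))
 
many_vars_m <- melt(as.matrix(many_vars))
many_vars_m$Var2 <- factor(many_vars_m$Var2, ranks)
many_vars_m$dist <- which_dist[match(many_vars_m$Var2, colnames(many_vars))]
 
gg <- ggplot(many_vars_m, aes(x=Var2, y=value, fill=dist))
gg <- gg + geom_boxplot(notch=TRUE, outlier.size=1)
gg <- gg + labs(x="")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.text.x=element_text(angle=-45, hjust=0.001, size=5))
gg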

I shot a quick post over at the [Data Driven Security blog](http://bit.ly/1hyqJiT) explaining how to separate Twitter data gathering from R code via the Ruby `t` ([github repo](https://github.com/sferik/t)) command. Using `t` frees R code from having to be a Twitter processor and lets the analyst focus on analysis and visualization, plus you can use `t` as a substitute for Twitter GUIs if you’d rather play at the command-line:

$ t timeline ddsecblog
   @DDSecBlog
   Monitoring Credential Dumps Plus Using Twitter As a Data Source http://t.co/ThYbjRI9Za
 
   @DDSecBlog
   Nice intro to R + stats // Data Analysis and Statistical Inference free @datacamp_com course
   http://t.co/FC44FF9DSp
 
   @DDSecBlog
   Very accessible paper & cool approach to detection // Nazca: Detecting Malware Distribution in
   Large-Scale Networks http://t.co/fqrSaFvUK2
 
   @DDSecBlog
   Start of a new series by new contributing blogger @spttnnh! // @AlienVault rep db Longitudinal
   Study Part 1 : http://t.co/XM7m4zP0tr
 
   ...

The DDSec post shows how to mine the well-formatted output from the @dumpmon Twitter bot to visualize dump trends over time, and has the code in-line and over at the [DDSec github repo](https://github.com/ddsbook/blog/blob/master/extra/src/R/dumpmon.R) [R].
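
For instance (hedging on the exact flags; `t help timeline` has the authoritative list), something like `t timeline @dumpmon -n 1000 --csv > dumpmon.csv` at the shell leaves R with a plain CSV to chew on:

# read the CSV that t wrote out; the column names come from t itself
dumps <- read.csv("dumpmon.csv", stringsAsFactors=FALSE)
str(dumps)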