Use quick formula functions in purrr::map (+ base vs tidyverse idiom comparisons/examples)

I’ve converted the vast majority of my *apply usage over to purrr functions. In an attempt to make this a quick post, I’ll refrain from going into all the benefits of the purrr package. Instead, I’ll show just one thing that’s super helpful: formula functions.

After seeing this Quartz article using a visualization to compare the frequency and volume of mass shootings, I wanted to grab the data to look at it with a stats-eye (humans are ++gd at visually identifying patterns, but we’re also ++gd at misinterpreting them, plus stats validates visual assumptions). I’m not going into that here, but will use the grabbing of the data to illustrate the formula functions. Note that there’s quite a bit of “setup” here for just one example, so I guess I kinda am attempting to shill the purrr package and the “piping tidyverse” just a tad.

If you head on over to the site with the data you’ll see you can download files for all four years. In theory, these are all individual years, but the names of the files gave me pause:

  • MST Data 2013 - 2015.csv
  • MST Data 2014 - 2015.csv
  • MST Data 2015 - 2015.csv
  • Mass Shooting Data 2016 - 2016.csv

So, they may all be individual years, but the naming consistency isn’t there and it’s better to double check than to assume.

First, we can check to see if the column names are the same (we can eyeball this since there are only four files and a small # of columns):

library(purrr)
library(readr)

dir() %>% 
  map(read_csv) %>% 
  map(colnames)

## [[1]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[2]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[3]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
## 
## [[4]]
## [1] "date"                        "name_semicolon_delimited"   
## [3] "killed"                      "wounded"                    
## [5] "city"                        "state"                      
## [7] "sources_semicolon_delimited"
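The eyeball check works fine for four files, but it can also be automated: if every file has identical column names, the list of unique `colnames()` vectors collapses to a single element. A minimal sketch with toy data frames standing in for the four read-in files (swap in the `dir() %>% map(read_csv)` pipeline for the real data):

```r
library(purrr)

# toy stand-ins for the four read-in data frames
dfs <- list(
  data.frame(date = "1/1/2013", killed = 0L, wounded = 3L),
  data.frame(date = "1/2/2013", killed = 1L, wounded = 0L)
)

# if every element has the same column names, the list of
# unique name vectors has exactly one element
same_cols <- length(unique(map(dfs, colnames))) == 1
same_cols  # TRUE here
```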

A quick inspection of the date column shows it’s in month/day/year format, and we want to know whether each file spans only one year. This is where the elegance of the formula function comes in:

library(lubridate)

dir() %>% 
  map(read_csv) %>% 
  map(~range(mdy(.$date))) # <--- the *entire* post was to show this one line ;-)

## [[1]]
## [1] "2016-01-06" "2016-07-25"
## 
## [[2]]
## [1] "2013-01-01" "2013-12-31"
## 
## [[3]]
## [1] "2014-01-01" "2014-12-29"
## 
## [[4]]
## [1] "2015-01-01" "2015-12-31"

To break that down a bit:

  • dir() returns a vector of filenames in the current directory
  • the first map() reads each of those files in and creates a list with four elements, each being a tibble (data_frame / data.frame)
  • the second map() iterates over those data frames and calls a newly created anonymous function which converts the date column to a proper Date data type then gets the range of those dates, ultimately resulting in a four element list, with each element being a two element vector of Dates
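The `~` shorthand in that second `map()` is just sugar for an anonymous function; `purrr::as_mapper()` performs the translation explicitly, which makes the equivalence easy to see with a couple of toy calls:

```r
library(purrr)

# ~range(mdy(.$date)) is shorthand for function(x) range(mdy(x$date));
# as_mapper() does that translation for any one-sided formula
dbl <- as_mapper(~ .x * 2)
dbl(21)                  # 42

# the same shorthand works inline in any map() variant
map_dbl(1:3, ~ .x ^ 2)   # 1 4 9
```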

For you “basers” out there, this is what that looks like old school style:

fils <- dir()
dfs <- lapply(fils, read.csv, stringsAsFactors=FALSE)
lapply(dfs, function(x) range(as.Date(x$date, format="%m/%d/%Y"))) # %d handles single-digit days on input; %e is platform-dependent

or

lapply(dir(), function(x) {
  df <- read.csv(x, stringsAsFactors=FALSE)
  range(as.Date(df$date, format="%m/%d/%Y"))
})

With the ~ shorthand you eliminate the function(x) { } boilerplate and get pre-defined variables (either .x or . and, if needed, .y) to compose your maps and pipes very cleanly and succinctly while still being super-readable.
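For two-input iteration the pre-defined variables extend naturally: `map2()` exposes `.x` and `.y` (and `pmap()` uses `..1`, `..2`, …). A quick sketch:

```r
library(purrr)

# .x and .y refer to the parallel elements of the two inputs
map2_chr(c("a", "b"), 1:2, ~ paste0(.x, .y))  # "a1" "b2"
```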

After performing this inspection (i.e., confirming that each file contains incidents for only a single year), we can now automate the data ingestion:

library(rvest)
library(purrr)
library(readr)
library(dplyr)
library(lubridate)

read_html("https://www.massshootingtracker.org/data") %>% 
  html_nodes("a[href^='https://docs.goo']") %>% 
  html_attr("href") %>% 
  map_df(read_csv) %>% 
  mutate(date=mdy(date)) -> shootings
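The only new wrinkle there is map_df(), which maps and then row-binds the per-element results into a single data frame (so no separate do.call(rbind, …) step is needed). A self-contained sketch of that behavior:

```r
library(purrr)
library(dplyr)

# each iteration returns a one-row tibble; map_df() binds
# them into a single three-row tibble
squares <- map_df(1:3, ~ tibble(n = .x, sq = .x ^ 2))
squares
```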

Here’s what that looks like w/o the tidyverse/piping:

library(XML)

doc <- htmlParse("http://www.massshootingtracker.org/data") # note the necessary downgrade to "http"

dfs <- xpathApply(doc, "//a[contains(@href, 'https://docs.goo')]", function(x) {
  csv <- xmlGetAttr(x, "href")
  df <- read.csv(csv, stringsAsFactors=FALSE)
  df$date <- as.Date(df$date, format="%m/%d/%Y")
  df
})

shootings <- do.call(rbind, dfs)

Even hardcore “basers” may have to admit that the piping/tidyverse version is ultimately better.

Give the purrr package a first (or second) look if you haven’t switched over to it. Type safety, readable anonymous functions and C-backed fast functional idioms will mean that your code may ultimately be purrrfect.

UPDATE #1

I received a question in the comments regarding how I came to that CSS selector for the gdocs CSV URLs, so I made a quick video of the exact steps I took. Exposition below the film.

Right-click “Inspect” in Chrome is my go-to method for finding what I’m after. This isn’t the place to dive deep into the dark art of web page spelunking, but in this case, when I saw there were four similar anchor (<a>) tags that pointed to the CSV “files”, I took the easy way out and just built a selector based on the href attribute value (or, more specifically, the characters at the start of the href attribute). However, all four ways below end up targeting the same four elements:

pg <- read_html("https://www.massshootingtracker.org/data")

html_nodes(pg, "a.btn.btn-default")
html_nodes(pg, "a[href^='https://docs.goo']")
html_nodes(pg, xpath=".//a[@class='btn btn-default']")
html_nodes(pg, xpath=".//a[contains(@href, 'https://docs.goo')]")

UPDATE #2

Following a reader suggestion, I swapped out list.files() in favour of dir() (though, as someone who detests DOS/Windows, typing that function name is super-painful).

Cover image from Data-Driven Security

9 Comments

  1. timothyjkiely

    Great post! I’m struggling with rvest a bit… in your above example, what was your strategy for determining the appropriate css for html_node? Using SelectorGadget, the css I get when I click on the download links is “.btn-default”, which works, but it’s different from your css above

    Reply
  2. Pingback: Use quick formula functions in purrr::map (+ base vs tidyverse idiom comparisons/examples) – Mubashir Qasim

      1. pssguy

        Yes, SelectorGadget is a bit flaky (and I believe cannot do copy and paste). Useful sometimes, though.

        Reply
  3. Forester

    Nice article. As a newbie to purrr, can you explain in very simple terms (like ELI5) when you use the tilde ~ and when you don’t. In the first example with colnames() you don’t use a ~ but with the range() and mdy() you do. Are there some simple rules of thumb?

    Reply
    1. hrbrmstr

      Thx for taking the time to read the post and ask a great question.

      I’ve got a few personal rules that I’m following for the “functional” side of the purrr verbs. First, I do try to make the calls “atomic”; that is, using a single function call on the functional side. I’ll use toy examples starting with map_chr(LETTERS, tolower) (tolower() is vectorized so this is truly a toy example). This applies even if the function takes additional parameters, such as map_chr(state.name, substr, start=2, stop=4) (again, not rly necessary since substr() is vectorized). I get lazy and not-name the extra parameters at times, but I think it’s important to name them for readable code.

      I have 2 personal rules for the ~ shortcut (it, essentially, makes an anonymous function … function(x) { ... } … around the bits past the ~). The first (which is the range()/mdy() example) is that it should be used with a limited number of nested parenthesis calls (2-3 max) and/or limited %>% chains (2-5). If you go beyond that another personal rule (which I break far too much) is to not have it be an anonymous function but to be a full-on named function that’s called in the non-~ way. The second is when you need to supply the . (or .x / .y) multiple times such as map_chr(LETTERS, ~sprintf("%s or %s", ., tolower(.))).

      Hopefully that helps a bit and I’ll have more examples in January.

      Reply
      1. Forester

        Thanks for the detailed explanation. I have been inspired by your opening sentence: “I’ve converted the vast majority of my *apply usage over to purrr functions.” to look at purrr again. I’ve been trying to determine the utility of purrr, and it’s been a struggle to figure out. But now, understanding that it’s a replacement for the somewhat inconsistent *apply functions, it makes more sense. Yesterday I tried map() on a multistep explicit function and it worked like a charm.

        Reply
