I’ve converted the vast majority of my *apply usage over to purrr functions. In an attempt to make this a quick post, I’ll refrain from going into all the benefits of the purrr package. Instead, I’ll show just one thing that’s super helpful: formula functions.
After seeing this Quartz article using a visualization to compare the frequency and volume of mass shootings, I wanted to grab the data to look at it with a stats-eye (humans are ++gd at visually identifying patterns, but we’re also ++gd at misinterpreting them, plus stats validates visual assumptions). I’m not going into that here, but will use the grabbing of the data to illustrate the formula functions. Note that there’s quite a bit of “setup” here for just one example, so I guess I kinda am attempting to shill the purrr package and the “piping tidyverse” just a tad.
If you head on over to the site with the data you’ll see you can download files for all four years. In theory, these are all individual years, but the names of the files gave me pause:
MST Data 2013 - 2015.csv
MST Data 2014 - 2015.csv
MST Data 2015 - 2015.csv
Mass Shooting Data 2016 - 2016.csv
So, they may all be individual years, but the naming consistency isn’t there and it’s better to double check than to assume.
First, we can check to see if the column names are the same (we can eyeball this since there are only four files and a small # of columns):
library(purrr)
library(readr)
dir() %>%
  map(read_csv) %>%
  map(colnames)
## [[1]]
## [1] "date" "name_semicolon_delimited"
## [3] "killed" "wounded"
## [5] "city" "state"
## [7] "sources_semicolon_delimited"
##
## [[2]]
## [1] "date" "name_semicolon_delimited"
## [3] "killed" "wounded"
## [5] "city" "state"
## [7] "sources_semicolon_delimited"
##
## [[3]]
## [1] "date" "name_semicolon_delimited"
## [3] "killed" "wounded"
## [5] "city" "state"
## [7] "sources_semicolon_delimited"
##
## [[4]]
## [1] "date" "name_semicolon_delimited"
## [3] "killed" "wounded"
## [5] "city" "state"
## [7] "sources_semicolon_delimited"
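Eyeballing is fine for four small files; for a larger pile of them, a programmatic version of the same check is easy enough. This is just a sketch (the pattern filter is my own addition here, to keep any non-CSV files out of the pipeline):
library(purrr)
library(readr)

# read every CSV's column names once, then confirm they all match the first file's
cols <- dir(pattern = "\\.csv$") %>%
  map(read_csv) %>%
  map(colnames)

all(map_lgl(cols, identical, cols[[1]]))  # TRUE if every file has the same columns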
A quick inspection of the date column shows it’s in month/day/year format and we want to know if each file only spans one year. This is where the elegance of the formula function comes in:
library(lubridate)
dir() %>%
  map(read_csv) %>%
  map(~range(mdy(.$date))) # <--- the *entire* post was to show this one line ;-)
## [[1]]
## [1] "2016-01-06" "2016-07-25"
##
## [[2]]
## [1] "2013-01-01" "2013-12-31"
##
## [[3]]
## [1] "2014-01-01" "2014-12-29"
##
## [[4]]
## [1] "2015-01-01" "2015-12-31"
To break that down a bit:
- dir() returns a vector of filenames in the current directory
- the first map() reads each of those files in and creates a list with four elements, each being a tibble (data_frame/data.frame)
- the second map() iterates over those data frames and calls a newly created anonymous function which converts the date column to a proper Date data type then gets the range of those dates, ultimately resulting in a four-element list, with each element being a two-element vector of Dates (the short sketch right after this list spells out what that anonymous function looks like)
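To make the anonymous-function bit concrete, here’s a small equivalence sketch (assuming the four CSVs are in the working directory): everything after the ~ becomes the body of an anonymous function, with . (or .x) standing in for the current element.
library(purrr)
library(readr)
library(lubridate)

dfs <- dir() %>% map(read_csv)            # the list of four data frames from step two

# these two calls do exactly the same thing
map(dfs, ~range(mdy(.$date)))             # formula shortcut
map(dfs, function(x) range(mdy(x$date)))  # explicit anonymous function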
For you “basers” out there, this is what that looks like old school style:
fils <- dir()
dfs <- lapply(fils, read.csv, stringsAsFactors=FALSE)
lapply(dfs, function(x) range(as.Date(x$date, format="%m/%e/%Y")))
or
lapply(dir(), function(x) {
  df <- read.csv(x, stringsAsFactors=FALSE)
  range(as.Date(df$date, format="%m/%e/%Y"))
})
You eliminate the function(x) { } and get pre-defined vars (either .x or . and, if needed, .y) to compose your maps and pipes very cleanly and succinctly, while still being super-readable.
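A tiny illustrative sketch of those pre-defined vars (not tied to the shootings data):
library(purrr)

# one input: . (or .x) is the current element
map_chr(c("a", "b", "c"), ~toupper(.x))

# two parallel inputs: .x and .y
map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)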
After performing this inspection (i.e. confirming that each file does contain only incidents for a single year), we can now automate the data ingestion:
library(rvest)
library(purrr)
library(readr)
library(dplyr)
library(lubridate)
read_html("https://www.massshootingtracker.org/data") %>%
html_nodes("a[href^='https://docs.goo']") %>%
html_attr("href") %>%
map_df(read_csv) %>%
mutate(date=mdy(date)) -> shootings
Here’s what that looks like w/o the tidyverse/piping:
library(XML)
doc <- htmlParse("http://www.massshootingtracker.org/data") # note the necessary downgrade to "http"
dfs <- xpathApply(doc, "//a[contains(@href, 'https://docs.goo')]", function(x) {
  csv <- xmlGetAttr(x, "href")
  df <- read.csv(csv, stringsAsFactors=FALSE)
  df$date <- as.Date(df$date, format="%m/%e/%Y")
  df
})
shootings <- do.call(rbind, dfs)
Even hardcore “basers” may have to admit that the piping/tidyverse version is ultimately better.
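Whichever version you prefer, a quick sanity check on the combined shootings frame is worth the extra line or two (a sketch; output omitted):
library(dplyr)
library(lubridate)

glimpse(shootings)           # column types and a peek at the values

shootings %>%
  count(year = year(date))   # one row per year, with incident counts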
Give the purrr package a first (or second) look if you haven’t switched over to it. Type safety, readable anonymous functions and C-backed fast functional idioms will mean that your code may ultimately be purrrfect.
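On the type-safety point, here’s a quick contrast with base R (again, just a sketch, not tied to the shootings data):
library(purrr)

# sapply() decides its return type from the data: a matrix here, a list there
sapply(list(1:2, 3:5), range)     # 2x2 matrix
sapply(list(1:2, 3:5), identity)  # list (lengths differ, so no simplification)

# the typed map_*() variants always return what their name promises, or fail loudly
map_dbl(1:3, ~ .x * 2)            # always a double vector
# map_dbl(1:3, ~ letters[.x])     # would error: can't coerce character to double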
UPDATE #1
I received a question in the comments regarding how I came to that CSS selector for the gdocs CSV URLs, so I made a quick video of the exact steps I took. Exposition below the film.
Right-click “Inspect” in Chrome is my go-to method for finding what I’m after. This isn’t the place to dive deep into the dark art of web page spelunking, but in this case, when I saw there were four similar anchor (<a>) tags that pointed to the CSV “files”, I took the easy way out and just built a selector based on the href attribute value (or, more specifically, the characters at the start of the href attribute). However, all four ways below end up targeting the same four elements:
pg <- read_html("https://www.massshootingtracker.org/data")
html_nodes(pg, "a.btn.btn-default")
html_nodes(pg, "a[href^='https://docs.goo']")
html_nodes(pg, xpath=".//a[@class='btn btn-default']")
html_nodes(pg, xpath=".//a[contains(@href, 'https://docs.goo')]")
UPDATE #2
Due to:
@hrbrmstr haha yes. Only basers use list.files() ?
— Hadley Wickham (@hadleywickham) July 28, 2016
I swapped out list.files() in favour of dir() (though, as someone who detests DOS/Windows, typing that function name is super-painful).
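dir() and list.files() do the same thing (dir() is just the shorter name), and either will happily take a pattern argument, which is a reasonable extra guard for the pipelines above (my addition, not in the original code):
dir(pattern = "\\.csv$")          # only the CSV files
list.files(pattern = "\\.csv$")   # identical result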
8 Comments
Great post! I’m struggling with rvest a bit… in your above example, what was your strategy for determining the appropriate css for html_node? Using SelectorGadget, the css I get when I click on the download links is “.btn-default”, which works, but it’s different from your css above
Did a quick screencast and posted it and an update at the end of the post.
Great article! By the way, the best way I’ve found to find CSS targets is to use Selector Gadget (http://selectorgadget.com).
Aye. SG is a great tool but I prefer manual inspection (with both Developer Tools and View Source).
Yes, Selector Gadget is a bit flaky (and I believe cannot do copy and paste). Useful sometimes, though.
Nice article. As a newbie to purrr, can you explain in very simple terms (like ELI5) when you use the tilde ~ and when you don’t. In the first example with colnames() you don’t use a ~ but with the range() and mdy() you do. Are there some simple rules of thumb?
Thx for taking the time to read the post and ask a great question.
I’ve got a few personal rules that I’m following for the “functional” side of the purrr verbs. First, I do try to make the calls “atomic”; that is, using a single function call on the functional side. I’ll use toy examples starting with map_chr(LETTERS, tolower) (tolower() is vectorized so this is truly a toy example). This applies even if the function takes additional parameters, such as map_chr(state.name, substr, start=2, stop=4) (again, not rly necessary since substr() is vectorized). I get lazy and not-name the extra parameters at times, but I think it’s important to name them for readable code.
I have 2 personal rules for the ~ shortcut (it, essentially, makes an anonymous function … function(x) { ... } … around the bits past the ~). The first (which is the range()/mdy() example) is that it should be used with a limited number of nested parenthesis calls (2-3 max) and/or limited %>% chains (2-5). If you go beyond that, another personal rule (which I break far too much) is to not have it be an anonymous function but to be a full-on named function that’s called in the non-~ way. The second is when you need to supply the . (or .x/.y) multiple times, such as map_chr(LETTERS, ~sprintf("%s or %s", ., tolower(.))).
Hopefully that helps a bit and I’ll have more examples in January.
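A compact, runnable recap of those toy examples in one chunk:
library(purrr)

# "atomic" calls: pass the function directly, name any extra parameters
map_chr(LETTERS, tolower)
map_chr(state.name, substr, start = 2, stop = 4)

# ~ shortcut: short anonymous functions, with . reused as needed
map_chr(LETTERS, ~sprintf("%s or %s", ., tolower(.)))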
Thanks for the detailed explanation. I have been inspired by your opening sentence: “I’ve converted the vast majority of my *apply usage over to purrr functions.” to look at purrr again. I’ve been trying to determine the utility of purrr, and it’s been a struggle to figure this out. But now understanding that it’s a replacement for the somewhat inconsistent *apply functions, it makes more sense. Yesterday I tried map() on a multistep explicit function and it worked like a charm.