Category Archives: ggplot

I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet:

Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick.

I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments!

(If folks want some exposition here, let me know in the comments and I’ll rig up an R Markdown document with more step-by-step details.)

Full RStudio project file (with pre-cached data) is on GitHub.

library(rprojroot)
library(sf)
library(magick)
library(tidyverse) # NOTE: Needs github version of ggplot2

root <- find_rstudio_root_file()

# "borrow" the files from SmokyMountains.com, but be nice and cache them to
# avoid hitting their web server for every iteration

c("https://smokymountains.com/wp-content/themes/smcom-2015/to-delete/js/us.json",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage2.tsv",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage-2017.csv") %>%
  walk(~{
    sav_tmp <- file.path(root, "data", basename(.x))
    if (!file.exists(sav_tmp)) download.file(.x, sav_tmp)
  })

# next, we read in the GeoJSON file twice. first, to get the states
states_sf <- read_sf(file.path(root, "data", "us.json"), "states", stringsAsFactors = FALSE)

# we only want the continental US
states_sf <- filter(states_sf, !(id %in% c("2", "15", "72", "78")))

# it doesn't have a CRS so we give it one
st_crs(states_sf) <- 4326

# I ran into hiccups using coord_sf() to do this, so we convert it to Albers here
states_sf <- st_transform(states_sf, 5070)


# next we read in the counties
counties_sf <- read_sf(file.path(root, "data", "us.json"), "counties", stringsAsFactors = FALSE)
st_crs(counties_sf) <- 4326
counties_sf <- st_transform(counties_sf, 5070)

# now, we read in the foliage data
foliage <- read_csv(file.path(root, "data", "foliage-2017.csv"),
                    col_types = cols(.default=col_double(), id=col_character()))

# and, since we have a lovely sf tidy data frame, bind it together
left_join(counties_sf, foliage, "id") %>%
  filter(!is.na(rate1)) -> foliage_sf

# now, we do some munging so we have better labels and so we can
# iterate over the weeks
gather(foliage_sf, week, value, -id, -geometry) %>%
  mutate(value = factor(value)) %>%
  filter(week != "rate1") %>%
  mutate(week = factor(week,
                       levels=unique(week),
                       labels=format(seq(as.Date("2017-08-26"),
                                         as.Date("2017-11-11"), "1 week"),
                                     "%b %d"))) -> foliage_sf

# now we make a ggplot object for each week and save it out to a png
pb <- progress_estimated(nlevels(foliage_sf$week))
walk(1:nlevels(foliage_sf$week), ~{

  pb$tick()$print()

  xdf <- filter(foliage_sf, week == levels(week)[.x])

  ggplot() +
    geom_sf(data=xdf, aes(fill=value), size=0.05, color="#2b2b2b") +
    geom_sf(data=states_sf, color="white", size=0.125, fill=NA) +
    viridis::scale_fill_viridis(
      name=NULL,
      discrete = TRUE,
      labels=c("No Change", "Minimal", "Patchy", "Partial", "Near Peak", "Peak", "Past Peak"),
      drop=FALSE
    ) +
    labs(title=sprintf("Foliage: %s ", unique(xdf$week))) +
    ggthemes::theme_map() +
    theme(panel.grid=element_line(color="#00000000")) +
    theme(panel.grid.major=element_line(color="#00000000")) +
    theme(legend.position="right") -> gg

  ggsave(sprintf("%02d.png", .x), gg, width=5, height=3)

})

# we read them all back in and animate the foliage
sprintf("%02d.png", 1:nlevels(foliage_sf$week)) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(1)
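If you want to keep the animation around as a file rather than just viewing it, magick::image_write() will save the animated image. A quick sketch (same pipeline as above, assigned to an object, with an arbitrary output filename):

# assign the animation and write it out as a GIF
sprintf("%02d.png", 1:nlevels(foliage_sf$week)) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(1) -> foliage_anim

image_write(foliage_anim, path = "foliage.gif")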

Political machinations are a tad insane in the U.S. these days & I regularly hit up @ProPublica & @GovTrack sites (& sub to the GovTrack e-mail updates) as I try to be an informed citizen, especially since I’ve got a Senator and Representative who seem to be in the sway of ?.

I’ve always appreciated the ProPublica and GovTrack cartograms as they present a great deal of information in a compact space (especially the House versions). Something nudged me into starting an R package to let folks create them in R (mainly with ggplot2 but an htmlwidget version is planned), which I’ve dubbed voteogram.

With the voteogram package, you can:

  • pull ProPublica roll call vote data for the 101st Congress up through today (via roll_call())
  • plot ProPublica-esque Senate roll call vote cartograms
  • plot ProPublica-esque House roll call vote cartograms
  • plot GovTrack-esque House roll call vote cartograms

GovTrack uses — what I’ve seen @thosjleeper refer to as — a “parliamentary plot” for their version of the Senate roll call cartogram and sir Leeper already has that type of plot covered in ggparliament, so I’ve just focused on the other ones here.

Roll Call

You need data for these cartogram generation functions and you can specify your own populated data frame (the needed columns are in the manual pages for the cartogram plotters). However, you’ll likely want to plot existing data that others have tallied and ProPublica makes that super simple since each vote is in a standalone JSON file. All you have to do is specify whether you want the roll call vote for the house or senate, the Congress number (current one is 115), the session number (current one is 1) and the roll call vote number.

For example, we can see all the ~~idiots~~ Representatives who voted, recently, to ~~kill people~~ repeal the ACA with the following function call:

(h256 <- roll_call("house", 115, 1, 256))
## 115th Congress / Session: 1 / House Roll Call: 256 / May  4, 2017
## 
## American Health Care Act
## 
## Result: Passed

str(h256, max.level = 1)
## List of 29
##  $ vote_id              : chr "H_115_1_256"
##  $ chamber              : chr "House"
##  $ year                 : int 2017
##  $ congress             : chr "115"
##  $ session              : chr "1"
##  $ roll_call            : int 256
##  $ needed_to_pass       : int 216
##  $ date_of_vote         : chr "May  4, 2017"
##  $ time_of_vote         : chr "02:18 PM"
##  $ result               : chr "Passed"
##  $ vote_type            : chr "RECORDED VOTE"
##  $ question             : chr "On Passage"
##  $ description          : chr "American Health Care Act"
##  $ nyt_title            : chr "On Passage"
##  $ total_yes            : int 217
##  $ total_no             : int 213
##  $ total_not_voting     : int 1
##  $ gop_yes              : int 217
##  $ gop_no               : int 20
##  $ gop_not_voting       : int 1
##  $ dem_yes              : int 0
##  $ dem_no               : int 193
##  $ dem_not_voting       : int 0
##  $ ind_yes              : int 0
##  $ ind_no               : int 0
##  $ ind_not_voting       : int 0
##  $ dem_majority_position: chr "No"
##  $ gop_majority_position: chr "Yes"
##  $ votes                :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  435 obs. of  11 variables:
##  - attr(*, "class")= chr [1:2] "pprc" "list"

As you can see, it has a custom print function and the usable data (for cartographic needs) is in $votes. You can go to town with just that information, making bar charts or tracking individual Congress-critter votes.
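For example, here’s the kind of quick party-by-position tally you can make from it. This is only a sketch: I’m assuming the $votes tibble has party and position columns, so check names(h256$votes) for the actual column names before running it.

# hypothetical quick look: how each party voted on this roll call
# (assumes `party` and `position` columns exist in h256$votes)
library(tidyverse)

count(h256$votes, party, position) %>%
  ggplot(aes(position, n, fill = party)) +
  geom_col(position = "dodge") +
  labs(x = NULL, y = "Members", title = "House Roll Call 256, by party & position")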

Do your best to cache this data as you retrieve it. ProPublica is a non-profit and the JSON files are on AWS. While there’s a certain amount of free bandwidth-per-month allotted by Amazon’s S3 service, it’s best to make sure you’re not tipping them over in any given month. Plus, the vote data doesn’t change once it’s recorded. Consider donating to them if you decide to always grab fresh copies.
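One way to do that is a thin wrapper that saves each pull as an RDS file. This is just a sketch — the helper and the cache directory name are my own invention, not part of the package:

# cache roll call pulls locally; vote data never changes once recorded
cached_roll_call <- function(chamber, congress, session, number, cache_dir = "rc_cache") {
  dir.create(cache_dir, showWarnings = FALSE)
  fil <- file.path(cache_dir, sprintf("%s_%s_%s_%s.rds", chamber, congress, session, number))
  if (file.exists(fil)) return(readRDS(fil))
  rc <- voteogram::roll_call(chamber, congress, session, number)
  saveRDS(rc, fil)
  rc
}

h256 <- cached_roll_call("house", 115, 1, 256)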

There’s a fortify function for this object (it’s classed pprc) so you can pass it right into ggplot() for use or pipe it into a dplyr chain for aggregation & filtering.

House Rules

With the data in hand, we can make some cartograms (the real purpose of the package). I riffed off the ProPublica colors (and haven’t fully finished copying them yet as I need to search for 2 more categories of Independent voting colors) but you can replace them with anything you want. Just reset the scale and use the names in the exposed color value vectors.

There’s also a theme_voteogram(), which is designed to augment any base theme (like hrbrthemes::theme_ipsum_rc()) and is much like ggthemes::theme_map().

Here’s the ProPublica view for that particular vote:

house_carto(h256) +
  labs(x=NULL, y=NULL, 
       title="House Vote 256 - Passes American Health Care Act,\nRepealing Obamacare") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

The house_carto() function defaults to the ProPublica cartogram, but you can easily change that:

house_carto(h256, "gt") +
  labs(x=NULL, y=NULL, 
       title="House Vote 256 - Passes American Health Care Act,\nRepealing Obamacare") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

Senate Drools

Again, the senate_carto() function only has the ProPublica-esque cartogram available and works pretty much the same way after getting the Senate vote data:

sen <- roll_call("senate", 115, 1, 110)

senate_carto(sen) +
  labs(title="Senate Vote 110 - Invokes Cloture on Neil Gorsuch Nomination") +
  theme_ipsum_rc(plot_title_size = 24) +
  theme_voteogram()

FIN

There’s a bit of work left to do in the package (including an htmlwidget version). You’re invited to file PRs or Issues as you are so moved.

I caught this “gem” in the Wall Street Journal tonight:

It’s pretty hard to compare store-to-store, even though it is fairly clear which ones are going-going-gone. If we want to see the relative percentage of stores each chain is closing and also see how the chains stack up against each other, let’s make a column of 100% bars and label the total store count in each:

library(hrbrthemes)
library(tidyverse)

read.table(text='store,closing,total
"Radio Shack",550,1500
"Payless",400,2600
"Rue21",400,1100
"The Limited",250,250
"bebe",180,180
"Wet Seal",170,170
"Crocs",160,560
"JCPenny",138,1000
"American Apparel",110,110
"Kmart",109,735
"hhgregg",88,220
"Sears",41,695', sep=",", header=TRUE, stringsAsFactors=FALSE) %>% 
  as_tibble() %>% 
  mutate(remaining = total - closing,
         gone = round((closing/total) * 100)/100,
         stay = 1-gone,
         rem_lab = ifelse(remaining == 0, "", scales::comma(remaining))) %>% 
  arrange(desc(stay)) %>% 
  mutate(store=factor(store, levels=store)) -> closing_df

update_geom_font_defaults(font_rc)

ggplot(closing_df) +
  geom_segment(aes(0, store, xend=gone, yend=store, color="Closing"), size=8) +
  geom_segment(aes(gone, store, xend=gone+stay, yend=store, color="Remaining"), size=8) +
  geom_text(aes(x=0, y=store, label=closing), color="white", hjust=0, nudge_x=0.01) +
  geom_text(aes(x=1, y=store, label=rem_lab), color="white", hjust=1, nudge_x=-0.01) +
  scale_x_percent() +
  scale_color_ipsum(name=NULL) +
  labs(x=NULL, y=NULL, 
       title="Selected 2017 Store closings (estimated)",
       subtitle="Smaller specialty chains such as Bebe and American Apparel are closing their stores,\nwhile lareger chains such as J.C. Penny and Sears are scaling back their footprint.") +
  theme_ipsum_rc(grid="X") +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(legend.position=c(0.875, 1.025)) +
  theme(legend.direction="horizontal")

One might try circle packing or a treemap to show both relative store count and percentage, but I think the bigger story is the percent reduction for each retail chain. It’d be cool to see what others come up with.
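For the curious, here’s one possible riff using the treemapify package (my choice, not something used elsewhere in this post): tile area mapped to total store count, fill mapped to the share of stores closing. Consider it a sketch rather than a finished chart.

# sketch of the treemap alternative: area = total stores, fill = share closing
library(treemapify)

ggplot(closing_df, aes(area = total, fill = gone, label = store)) +
  geom_treemap(color = "white") +
  geom_treemap_text(color = "white", place = "topleft") +
  viridis::scale_fill_viridis(name = "Share closing", labels = scales::percent) +
  theme_ipsum_rc() +
  theme(legend.position = "bottom")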

By now, word of the forcible deplanement of a medical professional by United has reached even the remotest of outposts in the universe. Since the news brought this practice to global attention, I found some aggregate U.S. Gov data and made a quick, annual look at this soon after the incident:

While informative, that visualization left me wanting for more granular data. Alas, a super-quick search turned up empty.

However, within 24 hours I had a quick glance at a tweet (a link to it in the comments wld be ++gd if anyone fav’d it) that had a screen capture from a PDF from the U.S. DoT Air Travel Consumer Reports site.

There are individual pages for each monthly report which can be derived from the annual index pages. I crafted the URL scraping code below before inspecting an individual PDF. It turns out grabbing all the PDFs was not necessary since they don’t provide monthly figures for the involuntary disembarking. But, I wrote the code and it’ll likely be useful to someone out there so here it is:

library(rvest)
library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

# some URLs generate infinite redirection loops so be safe out there
safe_read_html <- safely(read_html)

# grab the individual page URLs for each month available in each year
c("https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2017",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2016",
  "https://www.transportation.gov/airconsumer/air-travel-consumer-reports-2015") %>%
  map(function(x) {
    read_html(x) %>%
      html_nodes("a[href*='air-travel-consumer-report']") %>%
      html_attr('href')
  }) %>%
  flatten_chr() %>%
  discard(stri_detect_regex, "feedback|/air-travel-consumer-reports") %>% # filter out URLs we don't need
  sprintf("https://www.transportation.gov%s", .) -> main_urls # make them useful

# now, read in all the individual pages. 
# do this separate from URL grabbing above and the PDF URL extraction
# below just to be even safer. 
map(main_urls, safe_read_html) -> pages

# URLs that generate said redirection loops will not have a valid
# result so ignore them and find the URLs for the monthly reports
discard(pages, ~is.null(.$result)) %>%
  map("result") %>%
  map(~html_nodes(., "a[href*='pdf']") %>%
        html_attr('href') %>%
        keep(stri_detect_fixed, "ATCR")) %>%
  flatten_chr() -> pdf_urls

# download them, being kind to the DoT server and not re-downloading
# anything we've successfully downloaded already. I really wish this
# were built into download.file()
dir.create("atcr_pdfs")
walk(pdf_urls, ~if (!file.exists(file.path("atcr_pdfs", basename(.))))
  download.file(., file.path("atcr_pdfs", basename(.))))

It also wasn’t a complete waste for me since the PDF reports have monthly data in other categories and it did provide me with 3 years of data to compare visually.

The table with annual data looks like this in the PDF:

and, that page looks like this after it gets processed by pdftools::pdf_text():

The format is mostly consistent across the three files, but there are enough differences to require edge-case handling. Still, it’s not too much code to get three separate tables:

# read in each PDF; find the pages with the tables we need to scrape;
# enable the text table to be read with read.table() and save the
# results
c("2017MarchATCR.pdf", "2016MarchATCR_2.pdf", "2015MarchATCR_1.pdf") %>%
  file.path("atcr_pdfs", .) %>%
  map(pdf_text) %>%
  map(~keep(.x, stri_detect_fixed, "PASSENGERS DENIED BOARDING")[[2]]) %>%
  map(stri_split_lines) %>%
  map(flatten_chr) %>%
  map(function(x) {
    y <- which(stri_detect_regex(x, "Rank|RANK|TOTAL"))
    grep("^\ +[[:digit:]]", x[y[1]:y[2]], value=TRUE) %>%
      stri_trim() %>%
      stri_replace_all_regex("([[:alpha:]])\\*+", "$1") %>%
      stri_replace_all_regex(" ([[:alpha:]])", "_$1") %>%
      paste0(collapse="\n") %>%
      read.table(text=., header=FALSE, stringsAsFactors=FALSE)
  }) -> denied

denied

## [[1]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1   _HAWAIIAN_AIRLINES     326     49  10,824,495 0.05     358     29  10,462,344 0.03
## 2   2     _DELTA_AIR_LINES 129,825  1,238 129,281,098 0.10 145,406  1,938 125,044,855 0.15
## 3   3      _VIRGIN_AMERICA   2,375     94   7,945,329 0.12   1,722     80   6,928,805 0.12
## 4   4     _ALASKA_AIRLINES   6,806    931  23,390,900 0.40   5,412    740  22,095,126 0.33
## 5   5     _UNITED_AIRLINES  62,895  3,765  86,836,527 0.43  81,390  6,317  82,081,914 0.77
## 6   6     _SPIRIT_AIRLINES  10,444  1,117  19,418,650 0.58   6,589    496  16,010,164 0.31
## 7   7   _FRONTIER_AIRLINES   2,096    851  14,666,332 0.58   2,744  1,232  12,343,540 1.00
## 8   8   _AMERICAN_AIRLINES  54,259  8,312 130,894,653 0.64  50,317  7,504  97,091,951 0.77
## 9   9     _JETBLUE_AIRWAYS   1,705  3,176  34,710,003 0.92   1,841     73  31,949,251 0.02
## 10 10    _SKYWEST_AIRLINES  41,476  2,935  29,986,918 0.98  51,829  5,079  28,562,760 1.78
## 11 11  _SOUTHWEST_AIRLINES  88,628 14,979 150,655,354 0.99  96,513 15,608 143,932,752 1.08
## 12 12 _EXPRESSJET_AIRLINES  33,590  3,182  21,139,038 1.51  42,933  4,608  24,736,601 1.86
##
## [[2]]
##    V1                   V2      V3     V4          V5   V6      V7     V8          V9  V10
## 1   1     _JETBLUE_AIRWAYS   1,841     73  31,949,251 0.02   2,006    650  29,264,332 0.22
## 2   2   _HAWAIIAN_AIRLINES     358     29  10,462,344 0.03     366    116  10,084,811 0.12
## 3   3      _VIRGIN_AMERICA   1,722     80   6,928,805 0.12     910     57   6,438,023 0.09
## 4   4     _DELTA_AIR_LINES 145,406  1,938 125,044,855 0.16 107,706  4,052 115,737,180 0.35
## 5   5     _SPIRIT_AIRLINES   6,589    496  16,010,164 0.31    ****   ****        **** ****
## 6   6     _ALASKA_AIRLINES   5,412    740  22,095,126 0.33   4,176    864  19,838,878 0.44
## 7   7     _UNITED_AIRLINES  81,390  6,317  82,081,914 0.77  64,968  9,078  77,317,281 1.17
## 8   8   _AMERICAN_AIRLINES  50,317  7,504  97,091,951 0.77  35,152  3,188  77,065,600 0.41
## 9   9   _FRONTIER_AIRLINES   2,744  1,232  12,343,540 1.00   3,864  1,616  11,787,602 1.37
## 10 10  _SOUTHWEST_AIRLINES  96,513 15,608 143,932,752 1.08  82,039 12,041 116,809,601 1.03
## 11 11    _SKYWEST_AIRLINES  51,829  5,079  28,562,760 1.78  42,446  7,170  26,420,593 2.71
## 12 12 _EXPRESSJET_AIRLINES  42,933  4,608  24,736,601 1.86  55,525  7,961  29,344,974 2.71
## 13 13           _ENVOY_AIR  18,125  2,792  11,901,028 2.35  18,615  2,501  15,441,723 1.62
##
## [[3]]
##    V1                   V2      V3     V4          V5   V6     V7    V8          V9  V10
## 1   1      _VIRGIN_AMERICA     910     57   6,438,023 0.09    351    26   6,244,574 0.04
## 2   2   _HAWAIIAN_AIRLINES     366    116  10,084,811 0.12  1,147   172   9,928,830 0.17
## 3   3     _JETBLUE_AIRWAYS   2,006    650  29,264,332 0.22    502    19  28,166,771 0.01
## 4   4     _DELTA_AIR_LINES 107,706  4,052 115,737,180 0.35 81,025 6,070 106,783,155 0.57
## 5   5   _AMERICAN_AIRLINES  60,924  7,471 135,748,581 0.55     **    **          **   **
## 6   6     _ALASKA_AIRLINES   4,176    864  19,838,878 0.44  3,834   714  18,517,953 0.39
## 7   7  _SOUTHWEST_AIRLINES  88,921 13,899 125,381,374 1.11    ***   ***         ***  ***
## 8   8     _UNITED_AIRLINES  64,968  9,078  77,317,281 1.17 57,716 9,015  77,212,471 1.17
## 9   9   _FRONTIER_AIRLINES   3,864  1,616  11,787,602 1.37  3,493 1,272  10,361,896 1.23
## 10 10           _ENVOY_AIR  18,615  2,501  15,441,723 1.62 19,659 1,923  16,939,092 1.14
## 11 11 _EXPRESSJET_AIRLINES  55,525  7,961  29,344,974 2.71 47,844 6,422  31,356,714 2.05
## 12 12    _SKYWEST_AIRLINES  42,446  7,170  26,420,593 2.71 35,942 6,768  26,518,312 2.55

And, it’s not too much more work to get that into a usable, single data frame:

map2_df(2016:2014, denied, ~{
  .y$year <- .x
  set_names(.y[,c(1:6,11)],
            c("rank", "airline", "voluntary_denied", "involuntary_denied",
              "enplaned_ct", "involuntary_db_per_10k", "year")) %>%
    mutate(airline = stri_trans_totitle(stri_trim(stri_replace_all_fixed(airline, "_", " ")))) %>%
    readr::type_convert() %>%
    tbl_df()
}) %>%
  select(-rank) -> denied

glimpse(denied)

## Observations: 37
## Variables: 6
## $ airline                <chr> "Hawaiian Airlines", "Delta Air Lines", "Virgin Americ...
## $ voluntary_denied       <dbl> 326, 129825, 2375, 6806, 62895, 10444, 2096, 54259, 17...
## $ involuntary_denied     <dbl> 49, 1238, 94, 931, 3765, 1117, 851, 8312, 3176, 2935, ...
## $ enplaned_ct            <dbl> 10824495, 129281098, 7945329, 23390900, 86836527, 1941...
## $ involuntary_db_per_10k <dbl> 0.05, 0.10, 0.12, 0.40, 0.43, 0.58, 0.58, 0.64, 0.92, ...
## $ year                   <int> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...

denied

## # A tibble: 37 × 6
##              airline voluntary_denied involuntary_denied enplaned_ct
##                <chr>            <dbl>              <dbl>       <dbl>
## 1  Hawaiian Airlines              326                 49    10824495
## 2    Delta Air Lines           129825               1238   129281098
## 3     Virgin America             2375                 94     7945329
## 4    Alaska Airlines             6806                931    23390900
## 5    United Airlines            62895               3765    86836527
## 6    Spirit Airlines            10444               1117    19418650
## 7  Frontier Airlines             2096                851    14666332
## 8  American Airlines            54259               8312   130894653
## 9    Jetblue Airways             1705               3176    34710003
## 10  Skywest Airlines            41476               2935    29986918
## # ... with 27 more rows, and 2 more variables: involuntary_db_per_10k <dbl>, year <int>

Airlines merge and the PDF does account for that (to some degree), but I’m not writing a news story and only care about the airlines with three years of data, since I have (for the most part) only ever flown on ones in that list. So, the last step is to filter the list to those with three years of data and make a multi-column slopegraph/bumps chart based on the involuntary disembarking rate per 10k passengers (normalized rates FTW!):

select(denied, airline, year, involuntary_db_per_10k) %>%
  group_by(airline) %>%
  mutate(yr_ct = n()) %>%
  ungroup() %>%
  filter(yr_ct == 3) %>%
  select(-yr_ct) %>%
  mutate(year = factor(year, rev(c(max(year)+1, unique(year))))) -> plot_df

update_geom_font_defaults(font_rc, size = 3)

ggplot() +
  geom_line(data = plot_df, aes(year, involuntary_db_per_10k, group=airline, colour=airline)) +
  geom_text(data = filter(plot_df, year=='2016') %>% mutate(lbl = sprintf("%s (%s)", airline, involuntary_db_per_10k)),
            aes(x=year, y=involuntary_db_per_10k, label=lbl, colour=airline), hjust=0,
            nudge_y=c(0,0,0,0,0,0,0,0,-0.0005,0.03,0), nudge_x=0.015) +
  scale_x_discrete(expand=c(0,0), labels=c(2014:2016, ""), drop=FALSE) +
  scale_y_continuous(trans="log1p") +
  ggthemes::scale_color_tableau() +
  labs(x=NULL, y=NULL,
       title="Involuntary Disembark Rate Per 10K Passengers",
       subtitle="Y-axis log scale; Only included airlines with 3-year span data",
       caption="Source: U.S. DoT Air Travel Consumer Reports <https://www.transportation.gov/airconsumer/air-travel-consumer-reports>") +
  theme_ipsum_rc(grid="X") +
  theme(plot.caption=element_text(hjust=0)) +
  theme(legend.position="none")

I’m really glad I don’t fly on JetBlue much anymore.

FIN

The code and a CSV of the cleaned data are in this gist, and the code is also in this RPub.

I’m also glad to now know about a previously hidden, helpful resource for consumers who have to fly on U.S. carriers.

Back in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017, we’re slated to get 12-18″ of snow up here in Maine, and @mrshrbrmstr asked how often this — snow in May — has occurred near us.

As with all of these “R⁶” posts, exposition is minimal and the focus is generally to demonstrate one small concept.

What I’ve done here (first) is make a full tidyverse update to the snowfirst code posted in the aforementioned blog post. You’ll need to clone that repo if you’re trying to work verbatim from the code below (otherwise, just change the file path code):

library(rprojroot)
library(stringi)
library(hrbrthemes)
library(tidyverse)

pre <- find_rstudio_root_file()

# Get and read in Maine precip ------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/state17_ME.txt.gz"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = fil,
         col_positions = fwf_widths(c(6, 4, 2, 4, rep(c(5, 1, 1, 1), 31)),
                                    col_names = c("coop_id", "year", "month", "element",
                                                  flatten_chr(map(1:31, ~paste("r_", c("v", "fm", "fq", "fs"),
                                                                               .x, sep=""))))),
         col_types = paste0("ciic", paste0(rep("iccc", 31), collapse=""), collapse=""),
         na = c("", "NA", "-", "-9999")) %>%
  gather(day, value, starts_with("r_v")) %>%
  select(-starts_with("r_")) %>%
  mutate(day = as.numeric(stri_replace_first_fixed(day, "r_v", ""))) %>%
  mutate(date = sprintf("%s-%02d-%02d", year, month, day)) -> daily_wx

# Read in stations ------------------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/ushcn-stations.txt"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = file.path(pre, "data", "ushcn-stations.txt"),
         col_positions = fwf_widths(c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                                    col_names = c("coop_id", "latitude", "longitude",
                                                  "elevation", "state", "name",
                                                  "component_1", "component_2",
                                                  "component_3", "utc_offset")),
         col_types = "cdddcccccc") -> stations

closestStation <- function(stations, lat, lon, restrict_to = NULL) {
  if (!is.null(restrict_to)) stations <- filter(stations, state == restrict_to)
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                            (stations$longitude-lon)^2))
  stations[index,]
}

# find the station closest to home --------------------------------------------------

(near_me <- closestStation(stations, 43.2672, -70.8617, restrict_to="ME"))

Now that we have the data, the short lesson here is just exposing the fact that you can get blank facets for free with ggplot2. I’m pointing this out as many folks seem to not like reading R documentation or miss things in said documentation (in fact, I had to be instructed today by @thomasp85 about a ggplot2 theme element setting that I didn’t know about and should have since I do try to keep up).

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(
    n = n / 10, # readings are in 10ths of inches
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = lubridate::month(date, TRUE, FALSE)
  ) %>%
  ggplot(aes(x=date, y=n)) +
  geom_segment(aes(xend=date, yend=0), size=0.75, color="#9ecae1") +
  scale_y_continuous(limits=c(0, 65)) +
  facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free") +
  labs(x=NULL, y="inches", title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="Y", axis_text_size=8)

Without ggplot2 helping us out, we would have had to do some work to make those no-value facets show up. I also like how there are no x-axis labels since there’s no data. ggplot2::facet_wrap() has many, very granular options for customizing the appearance of facets:

facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
           shrink = TRUE, labeller = "label_value", as.table = TRUE,
           switch = NULL, drop = TRUE, dir = "h", strip.position = "top")

If you haven’t played with them, you can use this example to try them out.
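For instance (purely illustrative), swapping the facet_wrap() call in the chart above for something like the following moves the strips below the panels and fills the grid column-wise instead of row-wise:

# same faceting variable, different layout knobs
facet_wrap(~month_name, ncol = 4, dir = "v", strip.position = "bottom",
           drop = FALSE, scales = "free")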

Fin

Even though that visualization gets the message across, I kinda like this view a bit better:

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(n = n / 10) %>%
  complete(year, month=1:12) %>%
  mutate(
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = factor(lubridate::month(date, TRUE, FALSE), levels=rev(month.name))
  ) %>% 
  ggplot(aes(year, month_name)) +
  geom_tile(aes(fill=n), color="#b2b2b2", size=0.15) +
  scale_x_continuous(expand=c(0,0.15), position="top") +
  viridis::scale_fill_viridis(name = "Total inches", na.value="white") +
  labs(x=NULL, y=NULL, title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="", axis_text_size = 10) +
  guides(fill=guide_colourbar(label.position = "top", direction = "horizontal", title.vjust = 0)) +
  theme(legend.title = element_text(size=10)) +
  theme(legend.key.height = unit(0.5, "lines")) +
  theme(legend.position = c(0.9, 1.25))

The precision is lacking in the heatmap view, but you get a quick impression of when it has/hasn’t snowed. Plus, you get to use viridis.

All the updated code is in the snowfirst repo.

Crank out your own small code snippets or ideas for the R community. R⁶ is an open tag and perhaps we can band together to make a distributed cadre of helpful, digestible posts the R community can benefit from.

@mrshrbrmstr hinted that she would like this post by @RickWicklin translated into R for her stats class. She’s quite capable of cranking out the translation of the core component of that post — a call to chisq.test — but she wanted to show the entire post (in R) and really didn’t have time (she’s teaching a full load of classes and is department chair + a mom). I suggested that I, too, was a bit short on time which resulted in her putting out a call to the twitterverse for assistance which ultimately ended up coercing me into tackling the problem.

I won’t re-create Rick’s post or my riff of it here since you can check out the RPubs page for it and also get the source (you can get the source from the Rmd, too, but some folks like gists better).

So, why a blog post if not to present the translation?

Two reasons: I needed tidy Goodman simultaneous confidence intervals (SCIs) and Rick’s final plot was just begging to have “real” M&M’s as the point “geom”.

S[c]imple & Tidy SCIs

We’ve got options for calculating simultaneous CIs in R and I could have just used DescTools::MultinomCI, except that I wanted a tibble and it returns a matrix, plus it only has three of the more common methods implemented (yes, I am the ultimate package snob). I recalled that the CoinMinD package was tailor-made for working with SCIs and has many more methods implemented, but its output is literally only that: print()ed to the console.
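To make the matrix gripe concrete, here’s roughly what the DescTools route looks like if you want to end up with a tibble anyway. The color counts are made up purely for illustration, and the tidying step is mine, not part of DescTools:

library(DescTools)
library(tidyverse)

# toy counts, purely for illustration
counts <- c(blue = 133, brown = 96, green = 139, orange = 133, red = 108, yellow = 103)

# MultinomCI() hands back a matrix, so we bolt the category names back on ourselves
MultinomCI(counts, conf.level = 0.95, method = "goodman") %>%
  as_tibble() %>%
  mutate(color = names(counts)) %>%
  select(color, everything())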

Yes, I shouted in disbelief at the glowing rectangle in front of me when I discovered that print()-only behavior, almost as loudly as you did when you read the previous paragraph.

The algorithms implemented in CoinMinD are just dandy and the package is coming up on its 4th birthday. So, as a present from it (via me) to the R community, I whipped together scimple, which generates tidy tibbles and has a function scimple_ci() that is similar to binom::binom.confint() in that it will generate the SCIs for all the available (non-Bayesian) methods, including Goodman.

Kick the tyres (pls!) and drop issues and/or PRs as you see fit.

You can’t plot just one

Rick’s post analyzes distributions of M&M’s, so I went to the official M&M’s site to grab the official colors for the ones in his data set. I was casually going about making the rest of the post with standard points and a superimposed white “m” when it dawned on me that those lentils (yes, it seems the candies are called lentils, or at least their icons are) were all over the M&M’s site. After some site spelunking with Chrome Developer Tools, I had the URLs for the candies in question and managed to use the nascent ggimage::geom_image() to place them on the plot:

The plot is a bit sparse as you have to get the aspect ratio just right to keep those tasty, tiny circles as circles.
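If you want to try the same trick, the core of it is just mapping an image path or URL to the image aesthetic in ggimage::geom_image(). The file names and proportions below are placeholders (not the real lentil assets or Rick’s data), purely to show the mechanics:

library(ggimage)
library(tidyverse)

# placeholder images and made-up proportions, just to demo geom_image()
tibble(
  color    = c("blue", "green", "red"),
  observed = c(0.22, 0.18, 0.13),
  img      = c("blue_lentil.png", "green_lentil.png", "red_lentil.png")
) %>%
  ggplot(aes(color, observed)) +
  geom_image(aes(image = img), size = 0.08) +
  labs(x = NULL, y = "Observed proportion")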

The new geom_image() opens up many new possibilities for R visualizations (and not all are good possibilities). I think @mrshrbrmstr’s students got a kick out of a stats-y plot having real M&M’s on it so it worked OK this time. Just be wary of using gratuitous imagery and overdoing your watermarking.

As stated earlier you can get the code and see how you can improve upon Rick’s original post and my attempt at a quick riff. If you do end up cranking something out, drop a comment here or a tweet (@hrbrmstr) to show off your creation(s)!

I’m pleased to announce the inaugural release of my hrbrthemes (0.1.0) package on CRAN.

The primary goal of said package is to provide opinionated typographical and other aesthetic defaults for ggplot2 charts.

Two core themes are included: theme_ipsum() (Arial Narrow) and theme_ipsum_rc() (Roboto Condensed).

The Roboto Condensed Google Font comes with the package, along with an installer for said font (the installer is an R function; you still need to install the font on your OS manually).

Other niceties include:

  • scale_[xy]_comma() — shortcut for expand=c(0,0), labels=scales::comma
  • scale_[xy]_percent() — shortcut for expand=c(0,0), labels=scales::percent
  • scale_[color|fill]_ipsum() — discrete scale with 9 colors
  • gg_check() — pass-through spell checker for ggplot2 label elements
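Here’s a small, contrived example of a few of those working together; gg_check() takes the finished ggplot object and should flag the deliberately misspelled word in the title:

library(hrbrthemes)
library(tidyverse)

# contrived data so scale_y_comma() has something sizable to format
df <- tibble(year = 2010:2016,
             users = c(12000, 18500, 26000, 41000, 56000, 70500, 90000))

gg <- ggplot(df, aes(year, users)) +
  geom_col(fill = "#2b2b2b") +
  scale_y_comma() +
  labs(x = NULL, y = "Users",
       title = "Totall users by year") +  # intentional typo for gg_check() to catch
  theme_ipsum_rc(grid = "Y")

gg_check(gg)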

Source version is tracked on GitHub.

Critiques, bug reports and enhancement requests are most welcome as GitHub issues.

The kind folks over at @RStudio gave a nod to my recently CRAN-released epidata package in their January data package roundup and I thought it might be useful to give it one more showcase using the recent CRAN update to ggalt and the new hrbrthemes (github-only for now) packages.

Labor force participation rate

The U.S. labor force participation rate (LFPR) is an oft-overlooked and under- or mis-reported economic indicator. I’ll borrow the definition from Investopedia:

The participation rate is a measure of the active portion of an economy’s labor force. It refers to the number of people who are either employed or are actively looking for work. During an economic recession, many workers often get discouraged and stop looking for employment, resulting in a decrease in the participation rate.

Population age distributions and other factors are necessary to honestly interpret this statistic. Parties in power usually dismiss/ignore this statistic outright and their opponents tend to wholly embrace it for criticism (it’s an easy target if you’re naive). “Yay” partisan democracy.

Since the LFPR has nuances when looked at categorically, let’s take a look at it by attained education level to see how that particular view has changed over time (at least since the gov-quants have been tracking it).

We can easily grab this data with epidata::get_labor_force_participation_rate() (and we’ll set up some library() calls while we’re at it):

library(epidata)
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(ggalt)
library(tidyverse)
library(stringi)

part_rate <- get_labor_force_participation_rate("e")

glimpse(part_rate)
## Observations: 457
## Variables: 7
## $ date              <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 1979-05-01...
## $ all               <dbl> 0.634, 0.634, 0.635, 0.636, 0.636, 0.637, 0.637, 0.637, 0.638, 0.638, 0...
## $ less_than_hs      <dbl> 0.474, 0.475, 0.475, 0.475, 0.475, 0.474, 0.474, 0.473, 0.473, 0.473, 0...
## $ high_school       <dbl> 0.690, 0.691, 0.692, 0.692, 0.693, 0.693, 0.694, 0.694, 0.695, 0.696, 0...
## $ some_college      <dbl> 0.709, 0.710, 0.711, 0.712, 0.712, 0.713, 0.712, 0.712, 0.712, 0.712, 0...
## $ bachelor's_degree <dbl> 0.771, 0.772, 0.772, 0.773, 0.772, 0.772, 0.772, 0.772, 0.772, 0.773, 0...
## $ advanced_degree   <dbl> 0.847, 0.847, 0.848, 0.848, 0.848, 0.848, 0.847, 0.847, 0.848, 0.848, 0...

One of the easiest things to do is to use ggplot2 to make a faceted line chart by attained education level. But, let’s change the labels so they are a bit easier on the eyes in the facets and switch the facet order from alphabetical to something more useful:

gather(part_rate, category, rate, -date) %>% 
  mutate(category=stri_replace_all_fixed(category, "_", " "),
         category=stri_trans_totitle(category),
         category=stri_replace_last_regex(category, "Hs$", "High School"),
         category=factor(category, levels=c("Advanced Degree", "Bachelor's Degree", "Some College", 
                                            "High School", "Less Than High School", "All")))  -> part_rate

Now, we’ll make a simple line chart, tweaking the aesthetics just a bit:

ggplot(part_rate) +
  geom_line(aes(date, rate, group=category)) +
  scale_y_percent(limits=c(0.3, 0.9)) +
  facet_wrap(~category, scales="free") +
  labs(x=paste(format(range(part_rate$date), "%Y-%b"), collapse=" to "), 
       y="Participation rate (%)", 
       title="U.S. Labor Force Participation Rate",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="XY", axis="XY")

The “All” view is interesting in that the LFPR has held fairly “steady” between 60% & 70%. Those individual and fractional percentage points actually translate to real humans, so the “minor” fluctuations do matter.

It’s also interesting to see the direct contrast between the starting historical rate and the current rate (you could also do the same with min/max rates, etc.). We can use a “dumbbell” chart to compare the 1978 value to today’s value, but we’ll need to reshape the data a bit first:

group_by(part_rate, category) %>% 
  arrange(date) %>% 
  slice(c(1, n())) %>% 
  spread(date, rate) %>% 
  ungroup() %>% 
  filter(category != "All") %>% 
  mutate(category=factor(category, levels=rev(levels(category)))) -> rate_range

filter(part_rate, category=="Advanced Degree") %>% 
  arrange(date) %>% 
  slice(c(1, n())) %>% 
  mutate(lab=lubridate::year(date)) -> lab_df

(We’ll be using the extra data frame to add labels to the chart.)

Now, we can compare the various ranges, once again tweaking aesthetics a bit:

ggplot(rate_range) +
  geom_dumbbell(aes(y=category, x=`1978-12-01`, xend=`2016-12-01`),
                size=3, color="#e3e2e1",
                colour_x = "#5b8124", colour_xend = "#bad744",
                dot_guide=TRUE, dot_guide_size=0.25) +
  geom_text(data=lab_df, aes(x=rate, y=5.25, label=lab), vjust=0) +
  scale_x_percent(limits=c(0.375, 0.9)) +
  labs(x=NULL, y=NULL,
       title=sprintf("U.S. Labor Force Participation Rate %s-Present", lab_df$lab[1]),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="X")

Fin

One takeaway from both these charts is that it’s probably important to take education level into account when talking about the labor force participation rate. The get_labor_force_participation_rate() function — along with most other functions in the epidata package — also has options to factor the data by sex, race and age, so you can pull in all those views to get a more nuanced & informed understanding of this economic health indicator.
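For example (assuming the other breakdowns use the same single-letter codes as the "e" call above — check the epidata docs for the exact values), pulling the participation rate by gender would look something like this:

# assumption: "g" requests the gender breakdown, mirroring the "e" (education) call
part_rate_by_gender <- get_labor_force_participation_rate("g")
glimpse(part_rate_by_gender)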