
Category Archives: ggplot

I’m uncontainably excited to report that the ggplot2 extension package ggalt is now on CRAN.

The absolute best part of this package is the R community members who contributed suggestions and new geoms, stats, annotations and integration features. This release would not be possible without the PRs from:

  • Ben Bolker
  • Ben Marwick
  • Jan Schulz
  • Carson Sievert

and a host of folks who have made suggestions and have put up with broken GitHub builds along the way. Y’all are awesome.

Please see the vignette and graphics-annotated help pages for info on everything that’s available. Some highlights (a quick sketch of two of them follows the list) include:

  • multiple ways to render splines (so you can make those cool, smoothed D3-esque line charts :-)
  • geom_cartogram() which replicates the old functionality of geom_map() so your old mapping code doesn’t break anymore
  • a re-re-mended coord_proj() (but, read on about why you should be re-thinking of how you do maps in ggplot2)
  • lollipop charts (geom_lollipop())
  • dumbbell charts (geom_dumbbell())
  • step-ribbon charts
  • the ability to easily encircle points (beyond those boring ellipses)
  • byte formatters (i.e. turn 1024 to 1 Kb, etc)
  • better integration with plotly
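To give a quick flavor of a couple of those highlights, here’s a minimal sketch of geom_lollipop() and geom_dumbbell() on made-up data (the data frames and values below are purely illustrative):

library(ggplot2)
library(ggalt)

# lollipop chart: a thin stem plus a point instead of a chunky bar
df <- data.frame(category=c("A", "B", "C"), value=c(10, 24, 17))
ggplot(df, aes(x=category, y=value)) +
  geom_lollipop(point.size=3)

# dumbbell chart: one segment per row, spanning x to xend
df2 <- data.frame(state=c("ME", "NH", "VT"),
                  pct_2015=c(0.044, 0.034, 0.037),
                  pct_2016=c(0.039, 0.029, 0.032))
ggplot(df2, aes(x=pct_2015, xend=pct_2016, y=state)) +
  geom_dumbbell(size=1.5)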

If you do any mapping in ggplot2, please follow the machinations of geom_sf() and the sf package. Ed and the rest of the R spatial community have done 100% outstanding work here and it is going to change how you think and work spatially in R forever (in an awesome way). I hope to retire coord_proj() and geom_cartogram() some day soon thanks to all their hard work.
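If you want a taste of that future now, here’s a minimal sketch using sf’s bundled North Carolina shapefile (note: at the time of writing geom_sf() lives in the GitHub/development version of ggplot2, so treat this as a preview):

library(sf)      # install.packages("sf")
library(ggplot2) # geom_sf() is in the development version of ggplot2

nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet=TRUE)

ggplot(nc) +
  geom_sf(aes(fill=AREA)) +
  coord_sf()   # handles projection/aspect for you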

Your contributions, feedback and suggestions are welcome and encouraged. The next steps for me w/r/t ggalt are ensuring 100% plotly coverage since it’s the best way to make your ggplot2 plots interactive. There are a few more additions that didn’t make it into this release that I’ll also be integrating.
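For anyone who hasn’t tried the plotly bridge yet, the core pattern is just handing a finished ggplot object to plotly::ggplotly(); a minimal sketch (mtcars is just a stand-in data set):

library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
ggplotly(p)  # returns an interactive htmlwidget version of the static plot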

Please make sure to say “thank you” to the above contributors if you see them in person or on the internets. They’ve done great work and are exemplars of how awesome and talented the R community is.

ggplot() +
  geom_heart() +
  coord_equal() +
  labs(title="Happy Valentine's Day") +
  theme_heart()

Presented without exposition (since it’s a silly Geom)

This particular ❤️ math was pilfered this morning from @dmarcelinobr:

library(ggplot2)

geom_heart <- function(..., colour = "#67001f", size = 0.5, fill = "#b2182b",
                       mul = 1.0, na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {
  
  # parametric heart curve
  data <- data.frame(t=seq(0, 10*pi, by=0.1))
  
  x <- function(t) 16*sin(t)^3
  y <- function(t) 13*cos(t) - 5*cos(2*t) - 2*cos(3*t) - cos(4*t)
  
  data$x <- x(data$t) * mul   # mul scales the heart
  data$y <- y(data$t) * mul
  
  data <- rbind(data, data[1,])  # close the polygon
  
  layer(
    data = data,
    mapping = aes(x=x, y=y),
    stat = "identity",
    geom = ggplot2::GeomPolygon,
    position = "identity",
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      na.rm = na.rm,
      size = size,
      colour = colour,
      fill = fill,
      ...
    )
  )
  
}

theme_heart <- function() {
  ggthemes::theme_map(base_family = "Zapfino") +
    theme(plot.title=element_text(hjust=0.5, size=28)) +
    theme(plot.margin=margin(30,30,30,30))
}

I had started a “52 Vis” initiative back in 2016 to encourage folks to get practice making visualizations since that’s the only way to get better at virtually anything. Life got crazy, 52 Vis fell by the wayside and now there are more visible alternatives such as Makeover Monday and Workout Wednesday. They’re geared towards the “T” crowd (I’m not giving a closed source and locked-in-data product any more marketing than two links) but that doesn’t mean R, Python or other open-tool/open-data communities can’t join in for the ride and learning experience.

This week’s workout is a challenge to reproduce or improve upon a chart by Matt Stiles. You should go to both (give them the clicks and eyeballs they both deserve since they did great work). They both chose a line chart, but the whole point of these exercises is to try out new things to help you learn how to communicate better. I chose to use geom_segment() to make mini-column charts since that:

  • eliminates the giant rose-coloured rectangles that end up everywhere
  • helps show the differences a bit better (IMO), and
  • also helps highlight some of the states that have had more difficulties than others

Click/tap to “embiggen”. I kept the same dimensions that Andy did but unlike Matt’s creation this is a plain ol’ PNG as I didn’t want to deal with web fonts (I’m on a Museo Sans Condensed kick at the moment but don’t have it in my TypeKit config yet). I went with official annual unemployment numbers as they may be calculated/adjusted differently (I didn’t check, but I knew that data source existed, so I used it).

One reason I’m doing this is a quote on the Workout Wednesday post:

This will be a very tedious exercise. To provide some context, this took me 2-3 hours to create. Don’t get discouraged and don’t feel like you have to do it all in one sitting. Basically, try to make yours look identical to mine.

This took me 10 minutes to create in R:

#' ---
#' output:
#'  html_document:
#'    keep_md: true
#' ---
#+ message=FALSE
library(ggplot2)
library(hrbrmisc)
library(readxl)
library(tidyverse)

# Use official BLS annual unemployment data vs manually calculating the average
# Source: https://data.bls.gov/timeseries/LNU04000000?years_option=all_years&periods_option=specific_periods&periods=Annual+Data
read_excel("~/Data/annual.xlsx", skip=10) %>%
  mutate(Year=as.character(as.integer(Year)), Annual=Annual/100) -> annual_rate


# The data source Andy Kriebel curated for you/us: https://1drv.ms/x/s!AhZVJtXF2-tD1UVEK7gYn2vN5Hxn #ty Andy!
read_excel("~/Data/staadata.xlsx") %>%
  left_join(annual_rate) %>%
  filter(State != "District of Columbia") %>%
  mutate(
    year = as.Date(sprintf("%s-01-01", Year)),
    pct = (Unemployed / `Civilian Labor Force Population`),
    us_diff = -(Annual-pct),
    col = ifelse(us_diff<0,
               "Better than U.S. National Average",
               "Worse than U.S. National Average")
  ) -> df

credits <- "Notes: Excludes the District of Columbia. 2016 figure represents October rate.\nData: U.S. Bureau of Labor Statistics <https://www.bls.gov/lau/staadata.txt>\nCredit: Matt Stiles/The Daily Viz <thedailyviz.com>"

#+ state_of_us, fig.height=21.5, fig.width=8.75, fig.retina=2
ggplot(df, aes(year, us_diff, group=State)) +
  geom_segment(aes(xend=year, yend=0, color=col), size=0.5) +
  scale_x_date(expand=c(0,0), date_labels="'%y") +
  scale_y_continuous(expand=c(0,0), label=scales::percent, limit=c(-0.09, 0.09)) +
  scale_color_manual(name=NULL, expand=c(0,0),
                     values=c(`Better than U.S. National Average`="#4575b4",
                              `Worse than U.S. National Average`="#d73027")) +
  facet_wrap(~State, ncol=5, scales="free_x") +
  labs(x=NULL, y=NULL, title="The State of U.S. Jobs: 1976-2016",
       subtitle="Percentage points below or above the national unemployment rate, by state. Negative values represent unemployment rates\nthat were lower — or better, from a jobs perspective — than the national rate.",
       caption=credits) +
  theme_hrbrmstr_msc(grid="Y", strip_text_size=9) +
  theme(panel.background=element_rect(color="#00000000", fill="#f0f0f055")) +
  theme(panel.spacing=unit(0.5, "lines")) +
  theme(plot.subtitle=element_text(family="MuseoSansCond-300")) +
  theme(legend.position="top")

Swap out ~/Data for where you stored the files.

The “weird”-looking comments enable me to spin the script and are pretty much just the inverse markup for knitr R Markdown documents. As the comments say, you should really thank Andy for curating the BLS data for you/us.
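If you haven’t spun a script before, the whole invocation is one line (assuming the script above is saved as, say, state_of_us.R, a filename I just made up):

# knitr::spin() converts the #' / #+ annotated script to R Markdown and knits it;
# rmarkdown::render() on an .R file does the same thing under the hood
knitr::spin("state_of_us.R")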

If I really didn’t pine over aesthetics it would have taken me 5 minutes (most of that was waiting for re-rendering). Formatting the blog post took much longer. Plus, I can update the data source and re-run this in the future without clicking anything. This re-emphasizes a caution I tell my students: beware of dragon droppings (“drag-and-drop data science/visualization tools”).

Hopefully you presently follow or will start following Workout Wednesday and Makeover Monday and dedicate some time to hone your skills with those visualization katas.

The rest of the month is going to be super-hectic and it’s unlikely I’ll be able to do any more to help the push to CRAN 10K, so here’s a breakdown of CRAN and GitHub new packages & package updates that I felt were worth raising awareness on:

epidata

I mentioned this one last week but it wasn’t really a package announcement post. epidata is now on CRAN and is a package to pull data from the Economic Policy Institute (U.S. gov economic data, mostly). Their “hidden” API is well thought out and the data has been nicely curated (and seems to update monthly). It makes it super easy to do things like the following:

library(epidata)
library(tidyverse)
library(stringi)
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")

us_unemp <- get_unemployment("e")

glimpse(us_unemp)
## Observations: 456
## Variables: 7
## $ date            <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-0...
## $ all             <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.05...
## $ less_than_hs    <dbl> 0.100, 0.100, 0.099, 0.099, 0.099, 0.099, 0.09...
## $ high_school     <dbl> 0.055, 0.055, 0.054, 0.054, 0.054, 0.053, 0.05...
## $ some_college    <dbl> 0.050, 0.050, 0.050, 0.049, 0.049, 0.049, 0.04...
## $ college         <dbl> 0.032, 0.031, 0.031, 0.030, 0.030, 0.029, 0.03...
## $ advanced_degree <dbl> 0.021, 0.020, 0.020, 0.020, 0.020, 0.020, 0.02...

us_unemp %>%
  gather(level, rate, -date) %>%
  mutate(level=stri_replace_all_fixed(level, "_", " ") %>%
           stri_trans_totitle() %>%
           stri_replace_all_regex(c("Hs$"), c("High School")),
         level=factor(level, levels=unique(level))) -> unemp_by_edu

col <- ggthemes::tableau_color_pal()(10)

ggplot(unemp_by_edu, aes(date, rate, group=level)) +
  geom_line(color=col[1]) +
  scale_y_continuous(labels=scales::percent, limits =c(0, 0.2)) +
  facet_wrap(~level, scales="free") +
  labs(x=NULL, y="Unemployment rate",
       title=sprintf("U.S. Monthly Unemployment Rate by Education Level (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="XY")

us_unemp %>%
  select(date, high_school, college) %>%
  mutate(date_num=as.numeric(date)) %>%
  ggplot(aes(x=high_school, xend=college, y=date_num, yend=date_num)) +
  geom_segment(size=0.125, color=col[1]) +
  scale_x_continuous(expand=c(0,0), label=scales::percent, breaks=seq(0, 0.12, 0.02), limits=c(0, 0.125)) +
  scale_y_reverse(expand=c(0,100), label=function(x) format(lubridate::as_date(x), "%Y")) +
  labs(x="Unemployment rate", y="Year ↓",
       title=sprintf("U.S. monthly unemployment rate gap (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       subtitle="Segment width shows the gap between those with a high school\ndegree and those with a college degree",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="X") +
  theme(panel.ontop=FALSE) +
  theme(panel.grid.major.x=element_line(size=0.2, color="#2b2b2b25")) +
  theme(axis.title.x=element_text(family="Arial", face="bold")) +
  theme(axis.title.y=element_text(family="Arial", face="bold", angle=0, hjust=1, margin=margin(r=-14)))

(right edge is high school, left edge is college…I’ll annotate it better next time)

censys

Censys is a search engine by one of the cybersecurity research partners we publish data to at work (free for use by all). The API is moderately decent (it’s mostly a thin shim authentication layer to pass on Google BigQuery query strings to the back-end) and the R package to interface to it censys is now on CRAN.

waffle

The seminal square pie chart package waffle has been updated on CRAN to work better with recent ggplot2 2.x changes and has some additional parameters you may want to check out.
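If you haven’t made a waffle chart before, the core call is tiny; a minimal sketch with made-up parts (each square represents one unit of the named vector):

library(waffle)

waffle(c(`Group A`=50, `Group B`=30, `Group C`=20), rows=5)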

cdcfluview

The viral package cdcfluview has had some updates on the GitHub version to add saner behaviour when specifying dates and had to be updated as the CDC hidden API switched to all https URLs (major push in .gov-land to do that to get better scores on their cyber report cards). I’ll be adding some features before the next CRAN push to enable retrieval of additional mortality data.

sergeant

If you work with Apache Drill (if you don’t, you should), the sergeant package (GitHub) will help you whip it into shape. I’ve mentioned it before on the blog but it has a nigh-complete dplyr interface now that works pretty well. It also has a direct REST API interface and RJDBC interface plus many helper utilities that help you avoid typing SQL strings to get cluster status info. Once I add the ability to create parquet files with it I’ll push it up to CRAN.
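A rough sketch of what the dplyr interface looks like, with the caveat that the function names here are my recollection of the GitHub README at the time (treat them as assumptions); cp.`employee.json` is Drill’s bundled sample data:

library(sergeant) # devtools::install_github("hrbrmstr/sergeant")
library(dplyr)

# assumes a Drill instance is running locally (e.g. drill-embedded)
db  <- src_drill("localhost")
emp <- tbl(db, "cp.`employee.json`")   # Drill's built-in classpath sample table

count(emp, position_title) %>%   # translated to Drill SQL lazily
  collect()                      # executes the query and brings results into R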

The one thing I’d like to do with this package is support any user-defined functions (UDFs in Drill-speak) folks have written. So, if you have a UDF you’ve written or use and you want it wrapped in the package, just drop an issue and I’ll layer it in. I’ll be releasing some open source cybersecurity-related UDFs via the work github in a few weeks.

zkcmd

Drill (in non-standalone mode) relies on Apache Zookeeper to keep everything in sync and it’s sometimes necessary to peek at what’s happening inside the zookeeper cluster, so sergeant has a sister package zkcmd that provides an R interface to zookeeper instances.

ggalt

Some helpful folks tweaked ggalt for better ggplot2 2.x compatibility (#ty!) and I added a new geom_cartogram() (before you ask if it makes warped shapefiles: it doesn’t) that restores the old (and what I believe to be the correct/sane/proper) behaviour of geom_map(). I need to get this on CRAN soon as it has both fixes and many new geoms folks will want to play with in a non-GitHub context.

FIN

There have been some awesome packages released by others in the past month+ and you should add R Weekly to your RSS feeds if you aren’t following it already (there are other things you should have there for R updates as well, but that’s for another blog). I’m definitely looking forward to new packages, visualizations, services and utilities that will be coming this year to the R community.

Despite being in cybersecurity nigh forever (a career that quickly turns one into a determined skeptic if you’re doing your job correctly) I have often trusted various (not to be named) news sources, reports and data sources to provide honest and as-unbiased-as-possible information. The debacle in the U.S. in late 2016 has proven (to me) that we’re all on our own when it comes to validating posited truth/fact. I’m teaching two data science college courses this Spring and one motivation for teaching is helping others on the path to data literacy so they don’t have to just believe what’s tossed at them.

It’s also one of the reasons I write R packages and blog/speak (when I can).

Today, I saw a chart in a WSJ article on the impending minimum wage changes (paywalled, so do the Google hack to read it) set to take effect in 19 states.

It’s a great chart and they cite the data source as coming from the Economic Policy Institute. I found some minimum wage data there and manually went through a bit of it to get enough of a feel that the WSJ wasn’t “mis-truthing” or truth-spinning. (As an aside, the minimum wage should definitely be higher, indexed and adjusted for inflation, but that’s a discussion for another time.)

While on EPI’s site, I did notice they had a “State of Working America Data Library” @ http://www.epi.org/data/. The data there is based on U.S. government published data and I validated that EPI isn’t fabricating anything (but you should not just take my word for it and do your own math from U.S. gov sources). I also noticed that the filtering interactions were delayed a bit and posited said condition was due to the site making AJAX/XHR calls to retrieve data. So, I peeked under the covers and discovered a robust, consistent, hidden API that is now wrapped in an R package dubbed epidata.

You can get the details of what’s available via the EPI site or through package docs.

What can you do with this data?

They seem to update the data (pretty much) monthly and it’s based on U.S. gov publications, so you can very easily validate “news” reports, potentially even more easily than via packages that wrap the Bureau of Labor Statistics (BLS) API. On a lark, I wanted to compare the unemployment rate vs median wage over time (mostly to test the API and make a connected scatterplot). If you don’t add the level of detail to the aesthetics that I did, the code to do just that is pretty small:

library(tidyverse)
library(epidata)
library(ggrepel)

unemployment <- get_unemployment()
wages <- get_median_and_mean_wages()

glimpse(wages)
## Observations: 43
## Variables: 3
## $ date    <int> 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, ...
## $ median  <dbl> 16.53, 16.17, 16.05, 16.15, 16.07, 16.36, 16.15, 16.07...
## $ average <dbl> 19.05, 18.67, 18.64, 18.87, 18.77, 18.83, 19.06, 18.66...

glimpse(unemployment)
## Observations: 456
## Variables: 2
## $ date <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-...
## $ all  <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.059, 0.058, 0...

group_by(unemployment, date=as.integer(lubridate::year(date))) %>%
  summarise(rate=mean(all)) %>%
  left_join(select(wages, date, median), by="date") %>%
  filter(!is.na(median)) %>%
  arrange(date) -> df

cols <- ggthemes::tableau_color_pal()(3)

ggplot(df, aes(rate, median)) +
  geom_path(color=cols[1], arrow=arrow(type="closed", length=unit(10, "points"))) +
  geom_point() +
  geom_label_repel(aes(label=date),
                   alpha=c(1, rep((4/5), (nrow(df)-2)), 1),
                   size=c(5, rep(3, (nrow(df)-2)), 5),
                   color=c(cols[2],
                           rep("#2b2b2b", (nrow(df)-2)),
                           cols[3]),
                   family="Hind Medium") +
  scale_x_continuous(name="Unemployment Rate", expand=c(0,0.001), label=scales::percent) +
  scale_y_continuous(name="Median Wage", expand=c(0,0.25), label=scales::dollar) +
  labs(title="U.S. Unemployment Rate vs Median Wage Since 1978",
       subtitle="Wage data is in 2015 USD",
       caption="Source: EPI analysis of Current Population Survey Outgoing Rotation Group microdata") +
  hrbrmisc::theme_hrbrmstr(grid="XY")

You can build your own “State of Working America Data Library” dashboard with this data (with flexdashboard and/or shiny) and be a bit of a citizen journalist, keeping the actual media in check and staying more informed and engaged.
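As a starting point, here’s a minimal, hypothetical shiny sketch along those lines (a flexdashboard Rmd would be even less code); everything here is illustrative, not production dashboard material:

library(shiny)
library(epidata)
library(tidyr)
library(ggplot2)

ui <- fluidPage(
  titlePanel("State of Working America (citizen edition)"),
  plotOutput("unemp_plot")
)

server <- function(input, output, session) {
  unemp <- get_unemployment("e")   # fetched once at app startup
  output$unemp_plot <- renderPlot({
    long <- gather(unemp, level, rate, -date)
    ggplot(long, aes(date, rate, color=level)) +
      geom_line() +
      scale_y_continuous(labels=scales::percent)
  })
}

shinyApp(ui, server)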

As usual, drop issues & feature requests in the github repo. If you make some fun things with the data, drop a comment in the blog. Unless something major comes up the package should be on CRAN by the weekend.

The family got hit pretty hard with the flu right as the Christmas festivities started and we were all pretty much bed-ridden zombies up until today (2017-01-02). When in the throes of a very bad ILI it’s easy to imagine that you’re a victim of a severe outbreak, especially with ancillary data from others that they, too, either just had/have the flu or know others who do. Thankfully, I didn’t have to accept this emotional opine and could turn to the cdcfluview package to see just how this year is measuring up.

Influenza cases are cyclical, and that’s super-easy to see with a longer-view of the CDC national data:

library(cdcfluview)
library(tidyverse)
library(stringi)

flu <- get_flu_data("national", years=2010:2016)

mutate(flu, week=as.Date(sprintf("%s %s 1", YEAR, WEEK), format="%Y %U %u")) %>% 
  select(-`AGE 25-64`) %>% 
  gather(age_group, count, starts_with("AGE")) %>% 
  mutate(age_group=stri_replace_all_fixed(age_group, "AGE", "Ages")) %>% 
  mutate(age_group=factor(age_group, levels=c("Ages 0-4", "Ages 5-24", "Ages 25-49", "Ages 50-64", "Ages 65"))) %>%
  ggplot(aes(week, count, group=age_group)) +
  geom_line(aes(color=age_group)) +
  scale_y_continuous(label=scales::comma, limits=c(0,20000)) +
  facet_wrap(~age_group, scales="free") +
  labs(x=NULL, y="Count of reported ILI cases", 
       title="U.S. National ILI Case Counts by Age Group (2010:2011 flu season through 2016:2017)",
       caption="Source: CDC ILInet via CRAN cdcfluview pacakge") +
  ggthemes::scale_color_tableau(name=NULL) +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="none")

We can use the same data to zoom in on this season:

mutate(flu, week=as.Date(sprintf("%s %s 1", YEAR, WEEK), format="%Y %U %u")) %>% 
  select(-`AGE 25-64`) %>% 
  gather(age_group, count, starts_with("AGE")) %>% 
  mutate(age_group=stri_replace_all_fixed(age_group, "AGE", "Ages")) %>% 
  mutate(age_group=factor(age_group, levels=c("Ages 0-4", "Ages 5-24", "Ages 25-49", "Ages 50-64", "Ages 65"))) %>%
  filter(week >= as.Date("2016-07-01")) %>% 
  ggplot(aes(week, count, group=age_group)) +
  geom_line(aes(color=age_group)) +
  scale_y_continuous(label=scales::comma, limits=c(0,20000)) +
  facet_wrap(~age_group, scales="free") +
  labs(x=NULL, y="Count of reported ILI cases", 
       title="U.S. National ILI Case Counts by Age Group (2016:2017 flu season)",
       caption="Source: CDC ILInet via CRAN cdcfluview pacakge") +
  ggthemes::scale_color_tableau(name=NULL) +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="none")

So, things are trending up, but how severe is this year compared to others? While looking at the number/percentage of ILI cases is one way to understand severity, another is to look at the mortality rate. The cdcfluview package has a get_mortality_surveillance_data() function, but it’s region-based and I’m really only looking at national data in this post. A helpful individual pointed out a new CSV file at https://www.cdc.gov/flu/weekly/index.htm#MS which we can target programmatically and reproducibly (so we don’t have to track filename changes by hand) with:

library(rvest)

pg <- read_html("https://www.cdc.gov/flu/weekly/index.htm#MS")
html_nodes(pg, xpath=".//a[contains(@href, 'csv') and contains(@href, 'NCHS')]") %>% 
  html_attr("href") -> mort_ref
mort_url <- sprintf("https://www.cdc.gov%s", mort_ref)

df <- readr::read_csv(mort_url)

We can, then, take a look at the current “outbreak” status (when real-world mortality events exceed the model threshold):

mutate(df, week=as.Date(sprintf("%s %s 1", Year, Week), format="%Y %U %u")) %>% 
  select(week, Expected, Threshold, `Percent of Deaths Due to Pneumonia and Influenza`) %>% 
  gather(category, percent, -week) %>% 
  mutate(percent=percent/100) %>% 
  ggplot() +
  geom_line(aes(week, percent, group=category, color=category)) +
  scale_x_date(date_labels="%Y-%U") +
  scale_y_continuous(label=scales::percent) +
  ggthemes::scale_color_tableau(name=NULL) +
  labs(x=NULL, y=NULL, title="U.S. Pneumonia & Influenza Mortality",
       subtitle="Data through week ending December 10, 2016 as of December 28, 2016",
       caption="Source: National Center for Health Statistics Mortality Surveillance System") +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="bottom")

That view is for all mortality events from both influenza and pneumonia. We can look at the counts for just influenza as well:

mutate(df, week=as.Date(sprintf("%s %s 1", Year, Week), format="%Y %U %u")) %>% 
  select(week, `Influenza Deaths`) %>% 
  ggplot() +
  geom_line(aes(week, `Influenza Deaths`), color=ggthemes::tableau_color_pal()(1)) +
  scale_x_date(date_labels="%Y-%U") +
  scale_y_continuous(label=scales::comma) +
  ggthemes::scale_color_tableau(name=NULL) +
  labs(x=NULL, y=NULL, title="U.S. Influenza Mortality (count of mortality events)",
       subtitle="Data through week ending December 10, 2016 as of December 28, 2016",
       caption="Source: National Center for Health Statistics Mortality Surveillance System") +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="bottom")

It’s encouraging that the overall combined mortality rate is trending downwards and that the mortality rate for influenza is very low. Go. Science.

I’ll be adding a function to cdcfluview to retrieve this new data set a bit later in the year.

Hopefully you’ll avoid the flu and enjoy a healthy and prosperous 2017.

This is another purrr-focused post but it’s also an homage to the nascent magick package (R interface to ImageMagick) by @opencpu.

We’re starting to see/feel the impact of the increasing drought up here in southern Maine. I’ve used the data from the U.S. Drought Monitor before on the blog, but they also provide shapefiles and this seemed like a good opportunity to further demonstrate the utility of purrr and make animations directly using magick. Plus, I wanted to see the progression of the drought. Putting library() statements for purrr, magick and broom together was completely random, but I now feel compelled to find a set of functions to put into a cauldron package. But, I digress.

What does this demonstrate?

Apart from giving you an idea of the extent of the drought, working through this will help you:

  • use the quietly() function (which captures a function’s warnings and messages instead of printing them; there’s a tiny demo right after this list)
  • see another example of a formula function
  • illustrate the utility of map_df(), and
  • see how to create an animation pipeline for magick
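If quietly() is new to you, here’s a tiny standalone illustration before the full drought code (log(-1) is just a stand-in for any warning-prone call):

library(purrr)

quiet_log <- quietly(log)  # wrap log() so warnings are captured instead of printed

res <- quiet_log(-1)  # log(-1) normally warns ("NaNs produced") and returns NaN
res$result            # NaN
res$warnings          # the captured warning text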

Comments are in the code and the drought gif is at the end. I deliberately only had it loop once, so refresh the image if you want to see the progression again. Also, drop a note in the comments if anything needs more exposition. (NOTE: I was fairly bad and did virtually no file cleanup in the function, so you’ll have half a year’s shapefiles in your getwd(). Consider the cleanup an exercise for the reader :-)

library(rgdal)
library(sp)
library(albersusa) # devtools::install_github("hrbrmstr/albersusa")
library(ggplot2) # devtools::install_github("hadley/ggplot2")
library(ggthemes)
library(rgeos)

# the witch's brew
library(purrr)
library(broom)
library(magick)

#' Get a drought map shapefile and turn it into a PNG
drought_map <- function(wk) {

  # need to hush some chatty functions
  hush_tidy <- quietly(tidy)

  # some are more stubborn than others
  old_warn <- getOption("warn")
  options(warn=-1)

  week <- format(wk, "%Y%m%d")

  # get the drought shapefile only if we don't have it already
  URL <- sprintf("http://droughtmonitor.unl.edu/data/shapefiles_m/USDM_%s_M.zip", week)
  (fil <- basename(URL))
  if (!file.exists(fil)) download.file(URL, fil)
  unzip(fil)

  # read in the shapefile and reduce the polygon complexity
  dr <- readOGR(sprintf("USDM_%s.shp", week),
                sprintf("USDM_%s", week),
                verbose=FALSE,
                stringsAsFactors=FALSE)

  dr <- SpatialPolygonsDataFrame(gSimplify(dr, 0.01, TRUE), dr@data)

  # separate out each drought level into its own fortified data.frame
  map(dr$DM, ~subset(dr, DM==.)) %>%
    map(hush_tidy) %>%
    map_df("result", .id="DM") -> m

  # get a conus base map (prbly cld have done map_data("usa"), too)
  usa_composite() %>%
    subset(!(iso_3166_2 %in% c("AK", "HI"))) %>%
    hush_tidy() -> usa

  usa <- usa$result # an artifact of using quietly()

  # this is all Ushey's fault. the utility of cmd-enter to run
  # the entire ggplot2 chain (in RStudio) turns out to have a
  # greater productivity boost (i measured it) than my shortcuts for
  # gg <- gg + snippets and hand-editing the "+" bits out when
  # editing old plot constructs. I'm not giving up on gg <- gg + tho

  # Note putting the "base" layer on top since we don't really
  # want to deal with alpha levels of the drought polygons and
  # we're only plotting the outline of the us/states, not filling
  # the interior(s).

  ggplot() +
    geom_map(data=m, map=m,
             aes(long, lat, fill=DM, map_id=id),
             color="#2b2b2b", size=0.05) +
    geom_map(data=usa, map=usa, aes(long, lat, map_id=id),
             color="#2b2b2b88", fill=NA, size=0.1) +
    scale_fill_brewer("Drought Level", palette="YlOrBr") +
    coord_map("polyconic", xlim=c(-130, -65), ylim=c(25, 50)) +
    labs(x=sprintf("Week: %s", wk)) +
    theme_map() +
    theme(axis.title=element_text()) +
    theme(axis.title.x=element_text()) +
    theme(axis.title.y=element_blank()) +
    theme(legend.position="bottom") +
    theme(legend.direction="horizontal") -> gg

  options(warn=old_warn) # put things back the way they were

  outfil <- sprintf("gg-dm-%s.png", wk)
  ggsave(outfil, gg, width=8, height=5)

  outfil

}

# - create a vector of weeks (minus the current one)
# - create the individual map PNGs
# - read the individual map PNGs into a list
# - join the images together
# - create the animated gif structure
# - write the gif to a file

seq(as.Date("2016-01-05"), Sys.Date(), by="1 week") %>%
  head(-1) %>%
  map(drought_map) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(fps=2, loop=1) %>%
  image_write("drought.gif")

NOTE: an updated, comment-free version of the above code block is in this gist and uses spdplyr::filter() vs subset(), keeps downloaded files tidy in a temporary directory and includes a progress bar vs raw, ugly download.file() messages.
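If you just want the tidier download behaviour mentioned in the NOTE without going to the gist, the relevant tweak inside drought_map() is roughly this (a sketch, not the gist’s exact code):

  # download into a session-scoped temporary directory instead of getwd()
  td  <- tempdir()
  fil <- file.path(td, basename(URL))
  if (!file.exists(fil)) download.file(URL, fil, quiet=TRUE)
  unzip(fil, exdir=td)

  # then point readOGR() at the temp dir
  dr <- readOGR(file.path(td, sprintf("USDM_%s.shp", week)),
                sprintf("USDM_%s", week),
                verbose=FALSE, stringsAsFactors=FALSE)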

The @pewresearch folks have been collecting political survey data for quite a while, and I noticed the [visualization below](http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/#interactive) referenced in a [Tableau vis contest entry](https://www.interworks.com/blog/rrouse/2016/06/24/politics-viz-contest-plotting-political-polarization):

[Pew Research Center interactive: “Political Polarization and Growing Ideological Consistency”]

Those are filled [frequency polygons](http://onlinestatbook.com/2/graphing_distributions/freq_poly.html), which are super-easy to replicate in ggplot2, especially since Pew even _kind of_ made the data available via their interactive visualization (it’s available in other Pew resources, just not as compact). So, we can look at all 5 study years for both the general population and politically active respondents with `ggplot2` facets, incorporating the use of `V8`, `dplyr`, `tidyr`, `purrr` and some R spatial functions along the way.

The first code block has the “data”, data transformations and initial plot code. The “data” is really javascript blocks picked up from the `view-source:` of the interactive visualization. We use the `V8` package to get this data then bend it to our will for visuals.

library(V8)
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)  # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(rgeos)
library(sp)

ctx <- v8()
ctx$eval("
	var party_data = [
		[{
			name: 'Dem',
			data: [0.57,1.60,1.89,3.49,3.96,6.56,7.23,8.54,9.10,9.45,9.30,9.15,7.74,6.80,4.66,4.32,2.14,1.95,0.87,0.57,0.12]
		},{
			name: 'REP',
			data: [0.03,0.22,0.28,1.49,1.66,2.77,3.26,4.98,5.36,7.28,7.72,8.16,8.86,8.88,8.64,8.00,6.20,5.80,4.87,4.20,1.34]
		}],
		[{
			name: 'Dem',
			data: [1.22,2.78,3.28,5.12,6.15,7.77,8.24,9.35,9.73,9.19,8.83,8.47,5.98,5.17,3.62,2.87,1.06,0.75,0.20,0.15,0.04]
		}, {
			name: 'REP',
			data: [0.23,0.49,0.65,2.23,2.62,4.06,5.02,7.53,7.70,7.28,7.72,8.15,8.87,8.47,7.08,6.27,4.29,3.99,3.54,2.79,1.03]
		}],
		[{
			name: 'Dem',
			data: [2.07,3.57,4.21,6.74,7.95,8.41,8.58,9.07,8.98,8.46,8.47,8.49,5.39,3.62,2.11,1.98,1.00,0.55,0.17,0.17,0.00]
		}, {
			name: 'REP',
			data: [0.19,0.71,1.04,2.17,2.07,3.65,4.92,7.28,8.26,9.64,9.59,9.55,7.91,7.74,6.84,6.01,4.37,3.46,2.09,1.65,0.86]
		}],
		[{
			name: 'Dem',
			data: [2.97,4.09,4.28,6.65,7.90,8.37,8.16,8.74,8.61,8.15,7.74,7.32,4.88,4.82,2.79,2.07,0.96,0.78,0.41,0.29,0.02]
		}, {
			name: 'REP',
			data: [0.04,0.21,0.28,0.88,1.29,2.64,3.08,4.92,5.84,6.65,6.79,6.92,8.50,8.61,8.05,8.00,7.52,7.51,5.61,4.17,2.50]
		}],
		[{
			name: 'Dem',
			data: [4.81,6.04,6.57,7.67,7.84,8.09,8.24,8.91,8.60,6.92,6.69,6.47,4.22,3.85,1.97,1.69,0.66,0.49,0.14,0.10,0.03]
		}, {
			name: 'REP',
			data: [0.11,0.36,0.49,1.23,1.35,2.35,2.83,4.63,5.09,6.12,6.27,6.41,7.88,8.03,7.58,8.26,8.12,7.29,6.38,5.89,3.34]
		}],
	];

	var party_engaged_data = [
		[{
			name: 'Dem',
			data: [0.88,2.19,2.61,4.00,4.76,6.72,7.71,8.45,8.03,8.79,8.79,8.80,7.23,6.13,4.53,4.31,2.22,2.01,1.05,0.66,0.13]
		}, {
			name: 'REP',
			data: [0.00,0.09,0.09,0.95,1.21,1.67,2.24,3.22,3.70,6.24,6.43,6.62,8.01,8.42,8.97,8.48,7.45,7.68,8.64,7.37,2.53]
		}],
		[{
			name: 'Dem',
			data: [1.61,3.35,4.25,6.75,8.01,8.20,8.23,9.14,8.94,8.68,8.46,8.25,4.62,3.51,2.91,2.63,1.19,0.74,0.24,0.17,0.12]
		},{
			name: 'REP',
			data: [0.21,0.38,0.68,1.62,1.55,2.55,3.99,4.65,4.31,5.78,6.28,6.79,8.47,9.01,8.61,8.34,7.16,6.50,6.10,4.78,2.25]
		}],
		[{
			name: 'Dem',
			data: [3.09,4.89,6.22,9.40,9.65,9.20,8.99,6.48,7.36,7.67,6.95,6.22,4.53,3.79,2.19,2.02,0.74,0.07,0.27,0.27,0.00]
		}, {
			name: 'REP',
			data: [0.29,0.59,0.67,2.11,2.03,2.67,4.12,6.55,6.93,8.42,8.79,9.17,7.33,6.84,7.42,7.25,6.36,5.32,3.35,2.57,1.24]
		}],
		[{
			name: 'Dem',
			data: [6.00,5.24,5.11,7.66,9.25,8.25,8.00,8.09,8.12,7.05,6.59,6.12,4.25,4.07,2.30,1.49,0.98,0.80,0.42,0.16,0.06]
		}, {
			name: 'REP',
			data: [0.00,0.13,0.13,0.48,0.97,2.10,2.73,3.14,3.64,5.04,5.30,5.56,6.87,6.75,8.03,9.33,11.01,10.49,7.61,6.02,4.68]
		}],
		[{
			name: 'Dem',
			data: [9.53,9.68,10.35,9.33,9.34,7.59,6.67,6.41,6.60,5.21,4.84,4.47,2.90,2.61,1.37,1.14,0.73,0.59,0.30,0.28,0.06]
		}, {
			name: 'REP',
			data: [0.15,0.11,0.13,0.46,0.52,1.18,1.45,2.46,2.84,4.15,4.37,4.60,6.36,6.66,7.34,9.09,11.40,10.53,10.58,9.85,5.76]
		}],
	];
")

years <- c(1994, 1999, 2004, 2011, 2014)

# Transform the javascript data -------------------------------------------

party_data <- ctx$get("party_data")
map_df(1:length(party_data), function(i) {
  x <- party_data[[i]]
  names(x$data) <- x$name
  dat <- as.data.frame(x$data)
  bind_cols(dat, data_frame(x=-10:10, year=rep(years[i], nrow(dat))))
}) -> party_data

party_engaged_data <- ctx$get("party_engaged_data")
map_df(1:length(party_engaged_data), function(i) {
  x <- party_engaged_data[[i]]
  names(x$data) <- x$name
  dat <- as.data.frame(x$data)
  bind_cols(dat, data_frame(x=-10:10, year=rep(years[i], nrow(dat))))
}) -> party_engaged_data

# We need it in long form -------------------------------------------------

gather(party_data, party, pct, -x, -year) %>%
  mutate(party=factor(party, levels=c("REP", "Dem"))) -> party_data_long

gather(party_engaged_data, party, pct, -x, -year) %>%
  mutate(party=factor(party, levels=c("REP", "Dem"))) -> party_engaged_data_long

# Traditional frequency polygon plots -------------------------------------

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party), alpha=0.5)
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (General Population)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_engaged_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party), alpha=0.5)
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (Politically Active)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

[Resulting plots: general population and politically active versions, with alpha blending]

It provides a similar effect to the Pew & Interworks visuals using alpha transparency to blend the point of polygon intersections. But I _really_ kinda like the way both Pew & Interworks did their visualizations without alpha blending yet still highlighting the intersected areas. We can do that in R as well with a bit more work by:

  • grouping each data frame by year
  • turning each set of points (Dem & Rep) to R polygons
  • computing the intersection of those polygons
  • turning that intersection back into a data frame
  • adding this new polygon to the plots while also removing the alpha blend

Here’s what that looks like in code:

# Setup a function to do the polygon intersection -------------------------

polysect <- function(df) {

  bind_rows(data_frame(x=-10, pct=0),
            select(filter(df, party=="Dem"), x, pct),
            data_frame(x=10, pct=0)) %>%
    as.matrix() %>%
    Polygon() %>%
    list() %>%
    Polygons(1) %>%
    list() %>%
    SpatialPolygons() -> dem

  bind_rows(data_frame(x=-10, pct=0),
            select(filter(df, party=="REP"), x, pct),
            data_frame(x=10, pct=0)) %>%
    as.matrix() %>%
    Polygon() %>%
    list() %>%
    Polygons(1) %>%
    list() %>%
    SpatialPolygons() -> rep

  inter <- gIntersection(dem, rep)
  inter <- as.data.frame(inter@polygons[[1]]@Polygons[[1]]@coords)[c(-1, -25),]
  inter <- mutate(inter, year=df$year[1])
  inter

}

# Get the intersected area ------------------------------------------------

group_by(party_data_long, year) %>%
  do(polysect(.)) -> general_sect

group_by(party_engaged_data_long, year) %>%
  do(polysect(.)) -> engaged_sect


# Try the plots again -----------------------------------------------------

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party))
gg <- gg + geom_ribbon(data=general_sect, aes(x=x, ymin=0, ymax=y), color="#666979", fill="#666979")
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (General Population)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

gg <- ggplot()
gg <- gg + geom_ribbon(data=party_engaged_data_long,
                       aes(x=x, ymin=0, ymax=pct, fill=party, color=party))
gg <- gg + geom_ribbon(data=engaged_sect, aes(x=x, ymin=0, ymax=y), color="#666979", fill="#666979")
gg <- gg + scale_x_continuous(expand=c(0,0), breaks=c(-8, 0, 8),
                              labels=c("Consistently\nliberal", "Mixed", "Consistently\nconservative"))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 12))
gg <- gg + scale_color_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                              labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + guides(color="none", fill=guide_legend(override.aes=list(alpha=1)))
gg <- gg + scale_fill_manual(name=NULL, values=c(Dem="#728ea2", REP="#cf6a5d"),
                             labels=c(Dem="Democrats", REP="Republicans"))
gg <- gg + facet_wrap(~year, ncol=2, scales="free_x")
gg <- gg + labs(x=NULL, y=NULL,
                title="Political Polarization, 1994-2014 (Politically Active)",
                caption="Source: http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/")
gg <- gg + theme_hrbrmstr_an(grid="")
gg <- gg + theme(panel.margin=margin(t=30, b=30, l=30, r=30))
gg <- gg + theme(legend.position=c(0.75, 0.1))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(axis.text.y=element_blank())
gg

[Resulting plots: general population and politically active versions, with the intersection region filled instead of alpha-blended]

Without much extra effort/work we now have what I believe to be a more striking set of visuals. (And, I should probably make a `points_to_spatial_polys()` convenience function.)
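For what it’s worth, here’s a first cut at that hypothetical `points_to_spatial_polys()` helper, factored straight out of `polysect()` above (the name, arguments and defaults are mine, not from any package):

# hypothetical convenience function: turn a data.frame of vertices into a SpatialPolygons object
# (sp::Polygon() closes the ring automatically if the first/last points differ)
points_to_spatial_polys <- function(df, x_col="x", y_col="pct", id="1") {
  coords <- as.matrix(df[, c(x_col, y_col)])
  sp::SpatialPolygons(list(sp::Polygons(list(sp::Polygon(coords)), ID=id)))
}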

You’ll find the “overall” group data as well as the party median values in [the Pew HTML source code](view-source:http://www.people-press.org/2014/06/12/section-1-growing-ideological-consistency/iframe/) if you want to try to fully replicate their visualizations.