Charts & Graphs Archives

Category Archives: Charts & Graphs

Shuttering Pies With Retiring Stores

2017-04-21 – 22:32
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (7)

I caught this “gem” in the Wall Street Journal tonight:

It’s pretty hard to compare store-to-store, even though it is fairly clear which ones are going-going-gone. If we want to see the relative percentage of each store closing and also want to see how they stack up against each other, then let’s make a column of 100% bars and label total stores in each:

library(hrbrthemes)
library(tidyverse)

read.table(text='store,closing,total
"Radio Shack",550,1500
"Payless",400,2600
"Rue21",400,1100
"The Limited",250,250
"bebe",180,180
"Wet Seal",170,170
"Crocs",160,560
"JCPenny",138,1000
"American Apparel",110,110
"Kmart",109,735
"hhgregg",88,220
"Sears",41,695', sep=",", header=TRUE, stringsAsFactors=FALSE) %>% 
  as_tibble() %>% 
  mutate(remaining = total - closing,
         gone = round((closing/total) * 100)/100,
         stay = 1-gone,
         rem_lab = ifelse(remaining == 0, "", scales::comma(remaining))) %>% 
  arrange(desc(stay)) %>% 
  mutate(store=factor(store, levels=store)) -> closing_df

update_geom_font_defaults(font_rc)

ggplot(closing_df) +
  geom_segment(aes(0, store, xend=gone, yend=store, color="Closing"), size=8) +
  geom_segment(aes(gone, store, xend=gone+stay, yend=store, color="Remaining"), size=8) +
  geom_text(aes(x=0, y=store, label=closing), color="white", hjust=0, nudge_x=0.01) +
  geom_text(aes(x=1, y=store, label=rem_lab), color="white", hjust=1, nudge_x=-0.01) +
  scale_x_percent() +
  scale_color_ipsum(name=NULL) +
  labs(x=NULL, y=NULL, 
       title="Selected 2017 Store closings (estimated)",
       subtitle="Smaller specialty chains such as Bebe and American Apparel are closing their stores,\nwhile lareger chains such as J.C. Penny and Sears are scaling back their footprint.") +
  theme_ipsum_rc(grid="X") +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(legend.position=c(0.875, 1.025)) +
  theme(legend.direction="horizontal")

One might try circle packing or a treemap to show both relative store count and percentage, but I think the bigger story is the percent reduction for each retail chain. It’d be cool to see what others come up with.

hrbrthemes 0.1.0 is now on CRAN

I’m pleased to announce the inaugural release of my hrbrthemes (0.1.0) package on CRAN

The primary goal of said package is to provide opinionated typographical and other aesthetic defaults for ggplot2 charts.

Two core themes are included:

theme_ipsum() – an Arial Narrow-based theme
theme_ipsum_rc() – a Roboto Condensed-based theme.

The Roboto Condensed Google Font comes with the package along with an installer for said font (it’s an R installer, you still need to install it on your OS manually).

Other niceties include:

scale_[xy]_comma() — shortcut for expand=c(0,0), labels=scales::comma
scale_[xy]_percent() — shortcut for expand=c(0,0), labels=scales::percent
scale_[color|fill]_ipsum() — discrete scale with 9 colors
gg_check() — pass-through spell checker for ggplot2 label elements

Source version is tracked on GitHub.

Critiques, bug reports and enhancement requests are most welcome as GitHub issues.

Making “Time Rivers” in R

2016-06-28 – 17:18
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (5)

Once again, @albertocairo notices an interesting chart and spurs pondering in the visualization community with [his post](http://www.thefunctionalart.com/2016/06/defying-conventions-in-visualization.html) covering an unusual “vertical time series” chart produced for the print version of the NYTimes:

I’m actually less concerned about the vertical time series chart component here since I agree with TAVE^* Cairo that folks are smart enough to grok it and that it will be a standard convention soon enough given the prevalence of our collective tiny, glowing rectangles. The Times folks plotted Martin-Quinn (M-Q) scores for the U.S. Supreme Court justices which are estimates of how liberal or conservative a justice was in a particular term. Since they are estimates they aren’t exact and while it’s fine to plot the mean value (as suggested by the M-Q folks), if we’re going to accept the intelligence of the reader to figure out the nouveau time series layout, perhaps we can also show them some of the uncertainty behind these estimates.

What I’ve done below is take the data provided by the M-Q folks and make what I’ll call a vertical time series river plot using the mean, median and one standard deviation. This shows the possible range of real values the estimates can take and provides a less-precise but more forthright view of the values (in my opinion). You can see right away that they estimates are not so precise, but there is still an overall trend for the justices to become more liberal in modern times.

The ggplot2 code is a bit intricate, which is one reason I’m posting it. You need to reorient your labeling mind due to the need to use `coord_flip()`. I also added an arrow on the Y-axis to show how time flows. I think the vis community will need to help standardize on some good practices for how to deal with these vertical time series charts to help orient readers more quickly. In a more dynamic visualization, either using something like D3 or even just stop-motion animation, the flow could actually draw in the direction time flows, which would definitely make it easier immediately orient the reader.

However, the main point here is to not be afraid to show uncertainty. In fact, the more we all work at it, the better we’ll all be able to come up with effective ways to show it.

^* == “The Awesome Visualization Expert” since he winced at my use of “Dr. Cairo” :-)

library(dplyr)
library(readr)
library(ggplot2)  # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(grid)
library(scales)

URL <- "http://mqscores.berkeley.edu/media/2014/justices.csv"
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)

justices <- read_csv(fil)

justices %>%
  filter(term>=1980,
         justiceName %in% c("Thomas", "Scalia", "Alito", "Roberts", "Kennedy",
                            "Breyer", "Kagan", "Ginsburg", "Sotomayor")) %>%
  mutate(col=ifelse(justiceName %in% c("Breyer", "Kagan", "Ginsburg", "Sotomayor"),
                    "Democrat", "Republican")) -> recent

just_labs <- data_frame(
  label=c("Thomas", "Scalia", "Alito", "Roberts", "Kennedy", "Breyer", "Kagan", "Ginsburg", "Sotomayor"),
      x=c(  1990.5,   1985.5,  2004.5,    2004.5,    1986.5,      1994,   2010,     1992.5,      2008.5),
      y=c(     2.9,      1.4,    1.35,       1.7,       1.0,      -0.1,   -0.9,       -0.1,          -2)
)

gg <- ggplot(recent)
gg <- gg + geom_hline(yintercept=0, alpha=0.5)
gg <- gg + geom_label(data=data.frame(x=c(0.1, -0.1),
                                      label=c("More →\nconservative", "← More\nliberal"),
                                      hjust=c(0, 1)), aes(y=x, x=1982, hjust=hjust, label=label),
                      family="Arial Narrow", fontface="bold", size=4, label.size=0, vjust=1)
gg <- gg + geom_ribbon(aes(ymin=post_mn-post_sd, ymax=post_mn+post_sd, x=term,
                             group=justice, fill=col, color=col), size=0.1, alpha=0.3)
gg <- gg + geom_line(aes(x=term, y=post_med, color=col, group=justice), size=0.1)
gg <- gg + geom_text(data=just_labs, aes(x=x, y=y, label=label),
                     family="Arial Narrow", size=2.5)
gg <- gg + scale_x_reverse(expand=c(0,0), limits=c(2014, 1982),
                           breaks=c(2014, seq(2010, 1990, -10), 1985, 1982),
                           labels=c(2014, seq(2010, 1990, -10), "1985\nTERM\n↓", ""))
gg <- gg + scale_y_continuous(expand=c(0,0), labels=c(-2, "0\nM-Q Score", 2, 4))
gg <- gg + scale_color_manual(name=NULL, values=c(Democrat="#2166ac", Republican="#b2182b"), guide=FALSE)
gg <- gg + scale_fill_manual(name="Nominated by a", values=c(Democrat="#2166ac", Republican="#b2182b"))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL,
                title="Martin-Quinn scores for selected justices, 1985-2014",
                subtitle="Ribbon band derived from mean plus one standard deviation. Inner line is the M-Q median.",
                caption="Data source: http://mqscores.berkeley.edu/measures.php")
gg <- gg + theme_hrbrmstr_an(grid="XY")
gg <- gg + theme(plot.subtitle=element_text(margin=margin(b=15)))
gg <- gg + theme(legend.title=element_text(face="bold"))
gg <- gg + theme(legend.position=c(0.05, 0.6))
gg <- gg + theme(plot.margin=margin(20,20,20,20))
gg

Yes, I manually positioned the names of the justices, hence the weird spacing for those lines. Also, after publishing this post, I tweaked the line-height of the “More Liberal”/”More Conservative” top labels a bit and would definitely suggest doing that to anyone attempting to reproduce this code (the setting I used was `0.9`).

A Call to Arms[list] Data Analysis!

2016-06-19 – 20:53
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (3)

The NPR vis team contributed to a recent [story](http://n.pr/1USSliN) about Armslist, a “craigslist for guns”. Now, I’m neither pro-“gun” or anti-“gun” since this subject, like most heated ones, has more than two sides. What I _am_ is pro-*data*, and the U.S. Congress is so [deep in the pockets of the NRA](http://abcnews.go.com/Health/cdc-launched-comprehensive-gun-study-15-years/story?id=39873289) that there’s no way for there to be any Federally-supported, data-driven research on gun injuries/deaths. Thankfully, California is going to [start funding research](http://www.wired.com/2016/06/congress-refuses-california-funds-gun-violence-research-center/), so we may see some evidence-based papers in the (hopefully) not-too-distant future.

When I read the NPR story I couldn’t believe it was easier to get a gun than it is get [pick your vice or other bit of dangerous contraband]. The team at NPR ended up [scraping the Armslist site](http://blog.apps.npr.org/2016/06/17/scraping-tips.html) and provided a [CSV of the data](http://apps.npr.org/armslist-analysis/armslist-listings-2016-06-16.csv). Their own blog post admirably started off with a “Can you scrape?” section. This is an area I see so many python, R and other folks totally ignore since they seem to feel that just because you _can_ do something also gives you license to do so.

I’m glad the NPR team provided the CSV of their results since I suspect that Armslist will be adding some “no scraping” language to their Terms of Service. Interestingly enough, the Armslist site owners spend a great deal of verbiage across their site indemnifying themselves (that’s how proud of their service they are).

Since they provided the CSV, I poked at it a bit and produced some alternate views of the data. One bit of info I was interested in is how much the ask price was for the firearms. Since this is a craigslist-like site, some of the prices are missing and others are obviously either “filler” like `12345678` or are legitimately large (i.e. the price for a rare antique). Given the huge right-skew, I limited the initial view to “affordable” ones (which I defined as between $0.00 & $2,500 USD and if you look at the data yourself you’ll see why). I then computed the bandwidth for the density estimate and did some other basic maths to see what price range of the offers made up at least 50% of the overall listings. I probably should have excluded the $1 offers but the data is there for you to use to augment anything I’ve done here.

Most of these firearms are quite affordable (even if you ignore the $1.00 USD offers).

One other view I wanted to see was that of the listings-per-day.

Info from the NPR vis team suggests this is not a 100% accurate view since the listings “age out” and they did a point-in-time scrape. It would be interesting to start a daily scraper for this site or ask to work with the raw data from the site itself (but it’s unlikely Armslist would have the courage to make said data available to news organizations or researchers). Also, the value for the last segment-bar does not appear to be from a fully day’s scrape. Nothing says ‘Murica like selling guns in a sketchy way for Memorial Day.

Finally, I wanted a different view of the ranked states.

(The `ggplot2` code for this one is kinda interesting for any R folk who are curious). This segment-bar chart is a bit of an eye strain (click on it to make it larger) but the main thing I wanted to see was if Ohio was as gun-nutty for the three less-than-valid (IMO) types of firearms sales (which is a different view than automatic vs semi-automatic). Sure enough, Ohio leads the pack (in other news, the same states are in the top 5 across these three categories).

“Spinnable” R code for these charts is below, so go forth and see if you can tease out any other views from the data. There is a free-text listing description field which may be interesting to mine, and the R code has sorted lists by manufacturer and caliber you can view if you run/spin it. It might be interesting to get data [like this for Ohio](https://en.wikipedia.org/wiki/Gun_laws_in_Ohio) for other states and do some clustering based on the legal categories outlined in the table.

#' ---
#' output:
#'   html_document:
#'     keep_md: true
#' ---

#+ message=FALSE, echo=FALSE, warning=FALSE
library(dplyr)
library(readr)
library(ggalt)
library(hrbrmisc)
library(KernSmooth)
library(scales)
library(stringi)
library(extrafont)
library(DT)

loadfonts()

arms <- read_csv("armslist-listings-2016-06-16.csv")
arms <- mutate(arms,
               price=ifelse(price=="FREE", 0, price),
               price=ifelse(price=="Offer", NA, price),
               price=make_numeric(price))
arms <- mutate(arms,
               listed_date=gsub("^.*y, ", "", listed_date),
               listed_date=as.Date(listed_date, "%B %d, %Y"))

affordable <- filter(arms, price>0 & price<2500)

bw <- dpik(affordable$price, scalest="stdev")

a_dens <- bkde(affordable$price, bandwidth=bw,
               range.x=range(affordable$price),
               truncate=TRUE)

peaks <- data_frame(
  pk=which(diff(sign(diff(c(0, a_dens$y)))) == -2),
  x=a_dens$x[pk],
  y=a_dens$y[pk]
)

ann <- sprintf('%s (%s of all listings) firearms are\noffered between $1 & $600 USD',
               comma(nrow(filter(affordable, between(price, 1, 600)))),
               percent(nrow(filter(affordable, between(price, 1, 600)))/nrow(arms)))

grps <- setNames(1:6, unique(arms$category))

ggplot() +
  geom_segment(data=cbind.data.frame(a_dens), aes(x, xend=x, 0, yend=y),
               color="#2b2b2b", size=0.15) +
  geom_vline(data=peaks[c(1,8),], aes(xintercept=x), size=0.5,
             linetype="dotted", color="#b2182b") +
  geom_label(data=peaks[c(1,8),], label.size=0,
            aes(x, y, label=dollar(floor(x)), hjust=c(0, 0)),
            nudge_x=c(10, 10), vjust=0, size=3,
            family="Arial Narrow") +
  geom_label(data=data.frame(), hjust=0, label.size=0, size=3,
             aes(label=ann, x=800, y=min(a_dens$y) + sum(range(a_dens$y))*0.7),
             family="Arial Narrow") +
  scale_x_continuous(expand=c(0,0), breaks=seq(0, 2500, 500), label=dollar, limits=c(0, 2500)) +
  scale_y_continuous(expand=c(0,0), limits=c(0, max(a_dens$y*1.05))) +
  labs(x=NULL, y="density",
       title="Distribution of firearm ask prices on Armslist",
       subtitle=sprintf("Counts are across all firearm types (%s)",
                        stri_replace_last_regex(paste0(names(grps), collapse=", "), ",", " &")),
       caption="Source: NPR http://n.pr/1USSliN") +
  theme_hrbrmstr_an(grid="X=Y", subtitle_size=10) +
  theme(axis.text.x=element_text(hjust=c(0, rep(0.5, 4), 1))) +
  theme(axis.text.y=element_blank()) +
  theme(plot.margin=margin(10,10,10,10)) -> gg

#+ ask-prices, dev="png", fig.width=8, fig.height=4, fig.retina=2, message=FALSE, echo=FALSE, warning=FALSE
gg

count(arms, state, category) %>%
  group_by(category) %>%
  mutate(f=paste0(paste0(rep(" ", grps[category[1]]), collapse=""), state, collaspe="")) %>%
  ungroup() %>%
  arrange(desc(n)) %>%
  mutate(f=factor(f, levels=rev(f))) %>%
  filter(category %in% c("Handguns", "Rifles", "Shotguns")) %>%
  ggplot(aes(x=n, y=f)) +
  geom_segment(aes(yend=f, xend=0), size=0.5) +
  scale_x_continuous(expand=c(0,0), label=comma) +
  facet_wrap(~category, scales="free") +
  labs(x="Note: free x-axis scale", y=NULL,
       title="Distribution of firearm listing by state",
       subtitle="Listings of Antique Firearms, Muzzle Loaders & NFA Firearms are not included in this view",
       caption="Source: NPR http://n.pr/1USSliN") +
  theme_hrbrmstr_an(grid="X", subtitle_size=10) +
  theme(axis.text.y=element_text(size=6)) -> gg

#+ by-state, dev="png", fig.width=8, fig.height=6, fig.retina=2, message=FALSE, echo=FALSE, warning=FALSE
gg

count(arms, listed_date) %>%
  ggplot(aes(listed_date, n)) +
  geom_segment(aes(xend=listed_date, yend=0)) +
  geom_vline(xintercept=c(as.numeric(c(as.Date("2016-05-26"),
                                       as.Date("2016-05-28"),
                                       as.Date("2016-06-02")))), color="#b2182b", size=0.5, linetype="dotted") +
  geom_label(data=data.frame(), hjust=1, vjust=1, nudge_x=-0.5, label.size=0, size=3,
             aes(x=as.Date("2016-05-26"), y=1800, label="NYT & CNN Gun Editorials"),
             family="Arial Narrow", color="#b2182b") +
  geom_label(data=data.frame(), hjust=1, vjust=1, nudge_x=-0.5, label.size=0, size=3,
             aes(x=as.Date("2016-05-28"), y=8500, label="Memorial Day"),
             family="Arial Narrow", color="#b2182b") +
  geom_label(data=data.frame(), hjust=0, vjust=1, nudge_x=0.5,
             label.size=0, size=3, lineheight=0.9,
             aes(x=as.Date("2016-06-02"), y=7000,
                 label="National Gun\nViolence\nAwareness Day"),
             family="Arial Narrow", color="#b2182b") +
  scale_x_date(expand=c(0,1), label=date_format("%B 2016")) +
  scale_y_continuous(expand=c(0,0), label=comma, limit=c(0, 9000)) +
  labs(x=NULL, y=NULL,
       title="Armslist firearm new listings per day",
       subtitle="Period range: March 16, 2016 to June 16, 2016",
       caption="Source: NPR http://n.pr/1USSliN") +
  theme_hrbrmstr_an(grid="XY") +
  theme(plot.margin=margin(10,10,10,10)) -> gg

#+ per-day, dev="png", fig.width=8, fig.height=5, fig.retina=2, message=FALSE, echo=FALSE, warning=FALSE
gg

count(arms, manufacturer) %>%
  filter(!is.na(manufacturer)) %>%
  arrange(desc(n)) %>%
  select(Manufacturer=manufacturer, Count=n) %>%
  datatable() %>%
  formatCurrency(columns="Count", currency="")

count(arms, caliber) %>%
  filter(!is.na(caliber)) %>%
  arrange(desc(n)) %>%
  select(Caliber=caliber, Count=n) %>%
  datatable() %>%
  formatCurrency(columns="Count", currency="")

Global Temperature Change in R & D3 (without the vertigo)

2016-05-14 – 20:49
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, HTML5, Javascript, R
Tagged d3, post
Comments (2)

This made the rounds on social media last week:

Spiraling global temperatures from 1850-2016 (full animation) https://t.co/YETC5HkmTr pic.twitter.com/Ypci717AHq

— Ed Hawkins (@ed_hawkins) May 9, 2016

One of the original versions was static and was not nearly as popular, but—as you can see—this one went viral.

Despite the public’s infatuation with circles (I’m lookin’ at you, pie charts), I’m not going to reproduce this polar coordinate visualization in ggplot2. I believe others have already done so (or are doing so) and you can mimic the animation pretty easily with `coord_polar()` and @drob’s enhanced ggplot2 animation tools.

NOTE: If you’re more interested in the stats/science than a spirograph or colorful D3 animation (below), Gavin Simpson (@ucfagls) has an [awesome post](http://www.fromthebottomoftheheap.net/2016/03/25/additive-modeling-global-temperature-series-revisited/) with a detailed view of the HadCRUT data set.

## HadCRUT in R

I noticed that [the original data source](http://www.metoffice.gov.uk/hadobs/hadcrut4/), had 12 fields, two of which (columns 11 & 12) are the lower+upper bounds of the 95% confidence interval of the combined effects of all the uncertainties described in the HadCRUT4 error model (measurement and sampling, bias and coverage uncertainties). The spinning vis of doom may be mesmerizing, but it only shows the median. I thought it might be fun to try to make a good looking visualization using the CI as well (you can pick one of the other pairs to try this at home), both in R and then in D3. I chose D3 for the animated version mostly to play with the new 4.0 main branch, but I think it’s possible to do more with dynamic visualizations in D3 than it is with R (and it doesn’t require stop-motion techniques).

The following code:

– reads in the data set (and saves it locally to be nice to their bandwidth bill)
– does some munging to get fields we need
– saves a version out for use with D3
– uses `geom_segment()` + `geom_point()` to do the heavy lifting
– colors the segments by year using the `viridis` palette (the Plasma version)
– labels the plot by decade using facets and some fun facet margin “tricks” to make it look like the x-axis labels are on top

library(readr)    # read_table() / write_csv()
library(dplyr)
library(zoo)      # as.yearmon()
library(ggplot2)  # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(viridis)

URL <- "http://www.metoffice.gov.uk/hadobs/hadcrut4/data/current/time_series/HadCRUT.4.4.0.0.monthly_ns_avg.txt"
fil <- sprintf("data/%s", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

global_temps <- read_table(fil, col_names=FALSE)

global_temps %>%
  select(year_mon=1, median=2, lower=11, upper=12) %>%
  mutate(year_mon=as.Date(as.yearmon(year_mon, format="%Y/%m")),
         year=as.numeric(format(year_mon, "%Y")),
         decade=(year %/% 10) * 10,
         month=format(year_mon, "%b")) %>%
  mutate(month=factor(month, levels=month.abb)) %>%
  filter(year != 2016) -> global_temps

# for D3 vis
write_csv(global_temps, "data/temps.csv")

#+ hadcrut, fig.retina=2, fig.width=12, fig.height=6
gg <- ggplot(global_temps)
gg <- gg + geom_segment(aes(x=year_mon, xend=year_mon, y=lower, yend=upper, color=year), size=0.2)
gg <- gg + geom_point(aes(x=year_mon, y=median), color="white", shape=".", size=0.01)
gg <- gg + scale_x_date(name="Median in white", expand=c(0,0.5))
gg <- gg + scale_y_continuous(name=NULL, breaks=c(0, 1.5, 2),
                              labels=c("0°C", "1.5°C", "2.0°C"), limits=c(-1.6, 2.25))
gg <- gg + scale_color_viridis(option="C")
gg <- gg + facet_wrap(~decade, nrow=1, scales="free_x")
gg <- gg + labs(title="Global Temperature Change (1850-2016)",
                subtitle="Using lower and upper bounds of the 95% confidence interval of the combined effects of all the uncertainties described in the HadCRUT4 error model (measurement and sampling, bias and coverage uncertainties; fields 11 & 12)",
                caption="HadCRUT4 (http://www.metoffice.gov.uk/hadobs/hadcrut4/index.html)")
gg <- gg + theme_hrbrmstr_my(grid="XY")
gg <- gg + theme(panel.background=element_rect(fill="black", color="#2b2b2b", size=0.15))
gg <- gg + theme(panel.margin=margin(0,0,0,0))
gg <- gg + theme(panel.grid.major.y=element_line(color="#b2182b", size=0.25))
gg <- gg + theme(strip.text=element_text(hjust=0.5))
gg <- gg + theme(axis.title.x=element_text(hjust=0, margin=margin(t=-10)))
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(size=12, color="#b2182b"))
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin=margin(10, 10, 10, 10))
gg <- gg + theme(plot.caption=element_text(margin=margin(t=-6)))
gg

(Click image for larger version)

My `theme_hrbrmstr_my()` required the Myriad Pro font, so you’ll need to use one of the other themes in the `hrbrmisc` package or fill in some `theme()` details on your own.

## HadCRUT in D3

While the static visualization is pretty, we can kick it up a bit with some basic animations. Rather than make a multi-file HTML+js+D3+CSS example, this is all self-contained (apart from the data) in a single `index.html` file (some folks asked for the next D3 example to be self-contained).

Some nice new features of D3 4.0 (that I ended up using here):

– easier to use `scale`s
– less verbose `axis` creation
– `viridis` is now a first-class citizen

Mike Bostock has spent much time refining the API for [D3 4.0](https://github.com/d3/d3) and it shows. I’m definitely looking forward to playing with it over the rest of the year.

The vis is below but you can bust the `iframe` via [https://rud.is/projects/hadcrut4/](https://rud.is/projects/hadcrut4/).

I have it setup as “click to view” out of laziness. It’s not hard to make it trigger on `div` scroll visibility, but this way you also get to repeat the visualization animation without it looping incessantly.

If you end up playing with the D3 code, definitely change the width. I had to make it a bit smaller to fit it into the blog theme.

## Fin

You can find the source for both the R & D3 visualizations [on github](https://github.com/hrbrmstr/hadcrut).

(ggplot2) Exercising with (ggalt) dumbbells

2016-04-17 – 10:43
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (24)

I follow the most excellent Pew Research folks on Twitter to stay in tune with what’s happening (statistically speaking) with the world. Today, they tweeted this excerpt from their 2015 Global Attitudes survey:

The age gap in social media use around the world https://t.co/0Dq1PcbExG pic.twitter.com/9HBM7gLxwR

— PewResearch Internet (@pewinternet) April 17, 2016

I thought it might be helpful to folks if I made a highly aesthetically tuned version of Pew’s chart (though I chose to go a bit more minimal in terms of styling than they did) with the new geom_dumbbell() in the development version of ggalt. The source (below) is annotated, but please drop a note in the comments if any of the code would benefit from more exposition.

I’ve also switched to using the Prism javascript library starting with this post after seeing how well it works in RStudio’s flexdashboard package. If the “light on black” is hard to read or distracting, drop a note here and I’ll switch the theme if enough folks are having issues.

library(ggplot2) # devtools::install_github("hadley/ggplot2")
library(ggalt)   # devtools::install_github("hrbrmstr/ggalt")
library(dplyr)   # for data_frame() & arrange()

# I'm not crazy enough to input all the data; this will have to do for the example
df <- data_frame(country=c("Germany", "France", "Vietnam", "Japan", "Poland", "Lebanon",
                           "Australia", "South\nKorea", "Canada", "Spain", "Italy", "Peru",
                           "U.S.", "UK", "Mexico", "Chile", "China", "India"),
                 ages_35=c(0.39, 0.42, 0.49, 0.43, 0.51, 0.57,
                           0.60, 0.45, 0.65, 0.57, 0.57, 0.65,
                           0.63, 0.59, 0.67, 0.75, 0.52, 0.48),
                 ages_18_to_34=c(0.81, 0.83, 0.86, 0.78, 0.86, 0.90,
                                 0.91, 0.75, 0.93, 0.85, 0.83, 0.91,
                                 0.89, 0.84, 0.90, 0.96, 0.73, 0.69),
                 diff=sprintf("+%d", as.integer((ages_18_to_34-ages_35)*100)))

# we want to keep the order in the plot, so we use a factor for country
df <- arrange(df, desc(diff))
df$country <- factor(df$country, levels=rev(df$country))

# we only want the first line values with "%" symbols (to avoid chart junk)
# quick hack; there is a more efficient way to do this
percent_first <- function(x) {
  x <- sprintf("%d%%", round(x*100))
  x[2:length(x)] <- sub("%$", "", x[2:length(x)])
  x
}

gg <- ggplot()
# doing this vs y axis major grid line
gg <- gg + geom_segment(data=df, aes(y=country, yend=country, x=0, xend=1), color="#b2b2b2", size=0.15)
# dum…dum…dum!bell
gg <- gg + geom_dumbbell(data=df, aes(y=country, x=ages_35, xend=ages_18_to_34),
                         size=1.5, color="#b2b2b2", point.size.l=3, point.size.r=3,
                         point.colour.l="#9fb059", point.colour.r="#edae52")
# text below points
gg <- gg + geom_text(data=filter(df, country=="Germany"),
                     aes(x=ages_35, y=country, label="Ages 35+"),
                     color="#9fb059", size=3, vjust=-2, fontface="bold", family="Calibri")
gg <- gg + geom_text(data=filter(df, country=="Germany"),
                     aes(x=ages_18_to_34, y=country, label="Ages 18-34"),
                     color="#edae52", size=3, vjust=-2, fontface="bold", family="Calibri")
# text above points
gg <- gg + geom_text(data=df, aes(x=ages_35, y=country, label=percent_first(ages_35)),
                     color="#9fb059", size=2.75, vjust=2.5, family="Calibri")
gg <- gg + geom_text(data=df, color="#edae52", size=2.75, vjust=2.5, family="Calibri",
                     aes(x=ages_18_to_34, y=country, label=percent_first(ages_18_to_34)))
# difference column
gg <- gg + geom_rect(data=df, aes(xmin=1.05, xmax=1.175, ymin=-Inf, ymax=Inf), fill="#efefe3")
gg <- gg + geom_text(data=df, aes(label=diff, y=country, x=1.1125), fontface="bold", size=3, family="Calibri")
gg <- gg + geom_text(data=filter(df, country=="Germany"), aes(x=1.1125, y=country, label="DIFF"),
                     color="#7a7d7e", size=3.1, vjust=-2, fontface="bold", family="Calibri")
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0, 1.175))
gg <- gg + scale_y_discrete(expand=c(0.075,0))
gg <- gg + labs(x=NULL, y=NULL, title="The social media age gap",
                subtitle="Adult internet users or reported smartphone owners who\nuse social networking sites",
                caption="Source: Pew Research Center, Spring 2015 Global Attitudes Survey. Q74")
gg <- gg + theme_bw(base_family="Calibri")
gg <- gg + theme(panel.grid.major=element_blank())
gg <- gg + theme(panel.grid.minor=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(plot.title=element_text(face="bold"))
gg <- gg + theme(plot.subtitle=element_text(face="italic", size=9, margin=margin(b=12)))
gg <- gg + theme(plot.caption=element_text(size=7, margin=margin(t=12), color="#7a7d7e"))
gg

52Vis Week #3 – Waste Not, Want Not

2016-04-13 – 23:35
Posted in 52vis, Charts & Graphs, DataVis, DataViz, ggplot, maps, R
Tagged post
Comments (9)

The Wall Street Journal did a project piece [a while back](http://projects.wsj.com/waste-lands/) in the _”Waste Lands: America’s Forgotten Nuclear Legacy”_. They dug through [Department of Energy](http://www.lm.doe.gov/default.aspx?id=2602) and [CDC](http://www.cdc.gov/niosh/ocas/ocasawe.html) data to provide an overview of the lingering residue of this toxic time in America’s past (somehow, I have to believe the fracking phenomena of our modern era will end up doing far more damage in the long run).

Being a somewhat interactive piece, I was able to tease out the data source behind it for this week’s challenge. I’m, once again, removing the obvious vis and re-creating a non-interactive version of the WSJ’s main map view (with some additional details).

There’s definitely a story or two here, but I felt that the overall message fell a bit flat the way the WSJ folks told it. Can you find an angle or question that tells a tale in a more compelling fashion? I added some hints in the code snippet below (and in the repo) as to how you might find additional details for each toxic site (and said details are super-scrape-able with `rvest`). I also noticed some additional external data sets that could be brought in (but I’ll leave that detective work to our contestants).

If you’re up to the task, fork [this week’s repo](https://github.com/52vis/2016-15), create a subdirectory for your submission and shoot a PR my way (notifying folks here in the comments about your submission is also encouraged).

Entries accepted up until 2016-04-20 23:59:59 EDT.

Hadley has volunteered a signed book and I think I’ll take him up on the offer for this week’s prize (unless you really want a copy of Data-Driven Security :-).

One last note: I’ve secured `52vis.com` and will be getting that configured for next week’s contest. It’ll be a nice showcase site for all the submissions.

library(albersusa) # devtools::install_github("hrbrmstr/hrbrmisc")
library(rgeos)
library(maptools)
library(ggplot2)
library(ggalt)
library(ggthemes)
library(jsonlite)
library(tidyr)
library(dplyr)
library(purrr)
 
#' WSJ Waste Lands: http://projects.wsj.com/waste-lands/
#' Data from: http://www.lm.doe.gov/default.aspx?id=2602 &
#'            http://www.cdc.gov/niosh/ocas/ocasawe.html
 
sites <- fromJSON("sites.json", flatten=TRUE)
 
#' need to replace the 0-length data.frames with at least one row of `NA`s
#' so we can safely `unnest()` them later
 
sites$locations <- map(sites$locations, function(x) {
  if (nrow(x) == 0) {
    data_frame(latitude=NA, longitude=NA, postal_code=NA, name=NA, street_address=NA)
  } else {
    x
  }
})
 
#' we'll need this later
 
sites$site.rating <- factor(sites$site.rating,
                           levels=c(3:0),
                           labels=c("Remote or no potential for radioactive contamination, based on criteria at the time of FUSRAP review.",
                                    "Referred to another agency or program, no authority to clean up under FUSRAP, or status unclear.",
                                    "Cleanup declared complete under FUSRAP.",
                                    "Cleanup in progress under the Formerly Utilized Sites Remedial Action Program (FUSRAP)."))
 
#' One teensy discrepancy:
 
nrow(sites)
 
#' ## [1] 517
#'
#' The stacked bars total on the WSJ site is 515.
#' Further complication is that sites$locations is a list column with nested
#' data.frames:
 
sites <- unnest(sites)
 
nrow(sites)
 
#' ## [1] 549
 
sum(complete.cases(sites[,c("longitude", "latitude")]))
 
#' ## [1] 352
#'
#' So, just mapping long/lat is going to miss part of the story. But, I'm just
#' providing a kick-start for folks, so I'll just map long/lat :-)
 
glimpse(sites)
 
#' ## Observations: 549
#' ## Variables: 11
#' ## $ site.city      (chr) "Flint", "Albuquerque", "Buffalo", "Los...
#' ## $ site.name      (chr) "AC Spark Plug, Dort Highway Plant", "A...
#' ## $ site.rating    (fctr) Remote or no potential for radioactive...
#' ## $ site.state     (chr) "MI", "NM", "NY", "NM", "PA", "NY", "OH...
#' ## $ site.state_ap  (chr) "Mich.", "N.M.", "N.Y.", "N.M.", "Pa.",...
#' ## $ site.slug      (chr) "1-ac-spark-plug-dort-highway-plant", "...
#' ## $ latitude       (dbl) 43.02938, NA, NA, 35.88883, 39.95295, 4...
#' ## $ longitude      (dbl) -83.65525, NA, NA, -106.30502, -75.5927...
#' ## $ postal_code    (chr) "48506", NA, NA, "87544", "19382", "100...
#' ## $ name           (chr) "", NA, NA, "", "", "", "Former Buildin...
#' ## $ street_address (chr) "1300 North Dort Highway", NA, NA, "Pue...
 
#' Note that `site.slug` can be used with this URL:
#' `http://projects.wsj.com/waste-lands/site/SITE.SLUG.HERE/` to get to
#' detail pages on the WSJ site.
 
#' I want to use my `albersusa` mutated U.S. shapefile for this (NOTE: I'm moving
#' `albersus` into one of the rOpenSci pacakges soon vs publishing it standalone to CRAN)
#' so I need to mutate the Alaska points (there are no Hawaii points).
#' This step is *not necessary* unless you plan on displaying points on this
#' mutated map. I also realized I need to provide a mutated projection translation
#' function for AK & HI for the mutated Albers mapss.
 
tmp  <- data.frame(dplyr::select(filter(sites, site.state=="AK"), longitude, latitude))
coordinates(tmp) <- ~longitude+latitude
proj4string(tmp) <- CRS(us_longlat_proj)
tmp <- spTransform(tmp, CRS(us_laea_proj))
tmp <- elide(tmp, rotate=-50)
tmp <- elide(tmp, scale=max(apply(bbox(tmp), 1, diff)) / 2.3)
tmp <- elide(tmp, shift=c(-2100000, -2500000))
proj4string(tmp) <- CRS(us_laea_proj)
tmp <- spTransform(tmp, us_longlat_proj)
tmp <- as.data.frame(tmp)
 
sites[sites$site.state=="AK",]$longitude <- tmp$x
sites[sites$site.state=="AK",]$latitude <- tmp$y
 
#' and now we plot the sites
 
us_map <- fortify(usa_composite(), region="name")
 
gg <- ggplot()
gg <- gg + geom_map(data=us_map, map=us_map,
                    aes(x=long, y=lat, map_id=id),
                    color="#2b2b2b", size=0.15, fill="#e5e3df")
gg <- gg + geom_point(dat=sites, aes(x=longitude, y=latitude, fill=site.rating),
                      shape=21, color="white", stroke=1, alpha=1, size=3)
gg <- gg + scale_fill_manual(name="", values=c("#00a0b0", "#edc951", "#6a4a3c", "#eb6841"))
gg <- gg + coord_proj(us_laea_proj)
gg <- gg + guides(fill=guide_legend(override.aes=list(alpha=1, stroke=0.2, color="#2b2b2b", size=4)))
gg <- gg + labs(title="Waste Lands: America's Forgotten Nuclear Legacy",
                 caption="Data from the WSJ")
gg <- gg + theme_map()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(legend.direction="vertical")
gg <- gg + theme(legend.key=element_blank())
gg <- gg + theme(plot.title=element_text(size=18, face="bold", hjust=0.5))
gg

Beating lollipops into dumbbells

2016-04-12 – 19:56
Posted in Charts & Graphs, Data Visualization, DataVis, DataViz, ggplot, R
Tagged post
Comments (10)

Shortly after I added lollipop charts to ggalt I had a few requests for a dumbbell geom. It wasn’t difficult to do modify the underlying lollipop Geoms to make a geom_dumbbell(). Here it is in action:

library(ggplot2)
library(ggalt) # devtools::install_github("hrbrmstr/ggalt")
library(dplyr)

# from: https://plot.ly/r/dumbbell-plots/
URL <- "https://raw.githubusercontent.com/plotly/datasets/master/school_earnings.csv"
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)

df <- read.csv(fil, stringsAsFactors=FALSE)
df <- arrange(df, desc(Men))
df <- mutate(df, School=factor(School, levels=rev(School)))

gg <- ggplot(df, aes(x=Women, xend=Men, y=School))
gg <- gg + geom_dumbbell(colour="#686868",
                         point.colour.l="#ffc0cb",
                         point.colour.r="#0000ff",
                         point.size.l=2.5,
                         point.size.r=2.5)
gg <- gg + scale_x_continuous(breaks=seq(60, 160, by=20),
                              labels=sprintf("$%sK", comma(seq(60, 160, by=20))))
gg <- gg + labs(x="Annual Salary", y=NULL,
                title="Gender Earnings Disparity",
                caption="Data from plotly")
gg <- gg + theme_bw()
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(panel.grid.minor=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.title.x=element_text(hjust=1, face="italic", margin=margin(t=-24)))
gg <- gg + theme(plot.caption=element_text(size=8, margin=margin(t=24)))
gg

The API isn't locked in, so definitely file an issue if you want different or additional functionality. One issue I personally still have is how to identify the left/right points (blue is male and pink is female in this one).

Working Out With Dumbbells

I thought folks might like to see behind the ggcurtain. It really only took the addition of two functions to ggalt: geom_dumbbell() (which you call directly) and GeomDumbbell() which acts behind the scenes.

There are a few additional, custom parameters to geom_dumbbell() and the mapped stat and position are hardcoded in the layer call. We also pass in these new parameters into the params list.

geom_dumbbell <- function(mapping = NULL, data = NULL, ...,
                          point.colour.l = NULL, point.size.l = NULL,
                          point.colour.r = NULL, point.size.r = NULL,
                          na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) {

  layer(
    data = data,
    mapping = mapping,
    stat = "identity",
    geom = GeomDumbbell,
    position = "identity",
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      na.rm = na.rm,
      point.colour.l = point.colour.l,
      point.size.l = point.size.l,
      point.colour.r = point.colour.r,
      point.size.r = point.size.r,
      ...
    )
  )
}

The exposed function eventually calls it's paired Geom. There we get to tell it what are required aes parameters and which ones aren't required, plus set some defaults.

We automagically add yend to the data in setup_data() (which gets called by the ggplot2 API).

Then, in draw_group() we create additional data.frames and return a list of three Geom layers (two points and one segment). Finally, we provide a default legend symbol.

GeomDumbbell <- ggproto("GeomDumbbell", Geom,
  required_aes = c("x", "xend", "y"),
  non_missing_aes = c("size", "shape",
                      "point.colour.l", "point.size.l",
                      "point.colour.r", "point.size.r"),
  default_aes = aes(
    shape = 19, colour = "black", size = 0.5, fill = NA,
    alpha = NA, stroke = 0.5
  ),

  setup_data = function(data, params) {
    transform(data, yend = y)
  },

  draw_group = function(data, panel_scales, coord,
                        point.colour.l = NULL, point.size.l = NULL,
                        point.colour.r = NULL, point.size.r = NULL) {

    points.l <- data
    points.l$colour <- point.colour.l %||% data$colour
    points.l$size <- point.size.l %||% (data$size * 2.5)

    points.r <- data
    points.r$x <- points.r$xend
    points.r$colour <- point.colour.r %||% data$colour
    points.r$size <- point.size.r %||% (data$size * 2.5)

    gList(
      ggplot2::GeomSegment$draw_panel(data, panel_scales, coord),
      ggplot2::GeomPoint$draw_panel(points.l, panel_scales, coord),
      ggplot2::GeomPoint$draw_panel(points.r, panel_scales, coord)
    )

  },

  draw_key = draw_key_point
)

In essence, this new geom saves calls to three additional geom_s, but does add more parameters, so it's not really clear if it saves much typing.

If you end up making anything interesting with geom_dumbbell() I encourage you to drop a note in the comments with a link.