
Category Archives: ggplot

I woke up this morning to a [headline story from the Washington Post](https://www.washingtonpost.com/news/the-fix/wp/2015/12/10/to-many-christian-terrorists-arent-true-christians-but-muslim-terrorists-are-true-muslims/) titled _"Americans are twice as willing to distance Christian extremists from their religion as Muslims"_. This post is not about the content of the headline or the story. It _is_ about the horrible pie chart WaPo led the article with:

[image: WaPo's exploded pie charts]

This isn't just the rant of a madman against pie charts. While I _am_ vehemently opposed to them, we did cover them [in our book](https://books.google.com/books?id=7DqwAgAAQBAJ&pg=PA146&lpg=PA146&dq=data-driven+security+pie+chart&source=bl&ots=Cy1iJylsHd&sig=a6Hz1JB-QYLq6H0VZJpPleJgRkQ&hl=en&sa=X&ved=0ahUKEwj79uqt_tjJAhVG0iYKHS0uDn4Q6AEIMzAH#v=onepage&q=data-driven%20security%20pie%20chart&f=false), and my co-author (@jayjacobs) and the incredibly talented @annkemery both agree there are cases where pies can be appropriate. Even by their more forgiving sensibilities, this would not be one of those cases.

So what, exactly, is the problem? WaPo tried to enable comparison between the two pies by exploding them and using colors to indicate similar fear levels, mapping shades to entries in the top legend. Your eye has to move around quite a bit to take everything in, and you have to keep the color mapping in working memory as you focus on each slice (which you will end up doing, since each category is colored differently). Yet the whole goal was simply to let the reader see the change in sentiment toward terrorism since this time last year.

Hrm. Two dates. Small set of values. Desire to quickly compare change in value/slope. **This sounds like a job for a slopegraph!**

The article and graphic are based on a [survey](http://publicreligion.org/research/2015/12/survey-nearly-half-of-americans-worried-that-they-or-their-family-will-be-a-victim-of-terrorism/). Thankfully the [complete survey data was made available](http://publicreligion.org/site/wp-content/uploads/2015/12/December-2015-PRRI-RNS-Topline1.pdf), which made it easy to do a makeover (in R of course). Here’s the result:

[image: the slopegraph makeover]

Each category's change is clearly visible, you don't need to remember any color associations, and you even get the actual values*.

The R code is below and in [this gist](https://gist.github.com/hrbrmstr/9bf4f93dffc1df48fe27). How would you make the WaPo chart better (drop a note in the comments with a link to your own makeover)?

library(tidyr)
library(ggplot2)
library(ggthemes)
library(scales)
library(dplyr)
 
# Easiest way to transcribe the PDF table
# The slope calculation will enable us to color the lines/points based on up/down
dat <- data_frame(`2014-11-01`=c(0.11, 0.22, 0.35, 0.31, 0.01),
                  `2015-12-01`=c(0.17, 0.30, 0.30, 0.23, 0.00),
                  slope=factor(sign(`2014-11-01` - `2015-12-01`)),
                  fear_level=c("Very worried", "Somewhat worried", "Not too worried",
                               "Not at all", "Don't know/refused"))
 
# Transform that into something we can use
dat <- gather(dat, month, value, -fear_level, -slope)
 
# We need real dates for the X-axis manipulation
dat <- mutate(dat, month=as.Date(as.character(month)))
 
# Since 2 categories have the same ending value, we need to
# take care of that (this is one of a few "gotchas" in slopegraph preparation)
end_lab <- dat %>%
  filter(month==as.Date("2015-12-01")) %>%
  group_by(value) %>%
  summarise(lab=sprintf("%s", paste(fear_level, collapse=", ")))
 
gg <- ggplot(dat)
 
# line
gg <- gg + geom_line(aes(x=month, y=value, color=slope, group=fear_level), size=1)
# points
gg <- gg + geom_point(aes(x=month, y=value, fill=slope, group=fear_level),
                      color="white", shape=21, size=2.5)
 
# left labels
gg <- gg + geom_text(data=filter(dat, month==as.Date("2014-11-01")),
                     aes(x=month, y=value, label=sprintf("%s — %s  ", fear_level, percent(value))),
                     hjust=1, size=3)
# right labels
gg <- gg + geom_text(data=end_lab,
                     aes(x=as.Date("2015-12-01"), y=value,
                         label=sprintf("  %s — %s", percent(value), lab)),
                     hjust=0, size=3)
 
# Here we do some slightly tricky x-axis formatting to ensure we have enough
# space for the in-panel labels, only show the months we need and have
# the month labels display properly
gg <- gg + scale_x_date(expand=c(0.125, 0),
                        labels=date_format("%b\n%Y"),
                        breaks=c(as.Date("2014-11-01"), as.Date("2015-12-01")),
                        limits=c(as.Date("2014-02-01"), as.Date("2016-12-01")))
gg <- gg + scale_y_continuous()
 
# I used colors from the article
gg <- gg + scale_color_manual(values=c("#f0b35f", "#177fb9"))
gg <- gg + scale_fill_manual(values=c("#f0b35f", "#177fb9"))
gg <- gg + labs(x=NULL, y=NULL, title="Fear of terror attacks (change since last year)\n")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text.y=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0.5))
gg

* Well, it's a survey. To add insult to injury, it's a sentiment-based survey given right after an attack that was likely to be attributed to terrorism. Also, there is a margin of error that isn't communicated in either visualization. So while there is "data", trust it at your own peril.

Cybersecurity is a domain that really likes surveys, or at the very least has many folks within it who like to conduct and report on them. One recent survey on threat intelligence is in its second year, so it sets about comparing answers across years. Rather than go into the many technical/statistical issues with this survey, I'd like to focus on alternate ways to visualize the year-over-year comparison.

We'll use the data behind Figure 3 from the report, since it's pretty representative of the remainder of the figures:

[image: Figure 3 from the report, a dodged bar chart]

Let’s start by reproducing this figure with ggplot2:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(ggthemes)
library(extrafont)

loadfonts(quiet=TRUE)

read.csv("question.csv", stringsAsFactors=FALSE) %>%
  gather(year, value, -belief) %>%
  mutate(year=factor(sub("y", "", year)),
         belief=str_wrap(belief, 40)) -> question

beliefs <- unique(question$belief)
question$belief <- factor(question$belief, levels=rev(beliefs[c(1,2,4,5,3,7,6)]))

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity", position="dodge",
                    color="white", width=0.85)
gg <- gg + geom_text(aes(label=percent(value)), hjust=-0.15,
                     position=position_dodge(width=0.8), size=3)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.8))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Now, the survey does caveat the findings and talks about non-response bias, sampling-frame bias and self-reporting bias. However, nowhere does it talk about the margin of error or anything relating to uncertainty. Thankfully, both the 2014 and 2015 reports communicate population and sample sizes, so we can figure out the margin of error:

library(samplesize4surveys)

moe_2014 <- e4p(19915, 701, 0.5)
## With the parameters of this function: N = 19915 n =  701 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.709879 . 
## The margin of error is 3.635614 . 
## 

moe_2015 <- e4p(18705, 692, 0.5)
## With the parameters of this function: N = 18705 n =  692 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.730449 . 
## The margin of error is 3.655773 .
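
As a quick sanity check, the textbook formula with a finite-population correction lands on the same numbers (this aside is mine, not the report's):

# MOE at p=0.5: z * sqrt(p(1-p)/n) * sqrt((N-n)/(N-1))
p <- 0.5; z <- qnorm(0.975)
moe <- function(N, n) z * sqrt(p*(1-p)/n) * sqrt((N-n)/(N-1))
round(100 * c(moe(19915, 701), moe(18705, 692)), 2)
## [1] 3.64 3.66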

They are both roughly 3.65%, so let's take a look at our dodged bar chart again with this new information:

mutate(question, ymin=value-0.0365, ymax=value+0.0365) -> question

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity",
                    position=position_dodge(0.85),
                    color="white", width=0.85)
gg <- gg + geom_linerange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.85),
                         size=1.5, color="#bdbdbd")
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.85))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Hrm. There seems to be a bit of overlap. Let's just focus on that:

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_pointrange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.25),
                         size=1, color="#bdbdbd", fatten=1)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,1))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

The report actually makes hard claims based on the year-over-year changes in the answers to many of the questions (not just this chart), and most have these overlapping intervals. Now, I understand that a paying customer who commissions a report wouldn't really be satisfied with a one-pager saying "see last year's report", but not communicating the uncertainty in these results seems like a significant omission.

But, I digress. There are better (or at least alternate) ways than bars to show this comparison. One is a "dumbbell chart".

question %>%
  group_by(belief) %>%
  mutate(line_col=ifelse(diff(value)<0, "2015", "2014"),
         hjust=ifelse(diff(value)<0, -0.5, 1.5)) %>%
  ungroup() -> question

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

I've used line color to indicate whether the 2015 value increased or decreased from 2014.

But we still have the issue of communicating the margin of error. One way I came up with (which is not perfect) is to superimpose the dot-plot on top of the entire margin-of-error interval. While it doesn't show the discrete start/end margins for each year, it does help show that making definitive statements about the value comparisons is not exactly a good idea:

group_by(question, belief) %>%
  summarize(xmin=min(ymin), xmax=max(ymax)) -> band

gg <- ggplot(question)
gg <- gg + geom_segment(data=band,
                        aes(x=xmin, xend=xmax, y=belief, yend=belief),
                        color="#bdbdbd", alpha=0.5, size=3)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col),
                     show.legend=FALSE)
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Finally, the year-to-year nature of the data was just begging for a slopegraph:

# nudge two of the right-hand labels vertically so they don't overlap
question %>% mutate(vjust=0.5) -> question
question[(question$belief=="Makes threat data more actionable") &
           (question$year=="2015"),]$vjust <- -1
question[(question$belief=="Reduces the cost of detecting and\npreventing cyber attacks") &
           (question$year=="2015"),]$vjust <- 1.5

# pad the year factor with empty years so the panel leaves room for the
# long right-hand labels (drop=FALSE in scale_x_discrete() keeps them)
question$year <- factor(question$year, levels=c("2013", "2014", "2015", "2016", "2017", "2018"))

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=year, y=value, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=year, y=value), shape=21, fill="black", color="white")
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=year, y=value,
                         label=sprintf("\u2000%s %s", percent(value),
                                       gsub("\n", " ", belief)),
                         vjust=vjust), hjust=0, size=3)
gg <- gg + geom_text(data=filter(question, year=="2014"),
                     aes(x=year, y=value, label=percent(value)),
                     hjust=1.3, size=3)
gg <- gg + scale_x_discrete(expand=c(0,0.1), drop=FALSE)
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

It doesn't help communicate uncertainty but it's a nice alternative to bars.

Hopefully this helps provide some alternatives to bars for these types of comparisons and also ways to communicate uncertainty without confusing the reader (communicating uncertainty to a broad audience is hard).

Perhaps those conducting surveys (or data analyses in general) could subscribe to a "data visualizer's" paraphrase of a quote from Epidemics, Book I, of the Hippocratic school:

"Practice two things in your dealings with data: either help or do not harm the reader."

The full Rmd and data for this post are in this gist.

This occurrence of the biannual corruption of the space-time continuum (i.e. changing to/from standard/daylight time) in the U.S. caused me to make a slight change to the code [from an older post](https://rud.is/b/2014/09/23/seeing-the-daylight-with-r/). The `daylight()` function now auto-discovers the caller's date and location information (via [telize](http://www.telize.com/)), which means that all you have to do to get a plot like this:

[image: daylight() plot output]

is to source the [new gist](https://gist.github.com/hrbrmstr/e435d4fa0c31b8e1a9d0) like this:

devtools::source_gist("e435d4fa0c31b8e1a9d0", sha1="64e859227266dc5f9008b3b3959a19fea373fee6")

Remember that you should verify any code before blindly `source`ing it (in R or anywhere else) and make sure to use the SHA1 hash so you know you’re sourcing the proper code (and not potentially being pwnd).

Note that the granularity/accuracy of the geolocation is only as good as the Telize service (which uses MaxMind). The fact that this shows Vermont instead of Maine should make you all think thrice about trusting IP geolocation in general, especially you world-mapping cybersecurity folks.
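
For the curious, the location auto-discovery boils down to a single JSON call (a minimal sketch of the idea; the gist wraps this with error handling and the actual daylight math, and the field names are as the service returned them at the time of writing):

library(jsonlite)

# ask Telize where this IP appears to be
geo <- fromJSON("http://www.telize.com/geoip")
geo$latitude; geo$longitude; geo$region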

Sadly, the darkest of days is still yet to come.

Junk Charts [adeptly noted and fixed](http://junkcharts.typepad.com/junk_charts/2015/10/is-it-worth-the-drama.html) this excessively stylized chart from the WSJ this week:

[image: the WSJ's original avian flu chart]

Their take on it does reduce the ZOMGOSH WE ARE DOOMED! look and feel of the WSJ chart:

[image: Junk Charts' makeover]

But we can further reduce the drama by using a more neutral color encoding _and_ encoding both the # of outbreaks and the total size of the impacted flock populations _per week_ with a lollipop chart (thankfully, the USDA makes this data readily available):

library(xml2)
library(rvest)
library(dplyr)
library(stringr)
library(ggplot2)
library(scales)
library(viridis)
library(ggthemes)
 
pg <- read_html("https://www.aphis.usda.gov/wps/portal/aphis/ourfocus/animalhealth/sa_animal_disease_information/sa_avian_health/sa_detections_by_states/ct_ai_pacific_flyway/!ut/p/a1/lVNNb-IwEP0tPewx2OSbI_QDwkdBRbuQXKyJ4yTWJnYUG1D-fZ10D7RqadcHS555M_PmPRkl6IgSAWdegOZSQNW_E58stwt7PMN2NN9PHnH0_OdpE64DZ7twDSA2APzFmeL39dtV5Pf1L3i2eBjjvYMOKEEJFbrRJYqhKbkiVArNhCYVT1tou19YAZGnluSSntTwAsFrqEjJoNLldSTjioFihItctvWwxFv6zEFc4zOmGe3TiqQdURo0M62pJsBJA5TnnJK86i7Q9fwayjMU57ZPAezccnwvtdwM21aah9hcGHtuCP6Y5v_0uLHwj_S8n08XbrA2CrqhjaMHUxxMNhhH_nf1g57fdBgAtyz7MGT-ODNDnta7YLW08cpDsSEZfMli4qL9f6q2_IEsdru53xSmLejS6g1Gx5vGv6WvjB8CnxmPjp8af5ihxJNBpIqeX1HJdPgQ8VSkTmiItCxnLWtHpVQaHS-Xy-ikMhgV8oya-ncdOh23_r6E2PGqYrerD9O7u1eBlNG5/?1dmy&urile=wcm%3apath%3a%2Faphis_content_library%2Fsa_our_focus%2Fsa_animal_health%2Fsa_animal_disease_information%2Fsa_avian_health%2Fsa_detections_by_states%2Fct_ai_full_list")
 
dat <- html_table(html_nodes(pg, "table"))[[1]]
 
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         week = format(`Confirmation date`, "%Y-%U"),  # year-week key
         week_start = as.Date(sprintf("%s-1", week), "%Y-%U-%u"),  # back out a real date (day 1) for that week
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  select(week, week_start, `Flock size`) %>% 
  filter(!is.na(`Flock size`)) %>% 
  group_by(week_start) %>% 
  summarize(outbreaks=n(), 
            flock_total=sum(`Flock size`)) -> dat
 
first <- dat[2,]   # row 1 is the initial late-2014 detection; row 2 is the first 2015 week
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=week_start, y=outbreaks))
gg <- gg + geom_vline(xintercept=as.numeric(first$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=week_start, y=25), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$week_start), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=week_start, y=25), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=week_start, xend=week_start, y=0, yend=outbreaks, color=flock_total), size=0.5)
gg <- gg + geom_point(aes(size=flock_total, fill=flock_total), shape=21)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="# Outbreaks", title="Avian Flu Impact by Week (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg

[image: avian flu outbreaks per week, lollipop chart]

If we really want to see the discrete events, we can do that with our less-ZOMGOSH color scheme, too:

dat <- html_table(html_nodes(pg, "table"))[[1]]
dat %>% 
  mutate(`Confirmation date` = as.Date(`Confirmation date`, "%b %d, %Y"),
         `Flock size` = as.numeric(str_replace_all(`Flock size`, ",", ""))) %>% 
  filter(!is.na(`Flock size`)) %>% 
  rename(date=`Confirmation date`) %>% 
  arrange(date) -> dat
 
first <- dat[2,]   # again, skip the late-2014 detection in row 1
last <- tail(dat, 1)
 
gg <- ggplot(dat, aes(x=date, y=`Flock size`))
gg <- gg + geom_vline(xintercept=as.numeric(first$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=first, aes(x=date, y=3000000), label=" First detection in 2015", hjust=0, size=3, color="#7f7f7f")
gg <- gg + geom_vline(xintercept=as.numeric(last$date), linetype="dashed", size=0.2, color="#7f7f7f")
gg <- gg + geom_text(data=last, aes(x=date, y=3000000), label="Last detection ", hjust=1, size=3, color="#7f7f7f")
gg <- gg + geom_segment(aes(x=date, xend=date, y=0, yend=`Flock size`, color=`Flock size`), size=0.5, alpha=1)
gg <- gg + scale_size_continuous(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_color_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_fill_viridis(name="Flock Impact", label=comma, guide="legend")
gg <- gg + scale_x_date(label=date_format("%b"))
gg <- gg + scale_y_continuous(label=comma)
gg <- gg + guides(color=guide_legend(), fill=guide_legend(), size=guide_legend())
gg <- gg + labs(x=NULL, y="Flock size", title="Avian Flu Impact (2015)")
gg <- gg + theme_tufte(base_family="Helvetica")
gg <- gg + theme(legend.key=element_rect(color=rgb(0,0,0,0)))
gg

[image: per-detection flock sizes, lollipop chart]

Neither of those is ever going to sell any ads, tho.

I saw this post over at NatGeo over the weekend and felt compelled to replicate their graphic with ggplot2.

Three shapefiles later and we have it close enough to toss into a post (and I really don’t believe the continent names are necessary).

library(rgdal)
library(ggplot2)
library(ggthemes)
library(ggalt) # devtools::install_github("hrbrmstr/ggalt")

# grab these from http://rud.is/dl/quakefiles.tgz

world <- readOGR("countries.geo.json", "OGRGeoJSON", stringsAsFactors=FALSE)
plates <- readOGR("plates.json", "OGRGeoJSON", stringsAsFactors=FALSE)
quakes <- readOGR("quakes.json", "OGRGeoJSON", stringsAsFactors=FALSE)

world_map <- fortify(world)
plates_map <- fortify(plates)
quakes_dat <- data.frame(quakes)
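# magnitudes in this extract run ~5-8, so mod-5 maps them onto ~0-3 for the
# exponential size transform below; the labels=c(5:8) in scale_size() puts
# the original magnitude values back on the legend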
quakes_dat$trans <- quakes_dat$mag %% 5

gg <- ggplot()
gg <- gg + geom_cartogram(data=world_map, map=world_map,
                          aes(x=long, y=lat, map_id=id),
                          color="white", size=0.15, fill="#d8d8d6")
gg <- gg + geom_cartogram(data=plates_map, map=plates_map,
                          aes(x=long, y=lat, map_id=id),
                          color="black", size=0.1, fill="#00000000", alpha=0)
gg <- gg + geom_point(data=quakes_dat,
                      aes(x=coords.x1, y=coords.x2, size=trans),
                      shape=1, alpha=1/3, color="#d47e5d", fill="#00000000")
gg <- gg + geom_point(data=subset(quakes_dat, mag>7.5),
                      aes(x=coords.x1, y=coords.x2, size=trans),
                      shape=1, alpha=1, color="black", fill="#00000000")
gg <- gg + geom_text(data=subset(quakes_dat, mag>7.5),
                     aes(x=coords.x1, y=coords.x2, label=sprintf("Mag %2.1f", mag)),
                     color="black", size=3, vjust=c(3.9, 3.9, 5), fontface="bold")
gg <- gg + scale_size(name="Magnitude", trans="exp", labels=c(5:8), range=c(1, 20))
gg <- gg + coord_map("mollweide")
gg <- gg + theme_map()
gg <- gg + theme(legend.position=c(0.05, 0.99))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(legend.key=element_rect(color="#00000000"))
gg

[image: the earthquake map replica]

I can only imagine how many mouse clicks that would be in a GIS program.


Junk Charts did a post on [Don’t pick your tool before having your design](http://junkcharts.typepad.com/junk_charts/2015/09/dont-pick-your-tool-before-having-your-design.html) and made a claim that this:

[image: the chart in question, from Junk Charts]

_”cannot be produced directly from a tool (without contorting your body in various painful locations)”_.

I beg to differ.

With R & ggplot2, I get to pick my tool and my design at the same time, since I have a very flexible, multi-purpose tool. I also don't believe the code below qualifies as "contortions", though I am a ggplot2 fanboi. It's no different than Excel folks clicking on radio buttons and color pickers, except that my process is easily repeatable & scalable once finalized (this version isn't 100% parameterized yet, but that last part isn't difficult).

library(ggplot2)
 
dat <- data.frame(year=2010:2015,
                  penalties=c(627, 625, 653, 617, 661, 730))
 
avg <- data.frame(val=mean(head(dat$penalties, -1)),
                  last=dat$penalties[6],
                  lab="5-Yr\nAvg")
 
gg <- ggplot(dat, aes(x=year, y=penalties))
gg <- gg + geom_point()
gg <- gg + scale_x_continuous(breaks=c(2010, 2014, 2015), limits=c(NA, 2015.1))
gg <- gg + scale_y_continuous(breaks=c(600, 650, 700, 750), 
                              limits=c(599, 751), expand=c(0,0))
gg <- gg + geom_segment(data=avg, aes(x=2010, xend=2015, y=val, yend=val), linetype="dashed")
gg <- gg + geom_segment(data=avg, aes(x=2015, xend=2015, y=val, yend=last), color="steelblue")
gg <- gg + geom_point(data=avg, aes(x=2015, y=val), shape=4, size=3)
gg <- gg + geom_text(data=avg, aes(x=2015, y=val), label="5-Yr\nAvg", size=2.5, hjust=-0.3)
gg <- gg + geom_point(data=avg, aes(x=2015, y=700), shape=17, col="steelblue")
gg <- gg + geom_point(data=avg, aes(x=2015, y=730), shape=4, size=3)
gg <- gg + labs(x=NULL, y="Number of Penalties", 
                title="NFL Penalties Jumped 15% in the\nFirst 3 Weeks of the 2015 Season\n")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid.minor=element_blank())
gg <- gg + theme(panel.grid.major.x=element_blank())
gg <- gg + theme(panel.grid.major.y=element_line(color="white"))
gg <- gg + theme(panel.background=element_rect(fill="#f3f2f7"))
gg <- gg + theme(axis.ticks=element_blank())
gg

[image: the ggplot2 reproduction]

A huge change is coming to ggplot2 and you can get a preview of it over at Hadley’s github repo. I’ve been keenly interested in this as I will be fixing, finishing & porting coord_proj to it once it’s done.

Hadley & Winston have re-built ggplot2 with an entirely new object-oriented system called ggproto. With ggproto, it's now possible to easily extend ggplot2 from within your own packages (since source() is so last century), oftentimes with very little effort.

Before attempting to port coord_proj, I wanted to work through adding a Geom and a Stat, since I thought it would be cool to have interpolated line charts (it also helps answer some recurring StackOverflow "spline"/ggplot2 questions) and since I prefer KernSmooth::bkde over the built-in density function (which geom_density and stat_density both use).
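
If you want to eyeball the difference between those two estimators yourself, a few lines of base R will do it (a quick sketch using both estimators' default bandwidths; nothing ggplot2-specific here):

library(KernSmooth)

set.seed(1492)
x <- c(rnorm(300), rnorm(100, mean=4))  # a deliberately bimodal sample

plot(density(x), main="stats::density() vs KernSmooth::bkde()")
lines(bkde(x), col="red")  # bkde() returns a list with x & y, so lines() just works
legend("topright", c("density()", "bkde()"), col=c("black", "red"), lty=1, bty="n")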

To that end, I’ve made a new github-installable package called ggalt (h/t to @jayjacobs for the better package name than I originally came up with) where I’ll be adding new Geoms, Stats, Coords (et al) as I craft them. For now, let me introduce both geom_xspline() and geom_bkde() to show how easy it is to incorporate new functionality into ggplot2.

While not a requirement, I think it's going to be a good idea to make a paired Geom and Stat when adding these types of functionality to ggplot2. I found it easier to work with custom parameters this way, and it also makes the additions feel a bit more like the way ggplot2 itself works. For the interpolated line geom/stat I used R's graphics::xspline function. Here's all it took to give ggplot2 lines some curves (you can find the commented version on github):

geom_xspline <- function(mapping = NULL, data = NULL, stat = "xspline",
                      position = "identity", show.legend = NA,
                      inherit.aes = TRUE, na.rm = TRUE,
                      spline_shape=-0.25, open=TRUE, rep_ends=TRUE, ...) {
  layer(
    geom = GeomXspline,
    mapping = mapping,
    data = data,
    stat = stat,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(spline_shape=spline_shape,
                  open=open,
                  rep_ends=rep_ends,
                  ...)
  )
}

GeomXspline <- ggproto("GeomXspline", GeomLine,
  required_aes = c("x", "y"),
  default_aes = aes(colour = "black", size = 0.5, linetype = 1, alpha = NA)
)

stat_xspline <- function(mapping = NULL, data = NULL, geom = "line",
                     position = "identity", show.legend = NA, inherit.aes = TRUE,
                     spline_shape=-0.25, open=TRUE, rep_ends=TRUE, ...) {
  layer(
    stat = StatXspline,
    data = data,
    mapping = mapping,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(spline_shape=spline_shape,
                  open=open,
                  rep_ends=rep_ends,
                  ...
    )
  )
}

StatXspline <- ggproto("StatXspline", Stat,

  required_aes = c("x", "y"),

  compute_group = function(self, data, scales, params,
                           spline_shape=-0.25, open=TRUE, rep_ends=TRUE) {
    tf <- tempfile(fileext=".png")
    png(tf)
    plot.new()
    tmp <- xspline(data$x, data$y, spline_shape, open, rep_ends, draw=FALSE, NA, NA)
    invisible(dev.off())
    unlink(tf)

    data.frame(x=tmp$x, y=tmp$y)
  }
)

If that seems like a lot of code, it really isn't. What we have there are:

  • two functions that handle the Geom aspects &
  • two functions that handle the Stat aspects.

Let's look at the Stat functions first, though you can also just read the handy vignette, too.

Adding Stats

In this particular case, we have it easy. We get to use geom_line/GeomLine as the base geom for the layer, since all we're doing is generating more points for it to draw line segments between. We create the interface to our new Stat with stat_xspline, which adds three new parameters with default values:

  • spline_shape
  • open
  • rep_ends

"Added three new parameters to what?" you ask? GeomLine/geom_line default to StatIdentity/stat_identity and if you look at the source code, that Stat just returns the data back in the form it came in. We're going to take these three new parameters and pass them to xspline and then return entirely new values back for ggplot2/grid to draw for us, so we tell it to call our new computation engine by giving it the StatXspline value to the layer. By using GeomLine/geom_line as the geom parameter, all we have to do is ensure we pass back the proper values. We do that in compute_group since ggplot2 will segment the incoming data into groups (via the group aesthetic) for us. We take each group and run them through the xspline with the parameters the user specified. If I didn't have to use the hack to work around what seems to be errant plot device issues in xspline, the call would be one line.

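Without that workaround, it would look something like this (a sketch of the intent, not the shipped code):

compute_group = function(data, scales, spline_shape=-0.25, open=TRUE, rep_ends=TRUE) {
  as.data.frame(xspline(data$x, data$y, spline_shape, open, rep_ends, draw=FALSE)[c("x", "y")])
}
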
Adding Geoms

We pair the Stat with a very basic Geom "shim" so we can use them interchangeably. It's the same idiom: an "object" function plus a user-callable function. In this case it's super-lightweight, since we're really having geom_line do all the work for us. In a [very] future post, I'll cover more complex Geoms that require use of the underlying grid graphics system, but I suspect most of your own additions will be able to use the lightweight idiom shown here (and that's covered in the vignette).

Putting Our New Functions To Work

With our new additions to ggplot2, we can compare the output of geom_smooth to geom_xspline with some test data:

set.seed(1492)
dat <- data.frame(x=c(1:10, 1:10, 1:10),
                  y=c(sample(15:30, 10), 2*sample(15:30, 10), 3*sample(15:30, 10)),
                  group=factor(c(rep(1, 10), rep(2, 10), rep(3, 10)))
)

ggplot(dat, aes(x, y, group=group, color=factor(group))) +
  geom_point(color="black") +
  geom_smooth(se=FALSE, linetype="dashed", size=0.5) +
  geom_xspline(size=0.5)

[image: geom_smooth (dashed) vs geom_xspline comparison]

The github page has more examples for the function, but you don't have to be envious of the smooth D3 curves any more.
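
The spline_shape parameter is the main knob: per ?xspline, negative values give interpolating splines that pass through the control points, while positive values give approximating splines that are pulled toward them. For instance, re-running the example above with an approximating shape:

ggplot(dat, aes(x, y, group=group, color=factor(group))) +
  geom_point(color="black") +
  geom_xspline(spline_shape=1, size=0.5)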

I realize this particular addition is not extremely helpful/beneficial, but the next one is. We'll look at adding a new/more accurate density Stat/Geom in the next installment and then discuss the "on-steroids" roxygen2 comments you'll end up using for your creations in part 3.

Time for another Twitter-inspired blog post this week, this time from a tweet by @JonKalodimos.

I had seen and appreciated Ann’s post on her makeover of the main graphic in [NPR’s story](http://www.npr.org/sections/money/2014/10/21/357629765/when-women-stopped-coding) and did a quick mental check of how I’d do the same in ggplot2 as I was reading it. Jon’s question was a good prompt to dump physical memory to internet memory.

Here’s the NPR graphic:

[image: NPR's original "When Women Stopped Coding" graphic]

It is actually pretty darn good on its own, but I also agree with Ann that direct labeling could have made it better. Here's her makeover:

[image: Ann's direct-labeled makeover]

Let’s see how to do this in ggplot2. We’ll use the actual data from NPR’s story since the graphic was built with D3 and, hence, the data is part of the graphic. Let’s get the `library` stuff out of the way:

library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)
library(gridExtra)
library(grid)

Now, we’ll grab the CSV that the NPR folks used for the graphic and take a look at it. I found it via Developer Tools in Chrome:

# use the NPR story data file ---------------------------------------------
# and be kind to NPR's bandwidth budget
url <- "http://apps.npr.org/dailygraphics/graphics/women-cs/data.csv"
fil <- "gender.csv"
if (!file.exists(fil)) download.file(url, fil)
 
gender <- read.csv(fil, stringsAsFactors=FALSE)
 
# take a look at the CSV structure ----------------------------------------
 
glimpse(gender)
 
## Observations: 48
## Variables:
## $ date              (int) 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, ...
## $ Medical.School    (dbl) 0.09, 0.10, 0.10, 0.09, 0.09, 0.11, 0.14, 0.17, 0.20, 0.22, 0.24, 0.25, 0.25, 0.25, 0.28, 0.29, 0.31, ...
## $ Law.School        (chr) "0.04", "0.04", "0.05", "0.07", "0.07", "0.1", "0.12", "0.16", "0.2", "0.24", "0.27", "0.28", "0.3", "...
## $ Physical.Sciences (chr) "0.14", "0.14", "0.14", "0.14", "0.14", "0.15", "0.16", "0.16", "0.17", "0.19", "0.2", "0.2", "0.22", ...
## $ Computer.science  (dbl) 0.146, 0.108, 0.120, 0.130, 0.129, 0.136, 0.136, 0.149, 0.164, 0.190, 0.198, 0.239, 0.258, 0.281, 0.30...
 
tail(gender)
 
##    date Medical School Law School Physical Sciences Computer science
## 43 2008           0.48       0.47              0.41            0.177
## 44 2009           0.48       0.47              0.42            0.179
## 45 2010           0.48       0.47              0.41            0.182
## 46 2011           0.47         tk                tk            0.177
## 47 2012           0.47         tk                tk            0.182
## 48 2013           0.46         tk                              0.179

Those `tk` values are referred to in the [code that makes the NPR graphic](http://apps.npr.org/dailygraphics/graphics/women-cs/js/graphic.js) so we’ll replace them with `NA`s and make all the columns numeric:

gender <- mutate_each(gender, funs(as.numeric))

We should also clean up the column names since we’ll be using them for the legend and the direct labels:

colnames(gender) <- str_replace(colnames(gender), "\\.", " ")
 
gender_long <- mutate(gather(gender, area, value, -date),
                      area=factor(area, levels=colnames(gender)[2:5],
                                  ordered=TRUE))

That code link also has the colors NPR used for the graphic, so let's define those now since we bothered to look at it:

gender_colors <- c('#11605E', '#17807E', '#8BC0BF','#D8472B')
names(gender_colors) <- colnames(gender)[2:5]

We’ll be needing those names later on, hence why I named the values in the vector.
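(That's what lets the direct-labeling loop at the end look a color up by area name, e.g. `gender_colors["Law School"]` returns "#17807E".)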

With the data, labels and colors defined, we can make a “standard” ggplot:

chart_title <- expression(atop("What Happened To Women In Computer Science?",
                               atop(italic("% Of Women Majors, By Field"))))
 
gg <- ggplot(gender_long)
gg <- gg + geom_line(aes(x=date, y=value, group=area, color=area))
gg <- gg + scale_color_manual(name="", values=gender_colors)
gg <- gg + scale_y_continuous(label=percent)
gg <- gg + labs(x=NULL, y=NULL, title=chart_title)
gg <- gg + theme_bw(base_family="Helvetica")
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.key=element_blank())
gg

[image: the "standard" ggplot2 version]

That's also pretty good on its own. It's possible to make it look a bit more like the NPR chart, but it's hard to format a title & subtitle in a ggplot title _and_ have it left-justified, so I opted for font style. It's also possible to make the legend look like NPR's, but that's not the point of the post.

So, how do we make this look more like Ann’s makeover?

First we need to get the last values for each of the variables so we know what point on the `y` axis we need to place the labels. That’s made a bit trickier with the `NA`s:

last_vals <- sapply(colnames(gender)[2:5], function(x) last(na.exclude(gender[,x])))
last_date <- tail(gender$date, 1) + 1 # doing this ^ would have made it a double

Next, we need to turn off the legend and increase the plot margin on the right-hand side:

gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.margin = unit(c(1, 7, 2, 1), "lines"))

I figured out those #'s by interactive trial-and-error, though I initially guessed `6` for the right-hand margin increase. Also, this should demonstrate one reason for the `gg <- gg +` madness you see in my code/posts: when you start doing more in ggplot, you end up with that idiom more oft than not. Now we add the labels. We do it with custom annotations that are placed "one year" after the latest `x` value and at the same `y` value as the last reading of each area. We also color each label the same as its line, which is why we needed a named vector.

for (i in 1:length(last_vals)) {
  gg <- gg + annotation_custom(grob=textGrob(names(last_vals)[i], hjust=0,
                                             gp=gpar(fontsize=8, 
                                                     col=gender_colors[names(last_vals)[i]])),
                               xmin=2014, xmax=2014,
                               ymin=last_vals[i], ymax=last_vals[i])
}

Finally, we have to do some of the remaining work by hand, since we have to turn off panel clipping and the only way I know how to do that is at the grob/gtable level (it's not that scary or complex a task). Also, since we are manipulating the built ggplot object, we have to use `grid.draw` to present our chart:

gb <- ggplot_build(gg)
gt <- ggplot_gtable(gb)
 
gt$layout$clip[gt$layout$name=="panel"] <- "off"
 
grid.draw(gt)

Here’s the result:

[image: the direct-labeled version]

I've deliberately left the fonts a bit small and not changed the labels' positions on the `y`-axis to give readers a bit of homework. They both _should_ be changed and the plot margins could also be tweaked a tad. You can find the complete code [on github](https://gist.github.com/hrbrmstr/83deb0baeabae0824389) so tweak away!
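
A hint for that homework: the label positions come straight from `last_vals`, so nudging entries before the annotation loop is all it takes (the offsets below are illustrative guesses, not tuned values):

last_vals["Law School"]     <- last_vals["Law School"] + 0.015
last_vals["Medical School"] <- last_vals["Medical School"] - 0.015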

If you have another way to accomplish the same task or want to show off your tweaked version, drop a note in the comments or at that gist link.