
>UPDATE: Since I added a “pull request” requirement, I wanted to include a link to getting started with GitHub. Dr. Jenny Bryan’s @stat545 has a great [section on git](https://stat545-ubc.github.io/git00_index.html) that should hopefully make it a bit less painful.

### Why 52Vis?

In case folks are wondering why I’m doing this, it’s pretty simple. We need a society that has high data literacy and we need folks who are capable of making awesome, truthful data visualizations. The only way to do that is by working with data over, and over, and over, and over again.

Directed projects with some reward are one of the best Pavlovian ways to accomplish that :-)

### This week’s challenge

The Data is Plural folks have [done it again](http://tinyletter.com/data-is-plural/letters/data-is-plural-2016-04-06-edition) and there’s a neat and important data set in this week’s vis challenge.

From their newsletter:

>_Every January, at the behest of the U.S. Department of Housing and Urban Development, volunteers across the country attempt to count the homeless in their communities. The result: HUD’s “point in time” estimates, which are currently available for 2007–2015. The most recent estimates found 564,708 homeless people nationwide, with 75,323 of that count (more than 13%) living in New York City._

I decided to take a look at this data by seeing which states had the worst homeless problem per-capita (i.e. per 100K population). I’ve included the population data along with some ready-made wrangling of the HUD data.

But, before we do that…

### RULES UPDATE + Last week’s winner

I’ll be announcing the winner on Thursday since I:

– am horribly sick after being exposed to who knows what after rOpenSci last week in SFO :-)
– have been traveling like mad this week
– need to wrangle all the answers into the github repo and get @laneharrison (and his students) to validate my choice for winner (I have picked a winner)

Given how hard the wrangling has been, I’m going to need to request that folks both leave a blog comment and file a PR to [the github repo](https://github.com/52vis/2016-14) for this week. Please include the code you used as well as the vis (or a link to a working interactive vis).

### PRIZES UPDATE

Not only can I offer [Data-Driven Security](http://dds.ec/amzn), but Hadley Wickham has offered signed copies of his books as well, and I’ll keep the Amazon gift card in the mix as a catch-all for repeat winners (NOTE: if any other authors want to offer up their tomes, shoot me a note!).

### No place to roam

Be warned: this was a pretty depressing data set. I went in with the question of wanting to know which “states” had the worst problem and I assumed it’d be California or New York. I had no idea it would be what it was and the exercise shattered some assumptions.

NOTE: I’ve included U.S. population data for the necessary time period.

library(readxl)
library(purrr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(grid)
library(hrbrmisc)
 
# grab the HUD homeless data
 
URL <- "https://www.hudexchange.info/resources/documents/2007-2015-PIT-Counts-by-CoC.xlsx"
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil, mode="wb")
 
# turn the excel tabs into a long data.frame
yrs <- 2015:2007
names(yrs) <- 1:9
homeless <- map_df(names(yrs), function(i) {
  df <- suppressWarnings(read_excel(fil, as.numeric(i)))
  df[,3:ncol(df)] <- suppressWarnings(lapply(df[,3:ncol(df)], as.numeric))
  new_names <- tolower(make.names(colnames(df)))
  new_names <- str_replace_all(new_names, "\\.+", "_")
  df <- setNames(df, str_replace_all(new_names, "_[[:digit:]]+$", ""))
  bind_cols(df, data_frame(year=rep(yrs[i], nrow(df))))
})
 
# clean it up a bit
homeless <- mutate(homeless,
                   state=str_match(coc_number, "^([[:alpha:]]{2})")[,2],
                   coc_name=str_replace(coc_name, " CoC$", ""))
homeless <- select(homeless, year, state, everything())
homeless <- filter(homeless, !is.na(state))
 
# read in the us population data
uspop <- read.csv("uspop.csv", stringsAsFactors=FALSE)
uspop_long <- gather(uspop, year, population, -name, -iso_3166_2)
uspop_long$year <- sub("X", "", uspop_long$year)
 
# normalize the values
states <- count(homeless, year, state, wt=total_homeless)
states <- left_join(states, albersusa::usa_composite()@data[,3:4], by=c("state"="iso_3166_2"))
states <- ungroup(filter(states, !is.na(name)))
states$year <- as.character(states$year)
states <- mutate(left_join(states, uspop_long), homeless_per_100k=(n/population)*100000)
 
# we want to order from worst to best
group_by(states, name) %>%
  summarise(mean=mean(homeless_per_100k, na.rm=TRUE)) %>%
  arrange(desc(mean)) -> ordr
 
states$year <- factor(states$year, levels=as.character(2006:2016))
states$name <- factor(states$name, levels=ordr$name)
 
# plot
#+ fig.retina=2, fig.width=10, fig.height=15
gg <- ggplot(states, aes(x=year, y=homeless_per_100k))
gg <- gg + geom_segment(aes(xend=year, yend=0), size=0.33)
gg <- gg + geom_point(size=0.5)
gg <- gg + scale_x_discrete(expand=c(0,0),
                            breaks=seq(2007, 2015, length.out=5),
                            labels=c("2007", "", "2011", "", "2015"),
                            drop=FALSE)
gg <- gg + scale_y_continuous(expand=c(0,0), labels=comma, limits=c(0,1400))
gg <- gg + labs(x=NULL, y=NULL,
                title="US Department of Housing & Urban Development (HUD) Total (Estimated) Homeless Population",
                subtitle="Counts aggregated from HUD Communities of Care Regional Surveys (normalized per 100K population)",
                caption="Data from: https://www.hudexchange.info/resource/4832/2015-ahar-part-1-pit-estimates-of-homelessness/")
gg <- gg + facet_wrap(~name, scales="free", ncol=6)
gg <- gg + theme_hrbrmstr_an(grid="Y", axis="", strip_text_size=9)
gg <- gg + theme(axis.text.x=element_text(size=8))
gg <- gg + theme(axis.text.y=element_text(size=7))
gg <- gg + theme(panel.margin=unit(c(10, 10), "pt"))
gg <- gg + theme(panel.background=element_rect(color="#97cbdc44", fill="#97cbdc44"))
gg <- gg + theme(plot.margin=margin(10, 20, 10, 15))
gg

[Figure: HUD estimated homeless per 100K population, by state, 2007–2015]

I used one of HUD’s alternate, official color palette colors for the panel backgrounds.

Remember, this challenge is language/tool-agnostic. Go in with a good question or two, augment the data as you feel you need to, and show us your vis!

Week 2’s contest closes 2016-04-12 23:59 EDT

Contest GitHub Repo: [https://github.com/52vis/2016-14](https://github.com/52vis/2016-14)

The [`iptools` package](https://github.com/hrbrmstr/iptools)—a toolkit for manipulating, validating and testing IP addresses and ranges, along with datasets relating to IP addresses—is flying through the internets and hitting a CRAN mirror near you, soon.

### What’s fixed?

[Tim Smith](https://github.com/tdsmith) fixed [a bug](https://github.com/hrbrmstr/iptools/issues/26) in `ip_in_range()` that occurred when the netmask was `/32` (thanks, Tim!).
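A quick, illustrative check of the fixed behavior (the addresses here are made up; with the fix, a host address tested against its own `/32` block should come back `TRUE`):

library(iptools)
 
# a single-host CIDR should contain exactly its own address
ip_in_range("172.18.0.5", "172.18.0.5/32")  # expect TRUE with the fix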

### What’s new?

The `range_boundaries()` function now returns the three new fields that are pretty obvious once you see it in action:

range_boundaries("172.18.0.0/28")
##   minimum_ip  maximum_ip min_numeric max_numeric         range
## 1 172.18.0.0 172.18.0.15  2886860800  2886860815 172.18.0.0/28

They are tacked on the end, so if you were using positional or named columns previously, you’re still good to go.

We’ve added a new `country_ranges()` function to return all “assigned” CIDR blocks in a country. You just give it a character vector of one or more ISO 3166-1 alpha-2 codes and you get back the CIDRs:

country_ranges("TO")
## $TO
## [1] "43.255.148.0/22"  "103.239.160.0/22" "103.242.126.0/23" "103.245.160.0/22" "175.176.144.0/21" "202.43.8.0/21"   
## [7] "202.134.24.0/21"

This data is updated daily and there’s some session caching built into the function to speed up subsequent calls if you forgot to save the output. You can flush the session cache with `flush_country_cidrs()` and query it with `cached_country_cidrs()`.
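A small sketch of working with that cache:

ranges <- country_ranges(c("TO", "LI"))  # the first call for a country hits the network
cached_country_cidrs()                   # see which countries are cached this session
flush_country_cidrs()                    # clear the cache to force a fresh download next time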

### What’s next?

We’re waiting until the R 3.3.0 Windows toolchain is stable to add in MaxMind ASN lookups. If there are any IP-related functions you need added, drop us an [issue](https://github.com/hrbrmstr/iptools/issues). We’re at nearly 1,700 downloads from the RStudio mirror, which (IMO) is kinda cool for such a niche package. Many thanks to all our users and one more thank you to Dirk for the `AsioHeaders` package.

### Fin

If you want some “bad” IP addresses to play around with in `iptools`, check out the [`blocklist`](https://github.com/hrbrmstr/blocklist) package, which provides an interface to a subset of the [blocklist.de](http://www.blocklist.de/en/index.html) API.

>UPDATE: Deadline is now 2016-04-05 23:59 EDT; next vis challenge is 2016-04-06!

Per a suggestion, I’m going to try to find a neat data set (prbly one from @jsvine) to feature each week, toss up some sample code (99% of the time prbly in R) and offer up a vis challenge. Just reply in the comments with a link to a gist/repo/rpub/blog/etc post containing the code & vis with a brief explanation (or post directly, though inserting code requires some markup that you can ping me about). I’ll gather up everything into a new github organization I made for this contest. You can also submit a PR right to [this week’s repo](https://github.com/52vis/2016-13).

Winners get a free digital copy of [Data-Driven Security](http://amzn.to/ddsec), and if you win more than once I’ll come up with other stuff to give away (either an Amazon gift card, a book or something Captain America related).

Submissions should include a story/angle/question you were trying to answer, any notes or “gotchas” that the code/comments don’t explain and a [beautiful] vis. You can use whatever language or tool (even Excel or _ugh_ Tableau), but you’ll have to describe what you did step-by-step for the GUI tools or record a video, since the main point of this contest is to help folks learn about asking questions, munging data and making visualizations. Excel & Tableau lock that knowledge in and Tableau even locks that data in.

### Droning on and on

Today’s data source comes from this week’s Data Is Plural newsletter and is all about drones. @jsvine linked to the [main FAA site](http://www.faa.gov/uas/law_enforcement/uas_sighting_reports/) for drone sightings and there’s enough ways to slice the data that it should make for some interesting story angles.

I will remove one of those angles with a simple bar chart of unmanned aircraft (UAS) sightings by week, using an FAA site color for the bars. I wanted to see if there were any overt visual patterns in the time of year or if the registration requirement at the end of 2015 caused any changes (I didn’t crunch the numbers to see if there were any actual patterns that could be found statistically, but that’s something y’all can do). I am curious as to what caused the “spike” in August/September 2015; the report text may have that data.

I’ve put this week’s example code & data into the [52 vis repo](https://github.com/52vis/2016-13).

library(ggplot2)
library(ggalt)
library(ggthemes)
library(readxl)
library(dplyr)
library(hrbrmisc)
library(grid)
 
# get copies of the data locally
 
URL1 <- "http://www.faa.gov/uas/media/UAS_Sightings_report_21Aug-31Jan.xlsx"
URL2 <- "http://www.faa.gov/uas/media/UASEventsNov2014-Aug2015.xls"
 
fil1 <- basename(URL1)
fil2 <- basename(URL2)
 
if (!file.exists(fil1)) download.file(URL1, fil1)
if (!file.exists(fil2)) download.file(URL2, fil2)
 
# read it in
 
xl1 <- read_excel(fil1)
xl2 <- read_excel(fil2)
 
# munge it a bit so we can play with it by various calendrical options
 
drones <- setNames(bind_rows(xl2[,1:3],
                             xl1[,c(1,3,4)]), 
                   c("ts", "city", "state"))
drones <- mutate(drones, 
                 year=format(ts, "%Y"), 
                 year_mon=format(ts, "%Y%m"), 
                 ymd=as.Date(ts), 
                 yw=format(ts, "%Y%V"))
 
# let's see them by week
by_week <- mutate(count(drones, yw), wk=as.Date(sprintf("%s1", yw), "%Y%U%u")-7)
 
# this looks like bad data but I didn't investigate it too much
by_week <- arrange(filter(by_week, wk>=as.Date("2014-11-10")), wk)
 
# plot
 
gg <- ggplot(by_week, aes(wk, n))
gg <- gg + geom_bar(stat="identity", fill="#937206")
gg <- gg + annotate("text", by_week$wk[1], 49, label="# reports", 
                    hjust=0, vjust=1, family="Cabin-Italic", size=3)
gg <- gg + scale_x_date(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + labs(y=NULL,
                title="Weekly U.S. UAS (drone) sightings",
                subtitle="As reported to the Federal Aviation Administration",
                caption="Data from: http://www.faa.gov/uas/law_enforcement/uas_sighting_reports/")
gg <- gg + theme_hrbrmstr(grid="Y", axis="X")
gg <- gg + theme(axis.title.x=element_text(margin=margin(t=-6)))
gg

[Figure: weekly U.S. UAS (drone) sightings, bar chart]

### Fin

I’ll still keep up a weekly vis from the Data Is Plural weekly collection even if this whole contest thing doesn’t take root with folks. You can never have too many examples for budding data folks to review.

Folks who’ve been tracking this blog on R-bloggers probably remember [this post](https://rud.is/b/2014/11/16/moving-the-earth-well-alaska-hawaii-with-r/) where I showed how to create a composite U.S. map with an Albers projection (which is commonly referred to as AlbersUSA these days thanks to D3).

I’m not sure why I didn’t think of this earlier, but you don’t _need_ to do those geographical machinations every time you want a prettier & more inclusive map (Alaska & Hawaii have been states for a while, so perhaps we should make more of an effort to include them in both data sets and maps). After doing the map transformations, the composite shape can be saved out to a shapefile, preferably GeoJSON since (a) you can use `geojsonio::geojson_write()` to save it and (b) it’s a single file vs a ZIP/directory.
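A minimal sketch of that save step (the `usa` object name here is hypothetical; it stands in for the composite SpatialPolygonsDataFrame built via the machinations in that earlier post):

library(geojsonio)
 
# assuming `usa` holds the Albers composite SpatialPolygonsDataFrame
geojson_write(usa, file="composite_us_states.geojson")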

I did just that and saved both state and county maps out with FIPS codes and other useful data slot bits, and created a small data package, [`albersusa`](https://github.com/hrbrmstr/albersusa), with some helper functions. It’s not on CRAN yet, so you need to `devtools::install_github("hrbrmstr/albersusa")` to use it. The github repo has some basic examples; here’s a slightly more complex one.

### Mapping Obesity

I grabbed an [obesity data set](http://www.cdc.gov/diabetes/data/county.html) from the CDC and put together a compact example for how to make a composite U.S. county choropleth to show obesity rates per county (for 2012, which is the most recent data). I read in the Excel file, pull out the county FIPS code and 2012 obesity rate, then build the choropleth. It’s not a whole lot of code, but that’s one main reason for the package!

library(readxl)
library(rgeos)
library(maptools)
library(ggplot2)   # devtools::install_github("hadley/ggplot2") only if you want subtitles/captions
library(ggalt)
library(ggthemes)
library(albersusa) # devtools::install_github("hrbrmstr/albersusa")
library(viridis)
library(scales)
 
# get the data and be nice to the server and keep a copy of the data for offline use
 
URL <- "http://www.cdc.gov/diabetes/atlas/countydata/OBPREV/OB_PREV_ALL_STATES.xlsx"
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)
 
# it's not a horrible Excel file, but we do need to hunt for the data
# and clean it up a bit. we just need FIPS & 2012 percent info
 
wrkbk <- read_excel(fil)
obesity_2012 <- setNames(wrkbk[-1, c(2, 61)], c("fips", "pct"))
obesity_2012$pct <- as.numeric(obesity_2012$pct) / 100
 
# I may make a version of this that returns a fortified data.frame but
# for now, we just need to read the built-in saved shapefile and turn it
# into something ggplot2 can handle
 
cmap <- fortify(counties_composite(), region="fips")
 
# and this is all it takes to make the map below
 
gg <- ggplot()
gg <- gg + geom_map(data=cmap, map=cmap,
                    aes(x=long, y=lat, map_id=id),
                    color="#2b2b2b", size=0.05, fill=NA)
gg <- gg + geom_map(data=obesity_2012, map=cmap,
                    aes(fill=pct, map_id=fips),
                    color="#2b2b2b", size=0.05)
gg <- gg + scale_fill_viridis(name="Obesity", labels=percent)
gg <- gg + coord_proj(us_laea_proj)
gg <- gg + labs(title="U.S. Obesity Rate by County (2012)",
                subtitle="Content source: Centers for Disease Control and Prevention",
           caption="Data from http://www.cdc.gov/diabetes/atlas/countydata/County_ListofIndicators.html")
gg <- gg + theme_map(base_family="Arial Narrow")
gg <- gg + theme(legend.position=c(0.8, 0.25))
gg <- gg + theme(plot.title=element_text(face="bold", size=14, margin=margin(b=6)))
gg <- gg + theme(plot.subtitle=element_text(size=10, margin=margin(b=-14)))
gg

[Figure: U.S. obesity rate by county (2012), composite Albers choropleth]

### Fin

Note that some cartographers think of this particular map view the way I look at a pie chart, but it’s a compact & convenient way to keep the states/counties together and will make it easier to include Alaska & Hawaii in your cartographic visualizations.

The composite GeoJSON files are in:

– `system.file("extdata/composite_us_states.geojson.gz", package="albersusa")`
– `system.file("extdata/composite_us_counties.geojson.gz", package="albersusa")`

if you want to use them in another program/context.
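For example, here’s a rough sketch of pulling the bundled county file back in with `geojsonio` (decompressing to a temp file first is just one way to deal with the gzip):

gz <- system.file("extdata/composite_us_counties.geojson.gz", package="albersusa")
tmp <- tempfile(fileext=".geojson")
writeLines(readLines(gzfile(gz)), tmp)  # decompress to a temporary .geojson
counties <- geojsonio::geojson_read(tmp, what="sp")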

Drop an issue [on github](https://github.com/hrbrmstr/albersusa) if you want any more default fields in the data slot and if you “need” territories (I’d rather have a PR for the latter tho :-).

As I said, I’m kinda obsessed with the “nuclear” data set. So much so that I made a D3 version that’s similar to the R version I made the other day. I tried not to code much today (too much Easter fun going on), so I left off the size & color legends, but it drops the bombs and fills the radiation-meter bars as each year progresses.

Working in raw D3 reminds me how nice it is to work in R. Much cruft is abstracted away and the “API” in R (at least list ops and ggplot2) seems more consistent, or at least less verbose.

One big annoyance (besides tweaking projection settings) is that I had to set:

script-src 'self' 'unsafe-eval';

on the [content security policy](http://content-security-policy.com/) since D3 uses “eval” a bit.

It’s embedded below and you can bust the frame via [this link](/projects/nucleard3/index.html). I made no effort to have it fit on tiny screens (it just requires fidgeting with the heights/widths of things), but it’s pretty complete code (including browser/touch icons, content security policy & open graph tags) and it’s annotated to help R folks map between ggplot2 and D3. If you build anything on it, drop a note in the comments.



> UPDATE: For folks interested in what this might look like in D3, [take a gander](https://rud.is/b/2016/03/27/nuclear-animations-in-d3/)

@jsvine (Data Editor at BuzzFeed) cleaned up and posted a [data set of historical nuclear explosions](https://github.com/data-is-plural/nuclear-explosions) earlier this week. I used it to show a few plotting examples in class, but it’s also a really fun data set to play around with: categorical countries; time series; lat/long pairs; and, of course, nuclear explosions!

My previous machinations with the data set were all static, so I thought it’d be fun to make a WarGames-esque world map animation to show who boomed where and how many booms each ended up making. The code is below (after the video). It’s commented and there are some ggplot2 tricks in there that those new to R might be interested in.

As I say in the code, I ended up tweaking `ffmpeg` parameters to get this video working on Twitter (the video file ended up being much, much smaller than the animated gif I originally toyed with making). Using `convert` from ImageMagick will quickly make a nice, local animated gif, or you can explore how to incorporate the ggplot2 frames into the `animation` package.

library(purrr)
library(dplyr)
library(tidyr)
library(sp)
library(maptools)
library(maps)
library(grid)
library(scales)
library(ggplot2)   # devtools::install_github("hadley/ggplot2")
library(ggthemes)
library(gridExtra)
library(ggalt)

# read and munge the data, being kind to github's servers
URL <- "https://raw.githubusercontent.com/data-is-plural/nuclear-explosions/master/data/sipri-report-explosions.csv"
fil <- basename(URL)
if (!file.exists(fil)) download.file(URL, fil)

read.csv(fil, stringsAsFactors=FALSE) %>%
  tbl_df() %>%
  mutate(date=as.Date(as.character(date_long), format="%Y%m%d"),
         year=as.character(year),
         yr=as.Date(sprintf("%s-01-01", year)),
         country=sub("^PAKIST$", "PAKISTAN", country)) -> dat

# doing this so we can order things by most irresponsible country to least
count(dat, country) %>%
  arrange(desc(n)) %>%
  mutate(country=factor(country, levels=unique(country))) -> booms

# Intercourse Antarctica
world_map <- filter(map_data("world"), region!="Antarctica")

scary <- "#2b2b2b" # a.k.a "slate black"
light <- "#bfbfbf" # a.k.a "off white"

proj <- "+proj=kav7" # Winkel-Tripel is *so* 2015

# In the original code I was using to play around with various theming
# and display options, this encapsulated theme_ function really helped a lot
# but to do the grid.arrange with the bars it only ended up saving a teensy
# bit of typing.
#
# Also, Tungsten is ridiculously expensive but I have access via corporate
# subscriptions, so I'd suggest going with Arial Narrow vs draining your
# bank account since I really like the detailed kerning pairs but also think
# that it's just a tad too narrow. It seemed fitting for this vis, tho.

theme_scary_world_map <- function(scary="#2b2b2b", light="#8f8f8f") {
  theme_map() +
    theme(text=element_text(family="Tungsten-Light"),
          title=element_text(family="Tungsten-Semibold"),
          plot.background=element_rect(fill=scary, color=scary),
          panel.background=element_rect(fill=scary, color=scary),
          legend.background=element_rect(fill=scary),
          legend.key=element_rect(fill=scary, color=scary),
          legend.text=element_text(color=light, size=10),
          legend.title=element_text(color=light),
          axis.title=element_text(color=light),
          axis.title.x=element_text(color=light, family="Tungsten-Book", size=14),
          axis.title.y=element_blank(),
          plot.title=element_text(color=light, face="bold", size=16),
          plot.subtitle=element_text(color=light, family="Tungsten-Light",
                                     size=13, margin=margin(b=14)),
          plot.caption=element_text(color=light, family="Tungsten-ExtraLight",
                                    size=9, margin=margin(t=10)),
          plot.margin=margin(0, 0, 0, 0),
          legend.position="bottom")
}

# I wanted to see booms by unique coords
count(dat, year, country, latitude, longitude) %>%
  ungroup() %>%
  mutate(country=factor(country, levels=unique(booms$country))) -> dat_agg

years <- as.character(seq(1945, 1998, 1))

# place to hold the pngs
dir.create("booms", FALSE)

# I ended up lovingly hand-crafting ffmpeg parameters to get the animation to
# work with Twitter's posting guidelines. A plain old ImageMagick "convert"
# from multiple png's to animated gif will work fine for a local viewing

til <- length(years)
pb <- progress_estimated(til)
suppressWarnings(walk(1:til, function(i) {

  pb$tick()$print()
  
  # data for map
  tmp_dat <- filter(dat_agg, year<=years[i])

  # data for bars
  count(tmp_dat, country, wt=n) %>%
    arrange(desc(nn)) %>%
    mutate(country=factor(country, levels=unique(country))) %>%
    complete(country, fill=list(nn=0)) -> boom2 # this gets us all the countries on the barplot x-axis even if they had no booms yet

  gg <- ggplot()
  gg <- gg + geom_map(data=world_map, map=world_map,
                      aes(x=long, y=lat, map_id=region),
                      color=light, size=0.1, fill=scary)
  gg <- gg + geom_point(data=tmp_dat,
                        aes(x=longitude, y=latitude, size=n, color=country),
                        shape=21, stroke=0.3)

  # the "trick" here is to force the # of labeled breaks so ggplot2 doesn't
  # truncate the range on us (it's nice that way and that feature is usually helpful)
  gg <- gg + scale_radius(name="", range=c(2, 8), limits=c(1, 50),
                          breaks=c(5, 10, 25, 50),
                          labels=c("1-4", "5-9", "10-24", "25-50"))

  gg <- gg + scale_color_brewer(name="", palette="Set1", drop=FALSE)
  gg <- gg + coord_proj(proj)
  gg <- gg + labs(x=years[i], y=NULL, title="Nuclear Explosions, 1945–1998",
                  subtitle="Stockholm International Peace Research Institute (SIPRI) and Sweden's Defence Research Establishment",
                  caption=NULL)

  # order doesn't actually work but it will after I get a PR into ggplot2
  # the tweaks here let us make the legends look like we want vs just mapped
  # to the aesthetics
  gg <- gg + guides(size=guide_legend(override.aes=list(color=light, stroke=0.5)),
                    color=guide_legend(override.aes=list(alpha=1, shape=16, size=3), nrow=1))

  gg <- gg + theme_scary_world_map(scary, light)
  gg <- gg + theme(plot.margin=margin(t=6, b=-1.5, l=4, r=4))
  gg_map <- gg
  

  gg <- ggplot(boom2, aes(x=country, y=nn))
  gg <- gg + geom_bar(stat="identity", aes(fill=country), width=0.5, color=light, size=0.05)
  gg <- gg + scale_x_discrete(expand=c(0,0))
  gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 1100))
  gg <- gg + scale_fill_brewer(name="", palette="Set1", drop=FALSE)
  gg <- gg + labs(x=NULL, y=NULL, title=NULL, subtitle=NULL,
                  caption="Data from https://github.com/data-is-plural/nuclear-explosions")
  gg <- gg + theme_scary_world_map(scary, light)
  gg <- gg + theme(axis.text=element_text(color=light))
  gg <- gg + theme(axis.text.x=element_text(color=light, size=11, margin=margin(t=2)))
  gg <- gg + theme(axis.text.y=element_text(color=light, size=6, margin=margin(r=5)))
  gg <- gg + theme(axis.title.x=element_blank())
  gg <- gg + theme(plot.margin=margin(l=20, r=20, t=-1.5, b=5))
  gg <- gg + theme(panel.grid=element_line(color=light, size=0.15))
  gg <- gg + theme(panel.margin=margin(0, 0, 0, 0))
  gg <- gg + theme(panel.grid.major.x=element_blank())
  gg <- gg + theme(panel.grid.major.y=element_line(color=light, size=0.05))
  gg <- gg + theme(panel.grid.minor=element_blank())
  gg <- gg + theme(panel.grid.minor.x=element_blank())
  gg <- gg + theme(panel.grid.minor.y=element_blank())
  gg <- gg + theme(axis.line=element_line(color=light, size=0.1))
  gg <- gg + theme(axis.line.x=element_line(color=light, size=0.1))
  gg <- gg + theme(axis.line.y=element_blank())
  gg <- gg + theme(legend.position="none")
  gg_bars <- gg

  # dimensions arrived at via trial and error

  png(sprintf("./booms/frame_%03d.png", i), width=639.5*2, height=544*2, res=144, bg=scary)
  grid.arrange(gg_map, gg_bars, ncol=1, heights=c(0.85, 0.15), padding=unit(0, "null"), clip="on")
  dev.off()

}))
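If you just want the quick, local animated gif mentioned above rather than a Twitter-friendly video, something along these lines (ImageMagick required; the delay/loop values are illustrative, not the parameters I actually used) will stitch the frames together:

# stitch the per-year frames into a looping gif with ImageMagick's convert
system("convert -delay 25 -loop 0 booms/frame_*.png booms.gif")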

>UPDATE: The dev version of `ggalt` has a new `geom_stateface()` which encapsulates the tedious bits and also includes the TTF font.

I’m a huge fan of ProPublica. They have a super-savvy tech team, great reporters, awesome graphics folks and excel at data-driven journalism. Plus, they give away virtually everything, including data, text, graphics & tools.

I was reading @USATODAY’s piece on lead levels in drinking water across America and saw they had a mini-interactive piece included with their findings. A quick view in Developer Tools revealed the JSON and an equally quick use of curlconverter had the data behind the interactive loaded into R in seconds (though it turns out just the straight URL to the JSON file works without the extra cURL parameters).

I wanted to grab the data since, while I feel bad about Texas having 183 testing exceedances, I wanted to see if there’s a worse story if you normalize by population. I performed such a normalization by state (normalizing to per-100K population), but the data includes county information, so a logical & necessary next step is to do that calculation as well (which is a task left for another day since I think it’ll require some more munging, as they used county names vs FIPS codes in the JSON and I suspect the names aren’t perfect). If any reader jumps in to do the county analysis before I do, drop a note here and I’ll update the post with a link to it. Here’s the outcome (top 10 worst states, normalized by population):

[Figure: top 10 states by lead-testing exceedances per 100K population, bar chart with stateface glyphs]

Even at the county-level, the normalization isn’t perfect since these are testing levels at specific locations per water provider in a county. But, I made an assumption (it may be wrong, I don’t do macro-level water quality analysis for a living) that 94 exceedances in a small (both in size & population) state like mine (Maine) is probably worse than 183 exceedances across a population the size of Texas. I think I’d need to take population density per county area into account to fully address this, but this is a first stab at the data and this post was written to talk about how to use ProPublica’s stateface font in R vs do investigative data journalism :-)

### What’s a ‘stateface’?

The fine folks at ProPublica took the Natural Earth shapefiles and made an icon font out of them. They have full code (with some super spiffy, clean, readable C code you can reuse) in their github repo, but they also provide just the font. However, it’s in OTF format and R kinda needs it in TTF format for widest graphics device support, so I converted it for you (& don’t get me started about fonts in R). You’ll need that font loaded to run the example. Note that ProPublica is still the license owner of that converted font (both the OTF & TTF are free to use, but give credit where it’s due…they did all the hard work). Aannnd it looks like they already had it created (I had only looked at the zip file).

ProPublica provides many support files to use these on the web, since that’s their target environment. They do provide key mappings for the font, which makes it usable in virtually any context fonts are supported. We’ll take advantage of those mappings in a bit, but you should check out their link above, they have some neat examples using these shapes as “sparkmaps” (i.e. inline in a sentence).

### Using ‘stateface’ in R

After loading the font into your system, it’s super-easy to use it in R. The font family name is “StateFace-Regular” and this is the translation table from 2-digit state abbreviation to glyph character:

state_trans <- c(AL='B', AK='A', AZ='D', AR='C', CA='E', CO='F', CT='G', 
                 DE='H', DC='y', FL='I', GA='J', HI='K', ID='M', IL='N', 
                 IN='O', IA='L', KS='P', KY='Q', LA='R', ME='U', MD='T',
                 MA='S', MI='V', MN='W', MS='Y', MO='X', MT='Z', NE='c',
                 NV='g', NH='d', NJ='e', NM='f', NY='h', NC='a', ND='b', 
                 OH='i', OK='j', OR='k', PA='l', RI='m', SC='n', SD='o',
                 TN='p', TX='q', UT='r', VT='t', VA='s', WA='u', WV='w', 
                 WI='v', WY='x', US='z')

You now only need to do:

state_trans["ME"]
##  ME 
## "U"

to get the mappings.
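If the TTF isn’t visible to R’s graphics devices yet, one way to register it is via the `extrafont` package (a sketch; the import only needs to happen once after installing the font on your system):

library(extrafont)
 
font_import(pattern="StateFace", prompt=FALSE)  # one-time import of the installed TTF
loadfonts()                                     # register imported fonts (pdf device by default)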

This is the (annotated) code to generate the bar chart above:

library(dplyr)    # munging
library(ggplot2)  # plotting, req: devtools::install_github("hadley/ggplot2")
library(scales)   # plotting helpers
library(hrbrmisc) # my themes

# read in exceedance state/county data
dat <- read.csv("http://rud.is/dl/h2olead.csv", stringsAsFactors=FALSE)

# this is how USA TODAY's computation works, I'm just following it
xcd <- count(distinct(dat, state, county, name), state, wt=exceedances)

# get U.S. state population estimates for 2015
us_pop <- setNames(read.csv("http://www.census.gov/popest/data/state/totals/2015/tables/NST-EST2015-01.csv",
                            skip=9, nrows=51, stringsAsFactors=FALSE, header=FALSE)[,c(1,9)],
                   c("state_name", "pop"))
us_pop$state_name <- sub("^\\.", "", us_pop$state_name)
us_pop$pop <- as.numeric(gsub(",", "", us_pop$pop))

# join them to the exceedance data
state_tbl <- data_frame(state_name=state.name, state=tolower(state.abb))
us_pop <- left_join(us_pop, state_tbl)
xcd <- left_join(xcd, us_pop)

# compute the exceedance by 100k population & order the
# states by that so we get the right bar order in ggplot
xcd$per100k <- (xcd$n / xcd$pop) * 100000
xcd$state_name <- factor(xcd$state_name,
                         levels=arrange(xcd, per100k)$state_name)
xcd <- arrange(xcd, desc(state_name))

# get the top 10 worst exceedances
top_10 <- head(xcd, 10)

# map (heh) the stateface font glyph character to the state 2-letter code
top_10$st <- state_trans[toupper(top_10$state)]

gg <- ggplot(top_10, aes(x=state_name, y=per100k))
gg <- gg + geom_bar(stat="identity", width=0.75)

# here's what you need to do to place the stateface glyphs
gg <- gg + geom_text(aes(x=state_name, y=0.25, label=st),
                     family="StateFace-Regular", color="white",
                     size=5, hjust=0)

gg <- gg + geom_text(aes(x=state_name, y=per100k,
                         label=sprintf("%s total  ", comma(n))),
                     hjust=1, color="white", family="KerkisSans", size=3.5)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL,
                title="Lead in the water: A nationwide look; Top 10 impacted states",
                subtitle="Exceedance count adjusted per 100K population; total exceedance displayed",
                caption="Data from USA TODAY's compliation of EPA’s Safe Drinking Water Information System (SDWIS) database.")

# you'll need the Kerkis font loaded to use this theme
# http://myria.math.aegean.gr/kerkis/
gg <- gg + theme_hrbrmstr_kerkis(grid=FALSE)

# I need to fiddle with the theme settings so these line height tweaks
# aren't necessary in the future
gg <- gg + theme(plot.caption=element_text(lineheight=0.7))
gg <- gg + theme(plot.title=element_text(lineheight=0.7))

gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(panel.margin=margin(t=5, b=5, l=20, r=20, "pt"))
gg

### Why 'stateface'?

People love maps and the bars seemed, well, lonely :-) Seriously, though, this (IMO) provides a nice trade-off between full-on choropleths (which are usually not warranted) and bland, basic bars. The adornments here may help readers engage with the chart a bit more than they otherwise would.

### Fin

As I stated, this analysis probably needs to be at the county population level to have the most efficacy, and you'd need to come up with a weighted roll-up mechanism to rank the states properly. I still think this naive normalization is better than using the raw exceedance counts, but I'd love to have comments from folks who do this macro water-testing analysis for a living!

Full code is in this gist.

And, woo hoo!, Maine is #1 in blueberries, lobsters & insane governors and #3 in dangerous lead levels in public water systems.

So, I (unapologetically) did this to @Highcharts last week:

They did an awesome makeover (it’s interactive if you follow the link):

[Figure: Highcharts’ interactive treemap makeover of the framework-usage data]

And, I’m not kidding, it’s actually a really good treemap. Not too many hierarchies or discrete categories. But, it’s still hard for humans to compare things without the aid of the interaction (which is totally fair, the Highcharts folks do interaction well). I always try to find an alternative to treemaps, usually through trying to figure out the story to tell. I think there’s at least one story in the Highcharts data that we can uncover with a different visualization. Ironically, the visualization I’ve chosen is a stacked bar chart (I don’t generally like them, either). I’ll frame the story and then dissect the code.

[Figure: stacked bar chart of Highcharts usage by language and in-language framework]

We looked at the number of frameworks being used with Highcharts across web-oriented programming languages. Surprisingly, four of the six top languages—Java, PHP, Python & dotNet—show Highcharts being used without an associated framework, which highlights the flexible nature of Highcharts. There seems to be—unsurprisingly—only one player in town when it comes to Ruby: Ruby on Rails, and the high prevalence of AngularJS tracks with Angular’s apparent dominance in javascript land. INSERT_MARKETING_LANGUAGE_HERE

In real life, I’d add a DataTables interactive table with this to let folks explore a bit more.
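A minimal sketch of that, assuming the `DT` package is installed (`langs` is the data frame built in the next section):

library(DT)
 
datatable(langs, rownames=FALSE, options=list(pageLength=10))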

### Making this in R & ggplot2

Highcharts used a Google Sheet to hold the data for their treemap makeover. That means we can have some fun with it in R. So, the two main story points are:

  1. show how the languages, and in-language frameworks rank against each other
  2. show the dominant framework in each language

As demonstrated, I’ve chosen to use stacked bar charts since there are only six languages and it turns out there is a dominant category for each.

A design criterion I set was to use the main or alternate color for each language and use a gradient to segment each in-language framework. I chose the yellow alternate color for Python since it’s such a cowardly language and there was enough blue in the chart already. Java & Ruby are separated enough that their slightly different reds aren’t too bad/confusing (and neither language left me with much of an alternative). I picked a green from the Mozilla palette for JavaScript since they seem to dominate any Google search for JavaScript info.

Let’s get libraries out of the way. I’m using my personal theme since I really don’t feel like typing everything out. If you need me to, drop a note and I’ll see what I can do.

library(googlesheets) # get the data
library(dplyr)        # reshape the data
library(ggplot2)      # plot
library(hrbrmisc)     # theme
library(scales)       # plot helpers

First, we need the data, and that’s where @jennybryan’s excellent googlesheets package comes into play:

sheet <- gs_key("1wYm5waQmiYKGhtdofvXDS8SHdh72Mwcnygvf3bvFfoU")

langs <- gs_read(sheet)
langs <- langs[-(1:6), 2:4]

We need to be able to order the programming languages by # of frameworks and we need the colors defined:

tops <- count(langs, parent, wt=value)

parent_cols <- c(Java="#960000", PHP="#8892bf", Python="#ffdc51", 
                 JavaScript="#70ab2d", dotNet="#68217a", Ruby="#af1401")

To get bars and stacked segments sorted the right way, we need to add a helper column and arrange the overall data frame:


langs <- arrange(ungroup(mutate(group_by(langs, parent), rank=rank(value))), -rank)

Next, we need to assign colors per language and in-language framework, I do this by computing an ordered alpha value for each framework dependent on the number of frameworks in the language:

langs <- mutate(group_by(langs, parent), 
                color=alpha(parent_cols[parent[1]], seq(1, 0.3, length.out=n())))

Finally we need the actual languages in factor order for ggplot:


langs$parent <- factor(langs$parent, levels=arrange(tops, n)$parent)

We also need the dominant frameworks separated out so we can annotate them. Extra marks for ensuring they're readable (black vs white depending on the base color):

top_f <- slice(group_by(langs, parent), 1)
top_f$color <- c("white", "white", "#2b2b2b", "#2b2b2b", "white", "white")

With the data in the right format, the actual ggplot code isn't too cumbersome:

gg <- ggplot()

# stack the bars. the bars themselves will be ordered by the language factor and our
# computed rank will stack them in the right order. we'll use an identity fill for
# the mapped fill aesthetic

gg <- gg + geom_bar(data=langs, stat="identity", 
                    aes(x=parent, y=value, fill=color, order=rank),
                    color="white", size=0.15, width=0.65)

# text labels at the end of the bar means no need for any extra chart junk

gg <- gg + geom_text(data=tops, family="NoyhSlim-Medium",
                     aes(x=parent, y=n, label=n), 
                     hjust=-0.2, size=3)

# here's how we label the dominant framework

gg <- gg + geom_text(data=top_f, family="NoyhSlim-Medium",
                     aes(x=parent, y=value/2, label=id, color=color), 
                     hjust=0.5, size=3)

# we'll control our own panel breathing room, thanks anyway, ggplot2

gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), limits=c(0, 900))

# these tell ggplot to use the color we've specified vs map it to a scale

gg <- gg + scale_color_identity()
gg <- gg + scale_fill_identity()

# the rest doesn't need 'splainin

gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL,
                title="Popular web frameworks using Highcharts",
                subtitle="Total usage by language, including the most popular framework in-language",
                caption="Data graciously provided by Highcharts - http://jsfiddle.net/vidarbrekke/n6pd4jfo/")
gg <- gg + theme_hrbrmstr(grid=FALSE, axis="y")
gg <- gg + theme(legend.position="none")
gg <- gg + theme(axis.text.x=element_blank())
gg

If I wanted to kill more time, I'd've used the language logo vs the name in the axis.

### Fin

What story/stories can you glean from the data and how would you tell them? Drop a note in the comments with your creation(s)!

Complete, contiguous code is in this gist.

Note that stacked bars aren't always a replacement for treemaps and that treemaps do have valid uses. The important part is to choose the visualization that best supports the story you want to tell.