Category Archives: ggplot

A user of the {ggalt} package recently posted a question about how to add points to a geom_dumbbell() plot. For now, this is not something you can do with geom_dumbbell(), but with a bit of data wrangling you can do it in a pretty straightforward manner with just your data and ggplot2. The example below uses 3 values per category but it should scale to n values per category (though after a certain n you should reconsider the use of a dumbbell chart in favour of a more appropriate way to visualize the message you’re trying to convey).

Here’s the setup:

library(hrbrthemes)
library(tidyverse)

tibble(
  val1 = c(3, 2, 4),
  val2 = c(1, 4, 5),
  val3 = c(5, 8, 6),
  cat = factor(month.name[1:3], levels = rev(month.name[1:3]))
) -> xdf

Three values per category. The approach is pretty straightforward:

  • reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  • reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  • reshape the data frame & plot the points

I’ve put ^^ notes near each ggplot2 geom:

ggplot() +
  # reshape the data frame & get min value so you can draw an eye-tracking line (this is one geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      top_n(-1) %>% 
      slice(1) %>%
      ungroup(),
    aes(x = 0, xend = val, y = cat, yend = cat),
    linetype = "dotted", size = 0.5, color = "gray80"
  ) +
  # reshape the data frame & get min/max category values so you can draw the segment (this is another geom)
  geom_segment(
    data = gather(xdf, measure, val, -cat) %>% 
      group_by(cat) %>% 
      summarise(start = range(val)[1], end = range(val)[2]) %>% 
      ungroup(),
    aes(x = start, xend = end, y = cat, yend = cat),
    color = "gray80", size = 2
  ) +
  # reshape the data frame & plot the points
  geom_point(
    data = gather(xdf, measure, value, -cat),
    aes(value, cat, color = measure), 
    size = 4
  ) +
  # i just extended the scale a bit + put axis on top; choose aesthetics that work 
  # for you
  scale_x_comma(position = "top", limits = c(0, 10)) +
  scale_color_ipsum(name = "A real legend title") +
  labs(
    x = "Description of the value", y = NULL,
    title = "A good plot title"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

And, here’s the result:

I caught a re-tweet of this tweet by @harry_stevens:

Harry’s thread and Observable post are great on their own and both show the power and utility of Observable javascript notebooks.

However, the re-tweet (which I’m not posting because it’s daft) took a swipe at both Python & R. Now, I’m all for a good swipe at Python (mostly to ensure we never forget all those broken spacebars and tab keys that language has caused) but I’ll gladly defend it and R together when it comes to Getting Things Done, even on deadline.

Let’s walk through what one of us might have done had we been in the same scenario as Harry.

Mapping On A Deadline

So, we have to create a map of historical tornado frequency trends on deadline.

We emailed researchers and received three txt files. One is a set of latitudes, another longitudes, and the final one is the trend value. It’s gridded data.

Download that ZIP and pretend you got three files in email vs a nice ZIP and make a new RStudio project called “tornado” and put those three files in a local-to-the-project-root data/ directory. Let’s read them in and look at them:

library(hrbrthemes) # not 100% necessary but i like my ggplot2 theme(s) :-)
library(tidyverse)  # data wrangling & ggplot2

tibble(
  lat = scan(here::here("data/lats.txt")),
  lon = scan(here::here("data/lons.txt")),
  trend = scan(here::here("data/trends.txt"))
) -> tornado

You very likely never directly use the base::scan() function, but it’s handy here since we just have files of doubles with each value separated by whitespace.
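In case it’s unfamiliar, here’s a quick, minimal illustration of what scan() does with whitespace-separated doubles (the text= argument just stands in for a file):

scan(text = "1.5 2.5  3.5\n4.5")
## [1] 1.5 2.5 3.5 4.5

Now, let’s see what we have: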

tornado
## # A tibble: 30,000 x 3
##      lat   lon trend
##    <dbl> <dbl> <dbl>
##  1 0.897 -180.     0
##  2 0.897 -179.     0
##  3 0.897 -178.     0
##  4 0.897 -176.     0
##  5 0.897 -175.     0
##  6 0.897 -174.     0
##  7 0.897 -173.     0
##  8 0.897 -172.     0
##  9 0.897 -170.     0
## 10 0.897 -169.     0
## # … with 29,990 more rows

summary(tornado)
##      lat               lon                 trend           
## Min.   : 0.8973   Min.   :-179.99808   Min.   :-0.4733610  
## 1st Qu.:22.0063   1st Qu.: -90.00066   1st Qu.: 0.0000000  
## Median :43.1154   Median :  -0.00323   Median : 0.0000000  
## Mean   :43.1154   Mean   :  -0.00323   Mean   : 0.0002756  
## 3rd Qu.:64.2245   3rd Qu.:  89.99419   3rd Qu.: 0.0000000  
## Max.   :85.3335   Max.   : 179.99161   Max.   : 0.6314569  

#+ grid-overview
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

#+ trend-overview
ggplot(tornado, aes(trend)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.05))

Since we’re looking for trends (either direction) in just the United States, the latitude and longitude ranges will need to be shrunk down a bit (it does indeed look like globally gridded data), and we’ll be able to shrink the data set a bit more since we only want to look at strongly positive or negative trends.

We don’t really need modern R/ggplot2 mapping idioms for this project (i.e. the new {sf} ecosystem), so we’ll keep it “simple” (scare quotes since that’s a loaded term) and just use the built in maps and geom_map(). First, let’s get the U.S. states and extract their bounding boxes/limits:

maps::map("state", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>% 
  fortify() %>% 
  as_tibble() -> state_map

xlim <- range(state_map$long)
ylim <- range(state_map$lat)

NOTE: I tend not to use the handy ggplot2::map_data() function since it ends up clobbering purrr::map(), which I use heavily (though not in this post). I also try to use {sf} these days, so this tends to not be an issue anymore anyway.

Now, let’s focus in on the target area in the original paper and the Axios article:

filter(
  tornado,
  between(lon, -107, xlim[2]), between(lat, ylim[1], ylim[2]), # -107 gets us ~left-edge of TX
  ((trend < -0.07) | (trend > 0.07)) # approximates notebook selection range
) -> tornado

#+ grid-overview-2
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

Now we’re getting close to our final solution.

As stated in the Observable notebook and implied by the word “grid” these dots are centroids of grid rectangles. This means we really want boxes, not points. The article got all fancy but it’s not really necessary since we can use ggplot2::geom_tile() to get us said boxes:

#+ grid-overview-3
ggplot(tornado, aes(lon, lat)) +
  geom_tile(aes(fill = trend, color = trend))

Now, we just need to add in map layers, and tweak some aesthetics to make it look like a map. We’ll start naively:

#+ map-1
ggplot() +
  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +
  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = "black", size = 0.125, fill = NA
  )

Our gridded data is definitely covering the right/same areas so we just need to make this more suitable for an article. We’ll use Harry’s palette and layer in U.S. state borders, an overall country border, and approximate the title and legend aesthetics:

#+ map-final
c(
  "#023858", "#045a8d", "#0570b0", "#3690c0", "#74a9cf",
  "#a6bddb", "#d0d1e6", "#ece7f2", "#fff7fb", "#ffffff",
  "#ffffcc", "#ffeda0", "#fed976", "#feb24c", "#fd8d3c",
  "#fc4e2a", "#e31a1c", "#bd0026", "#800026"
) -> grad_cols # colors from article

ggplot() +

  # tile layer

  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +

  # state borders

  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = ft_cols$slate, size = 0.125, fill = NA
  ) +

  # usa border

  borders("usa", colour = "black", size = 0.5) +

  # color scales

  scale_colour_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +
  scale_fill_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +

  # make it Albers-ish and ensure we can fit the borders in 

  coord_map(
    projection = "polyconic",
    xlim = scales::expand_range(range(tornado$lon), add = 2),
    ylim = scales::expand_range(range(tornado$lat), add = 2)
  ) +

  # tweak legend aesthetics

  guides(
    colour = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    ),
    fill = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    )
  ) +
  labs(
    x = NULL, y = NULL
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "top") +
  theme(legend.title = element_text(size = 16, hjust = 0.5)) +
  theme(legend.key.width = unit(4, "lines")) +
  theme(legend.key.height = unit(0.5, "lines"))

FIN

I went through some extra steps for folks new to R but the overall approach was at the very least as expedient as the Observable one and — despite the claims by the quite daft retweet — this is no less “shareable” or “reusable” than the Observable notebook. You can clone the repo (https://git.sr.ht/~hrbrmstr/tornado) and reuse this work immediately.

If you take a stab at an alternate approach — especially if you do use {sf} — definitely blog about it and drop a link here or on Twitter.

Today’s RSS feeds picked up this article by Marianne Sullivan, Chris Sellers, Leif Fredrickson, and Sarah Lamdanon on the woeful state of enforcement actions by the U.S. Environmental Protection Agency (EPA). While there has definitely been overreach by the EPA in the past the vast majority of its regulatory corpus is quite sane and has made Americans safer and healthier as a result. What’s happened to an EPA left in the hands of evil (yep, “evil”) in the past two years is beyond lamentable and we likely have two more years of lamenting ahead of us (unless you actually like your water with a coal ash chaser).

The authors of the article made this chart to show the stark contrast between 2017 and 2018 when it comes to regulatory actions for eight acts:

  • Clean Air Act (CAA)
  • Clean Water Act (CWA)
  • Emergency Planning and Community Right to Know Act (EPCRA)
  • Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)
  • Resource Conservation and Recovery Act (RCRA)
  • Safe Drinking Water Act (SDWA)
  • Toxic Substances Control Act (TSCA)
  • Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)

They made this arrow chart (via Datawrapper):

For some reason, that chart sparked a “I really need to make that in R” moment, and thus begat this post.

I’ve got a geom for dumbbell charts but that’s not going to work for this arrow chart since I really wanted to (mostly) reproduce it the way it was. Here’s my go at it.

Data First

Datawrapper embeds have a handy “Get the data” link in them but it’s not a link to a file. It’s a javascript-generated data: href, so you either need to click on the link and download it or be hard-headed like I am and go the way of pain and scrape it (reproducibility FTW). Let’s get packages and data gathering code out of the way. I’ll exposit a bit more about said data gathering after the code block:

library(stringi)
library(rvest)
library(hrbrthemes) # git[la|hu]b / hrbrmstr / hrbrthemes
library(tidyverse)

article <- read_html("https://theconversation.com/the-epa-has-backed-off-enforcement-under-trump-here-are-the-numbers-108640")

html_node(article, "iframe#psm7n") %>% # find the iframe
  html_attr("src") %>% # get iframe URL
  read_html() %>%  # read it in
  html_node(xpath=".//script[contains(., 'data: ')]") %>% # find the javascript section with the data
  html_text() %>% # get that section
  stri_split_lines() %>% # split into lines so we can target the actual data element
  unlist() %>% 
  keep(stri_detect_fixed, 'data: "Fiscal') %>% # just get the data line
  stri_trim_both() %>% # prep it for extraction
  stri_replace_first_fixed('data: "', "") %>% 
  stri_replace_last_fixed('"', "") %>% 
  stri_replace_all_fixed("\\n", "\n") %>% # make lines lines
  stri_split_lines() %>% 
  unlist() %>%
  stri_split_fixed("\\t") %>% # we now have a list of vectors
  map_dfc(~set_names(list(.x[2:length(.x)]), .x[1])) %>%  # first element of each vector is colname
  type_convert(col_types = "cddn") %>% # get real types
  set_names(c("act", "y2018", "y2017", "pct")) -> psm

psm
## # A tibble: 8 x 4
##   act    y2018 y2017   pct
##   <chr>  <dbl> <dbl> <dbl>
## 1 CAA      199   405   -51
## 2 CERCLA   147   194   -24
## 3 CWA      320   565   -43
## 4 EPCRA     56   107   -48
## 5 FIFRA    363   910   -60
## 6 RCRA     149   275   -46
## 7 SDWA     121   178   -32
## 8 TSCA      80   152   -47

Inside the main article URL content there’s an iframe load:

<p><iframe id="psm7n" class="tc-infographic-datawrapper" src="https://datawrapper.dwcdn.net/psm7n/2/" height="400px" width="100%" style="border: none" frameborder="0"></iframe></p>

We grab the contents of that iframe link (https://datawrapper.dwcdn.net/psm7n/2/), which has a data: line way down towards the bottom of one of the last javascript blocks.

That ugly line gets transformed into a link that will download as a normal CSV file, but we have to do the above wrangling on it before we can get it into a format we can work with.

Now, we can make the chart.

Chart Time!

Let’s get the Y axis in the right order:

psm %>%
  arrange(desc(y2017)) %>%
  mutate(act = factor(act, levels = rev(act))) -> psm

Next, we setup X axis breaks and also get the max value for some positioning calculations (so we don’t hardcode values):

# setup x axis breaks and max value for label position computation
x_breaks <- pretty(c(psm$y2018, psm$y2017))
max_val <- max(x_breaks)

I have two minor nitpicks about the original chart (and changes to them as a result). First, I really don’t like the Y axis gridlines but I do believe we need something to help the eye move horizontally and associate each label to its respective geom. Instead of gridlines I opt for a diminutive dotted line from 0 to the first (min) value.

The second nitpick is that — while the chart has the act information in the caption area — the caption is in alpha order vs the order the act acronyms appear in the data. If it was an alpha bullet list I might not complain, but I chose to modify the order to fit the chart, which we build dynamically with the help of this vector:

# act info for caption
c(
  "CAA" = "Clean Air Act (CAA)",
  "CWA" = "Clean Water Act (CWA)",
  "EPCRA" = "Emergency Planning and Community Right to Know Act (EPCRA)",
  "FIFRA" = "Federal Insecticide, Fungicide, and Rodenticide Act (FIFRA)",
  "RCRA" = "Resource Conservation and Recovery Act (RCRA)",
  "SDWA" = "Safe Drinking Water Act (SDWA)",
  "TSCA" = "Toxic Substances Control Act (TSCA)",
  "CERCLA" = "Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA)"
) -> acts

w125 <- scales::wrap_format(125) # help us word wrap at ~125 chars

# order the vector and turn it into wrapped lines
act_info <- w125(paste0(unname(acts[as.character(psm$act)]), collapse = "; "))

Now, we can generate the geoms. It looks like a lot of code, but I like to use newlines to help structure ggplot2 calls. I still miss my old gg <- gg + idiom but RStudio makes it way too easy to execute the whole expression with just the use of + so I’ve succumbed to their behaviour modification. To break it down w/o code, we essentially need:

  • the arrows for each act
  • the 2017 and 2018 direct label values for each act
  • the 2017 and 2018 top “titles”
  • segments for ^^
  • title, subtitle and caption(s)

We use percent-maths to position labels and other objects so the code can be re-used for other arrow plots (hardcoding to the data values is likely fine, but you’ll end up tweaking the numbers more and wasting ~2-5m per new chart).

ggplot(psm) +

  # dots from 0 to minval
  geom_segment(
    aes(0, act, xend = y2018, yend = act),
    linetype = "dotted", color = "#b2b2b2", size = 0.33
  ) +

  # minval label
  geom_label(
    aes(y2018, act, label = y2018),
    label.size = 0, hjust = 1, size = 3.5, family = font_rc
  ) +

  # maxval label
  geom_label(
    aes(y2017 + (0.0015 * y2017), act, label = y2017),
    label.size = 0, hjust = 0, size = 3.5, family = font_rc
  ) +

  # the measure line+arrow
  geom_segment(
    aes(y2018, act, xend = y2017, yend = act),
    color = "#4a90e2", size = 0.75, # I pulled the color value from the original chart
    arrow = arrow(ends = "first", length = unit(5, "pt"))
  ) +

  # top of chart year (min)
  geom_label(
    data = head(psm, 1),
    aes(y2018, 9, label = "2018"),
    hjust = 0, vjust = 1, label.size = 0, size = 3.75, family = font_rc, color = ft_cols$slate
  ) +

  # top of chart year (max)
  geom_label(
    data = head(psm, 1),
    aes(y2017, 9, label = "2017"),
    hjust = 1, vjust = 1, label.size = 0, size = 3.75, family = font_rc, color = ft_cols$slate
  ) +

  # bar from top of chart year label to first minval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2018 + (0.005 * max_val), 8.5, 
      xend = y2018 + (0.005 * max_val), yend = 8.25
    ), 
    size = 0.25
  ) +

  # bar from top of chart year label to first maxval measure
  geom_segment(
    data = head(psm, 1),
    aes(
      y2017 - (0.005 * max_val), 8.5, 
      xend = y2017 - (0.005 * max_val), yend = 8.25
    ), 
    size = 0.25
  ) +

  # fix x axis scale and place breaks
  scale_x_comma(limits = c(0, max_val), breaks = seq(0, max_val, 200)) +

  # make room for top "titles"
  scale_y_discrete(expand = c(0, 1)) +

  labs(
    y = NULL,
    title = "Decline by statute",
    subtitle = "The number of civil cases the EPA brought to conclusion has dropped across a number of federal statutes,\nincluding the Clean Air Act (CAA) and others.",
    x = act_info,
    caption = "Original Chart/Data: The Conversation, CC-BY-ND;<https://bit.ly/2VuJrOT>; Source: Environmental Data & Government Initiative <https://bit.ly/2VpcFyl>"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.x = element_text(color = ft_cols$slate)) +
  theme(axis.title.x = element_text(
    hjust = 0, size = 10, face = "italic", color = ft_cols$gray, margin = margin(t = 10)
  )) +
  theme(plot.caption = element_text(hjust = 0))

Here’s the result:

(it even looks ok in “batman” mode):

FIN

With Microsoft owning GitHub I’m not using gists anymore and the GitLab “snippets” equivalent is just too dog-slow to use, so starting in 2019 I’m self-hosting contiguous R example code used in the blog posts. For the moment, that means links to plain R files but I may just set up gitea for them sometime before the end of Q1. You can find a contiguous, commented version of the above code here.

If you do your own makeover don’t forget to drop a link to your creation(s) in the comments!

The CBC covered the recent (as of the original post-time on this blog entry) Quebec elections and used a well-crafted hex grid map to display results:

They have a great ‘splainer on why they use this type of map.

Thinking that it may be useful for others, I used a browser Developer Tools inspector to yank out the javascript-created SVG, wrangled out the hexes using svg2geojson, and put them into a GeoJSON file along with some metadata that I extracted from the minified javascript on the CBC’s site and turned into a data frame using the V8 package. Since most of the aforementioned work was mouse clicking and ~8 (disjointed) lines of accompanying super-basic R code, there’s not really much to show wrangling-wise[1], but I can show an example of using the new GeoJSON file in R and the sf package:

library(sf)
library(ggplot2)

# get the GeoJSON file from: https://gitlab.com/hrbrmstr/quebec-hex-ridings or https://github.com/hrbrmstr/quebec-hex-ridings
sf::st_read("quebec-ridings.geojson", quiet = TRUE, stringsAsFactors = FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill = regionname)) +
  coord_sf(datum = NA) +
  ggthemes::scale_fill_tableau(name = NULL, "Tableau 20") +
  ggthemes::theme_map() +
  theme(legend.position = "bottom")
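(As an aside, the V8 end of the data-frame extraction amounted to something like the following hypothetical sketch; the actual variable name in the CBC’s minified javascript differs from the made-up “ridings” here:)

library(V8)

ctx <- v8()                        # fresh javascript context
ctx$eval('var ridings = [{id: 1, regionname: "Example"}]') # stand-in for the scraped, minified js
ridings_df <- ctx$get("ridings")   # V8/jsonlite turn the js array of objects into a data frame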

And, with a little more ggplot2-tweaking and some magick, we can even put it in the CBC-styled border:

library(sf)
library(magick)
library(ggplot2)

plt <- image_graph(1488, 1191, bg = "white")
sf::st_read("quebec-ridings.geojson", quiet=TRUE, stringsAsFactors=FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill=regionname)) +
  coord_sf(datum=NA) +
  scale_x_continuous(expand=c(0,2)) +
  scale_y_continuous(expand=c(0,0)) +
  ggthemes::theme_map() +
  theme(plot.margin = margin(t=150)) +
  theme(legend.position = "none")
dev.off()

# get this bkgrnd img from the repo
image_composite(plt, image_read("imgs/background.png")) %>% 
  image_write("imgs/composite-map.png")

You can tweak the border color with magick as needed, and there’s a background2.png in the imgs directory in the repo that has the white inset, which you can further composite in if you like.

With a teensy bit of work you should be able to adjust the stroke color via aes() to separate things as the CBC did.
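Something along these lines should get you started (a sketch; the stroke here just follows the same region grouping as the fill):

sf::st_read("quebec-ridings.geojson", quiet = TRUE, stringsAsFactors = FALSE) %>% 
  ggplot() +
  geom_sf(aes(fill = regionname, color = regionname), size = 0.25) + # stroke mapped via aes()
  coord_sf(datum = NA) +
  ggthemes::theme_map() +
  theme(legend.position = "none")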

FIN

It’s important to re-state that the CBC made the original polygons for the hexes (well, they made a set of grid points and open source software turned it into a set of SVG paths) and the background images. All I did was some extra bit of wrangling and conversionating[2].

[1] I can toss a screencast if there’s sufficient interest.
[2] Totally not a word.

Nowadays (I’ve seen that word used so much in journal articles lately that I could not resist using it) I’m using world tile grids more frequently as the need arises to convey the state of exposure of various services at a global (country) scale. Given that necessity fosters invention it seemed that having a ggplot2 geom for world tile grids would make my life easier and also make work more efficient.

To that end, there’s a nascent ggplot2 extension package for making world tile grids called (uncreatively) worldtilegrid. It’s also at GitHub if you’re more comfortable working from the code there.

The README has examples but we’ll walk through another one here and work with life expectancy data curated by Our World in Data. Rather than draw out this post to a less tenable length with an explanation of how to pull the XHR JSON data into R, just grab the CSV linked with the visualization on that page and substitute the path in the code for where you stored it.

We do need to clean up this data a bit since it has some issues. Let’s do that and carve out some slices into two new data frames so we can work with the most recent curated year and some historical data:

library(hrbrthemes) # gitlab/github :: hrbrmstr/hrbrthemes (use devtools installers)
library(worldtilegrid) # gitlab/github :: hrbrmstr/worldtilegrid (use devtools installers)
library(tidyverse)

cols(
  Entity = col_character(),
  Code = col_character(),
  Year = col_integer(),
  `Life expectancy (Clio-Infra up to 1949; UN Population Division for 1950 to 2015)` = col_double()
) -> le_cols

read_csv("~/Downloads/life-expectancy.csv", col_types = le_cols) %>%
  set_names(c("country", "iso3c", "year", "life_expectancy")) -> lifexp

# clean it up a bit since it's not great data and bust it into groups
# that match the vis on the site vs use continuous raw data, esp since
# the data is really just estimates
filter(lifexp, !is.na(iso3c)) %>%
  filter(nchar(iso3c) == 3) %>%
  mutate(
    grp = cut(
      x = life_expectancy,
      breaks = c(10, 20, 30, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85),
      include.lowest = TRUE
    )
  ) -> lifexp

last_year <-  filter(lifexp, year == last(year))
a_few_years <- filter(lifexp, year %in% c(seq(1900, 2015, 20), 2015))

Now we can use it to make some cartograms.

Using the World Tile Grid geom

You can reference a previous post for what it would normally take to create these tile grids. The idea was to simplify the API down to the caller just needing to specify some country names/ISO[23]Cs (to a country aesthetic) and a value (to a fill aesthetic) and let the package do the rest. To that end, here's how to display the life expectancy data for 2015:

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") + # yep, you can sum up the entire blog post to this one line
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 2015",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number\nof years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to\nstay the same throughout its life",
    caption = "Data source: "

  ) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

You can add labels via geom_text() and use the wtg stat for it as it provides a number of computed variables you can work with:

  • x,y: the X,Y position of the tile
  • name: Country name (e.g. Afghanistan)
  • country.code: ISO2C country code abbreviation (e.g. AF)
  • iso_3166.2: Full ISO 3166 2-letter abbreviation code (e.g. ISO 3166-2:AF)
  • region: Region name (e.g. Asia)
  • sub.region: Sub-region name (e.g. Southern Asia)
  • region.code: Region code (e.g. 142)
  • sub.region.code: Sub-region code (e.g. 034)

Labeling should be a deliberate decision and used sparingly/with care.
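If you do decide you need labels, a sketch like this should work (assuming the stat is registered as "wtg" and its computed variables are accessible via stat()):

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  geom_text(
    aes(country = iso3c, label = stat(country.code)), # computed variable from the wtg stat
    stat = "wtg", size = 2, color = "white"
  ) +
  coord_equal() +
  theme_ft_rc(grid = "") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "bottom")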

Easier World Tile Grid Facets

Making faceted charts is also straightforward. Just use ggplot2's faceting subsystem:

ggplot(a_few_years) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 1900-2015 (selected years)",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number of years a newborn infant\nwould live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life",
    caption = "Data source: "
  ) +
  facet_wrap(~year) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

(You may want to open that one up in a separate tab/window.)

Animation might have been a better choice than facets (which is an exercise left to the reader).
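If you take up that exercise, a starting-point sketch with {gganimate} might look like this (untested against geom_wtg(), so treat it as a rough outline rather than a finished product):

library(gganimate)

ggplot(a_few_years) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  viridis::scale_fill_viridis(name = "Life Expectancy", discrete = TRUE, drop = FALSE) +
  coord_equal() +
  transition_states(year) + # animate across the selected years instead of faceting
  labs(title = "Life expectancy, {closest_state}")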

FIN

The package API is still in "rapidly changing" mode, but feel free to kick the tyres and file issues or PRs on your community coding platform of choice as needed.

As noted in the package, the world tile grid concept credit goes to Jon Schwabish and the CSV data used to create the package exposed wtg data frame is the work of Maarten Lambrechts.

After reading this interesting analysis of “How Often Are Americans’ Accounts Breached?” by Gaurav Sood (which we need more of in cyber-land) I gave in to the impulse to do some gg-doodling with the “Have I Been Pwned” JSON data he used.

It’s just some basic data manipulation with some heavy ggplot2 styling customization, so no real need for exposition beyond noting that there are many other ways to view the data. I just settled on centered segments early on and went from there. If you do a bit of gg-doodling yourself, drop a note in the comments with a link.

You can see a full-size version of the image via this link.

library(hrbrthemes) # use github or gitlab version
library(tidyverse)

# get the data

dat_url <- "https://raw.githubusercontent.com/themains/pwned/master/data/breaches.json"

jsonlite::fromJSON(dat_url) %>% 
  mutate(BreachDate = as.Date(BreachDate)) %>% 
  tbl_df() -> breaches

# selected breach labels df
group_by(breaches, year = lubridate::year(BreachDate)) %>% 
  top_n(1, wt=PwnCount) %>% 
  ungroup() %>% 
  filter(year %in% c(2008, 2015, 2016, 2017)) %>% # pick years where labels will fit nicely
  mutate(
    lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount/1000000))
  ) %>% 
  arrange(year) -> labs

# num of known breaches in that year for labels
count(breaches, year = lubridate::year(BreachDate)) %>% 
  mutate(nlab = sprintf("n=%s", n)) %>% 
  mutate(lab_x = as.Date(sprintf("%s-07-02", year))) -> year_cts

mutate(breaches, p_half = PwnCount/2) %>% # for centered segments
  ggplot() +
  geom_segment( # centered segments
    aes(BreachDate, p_half, xend=BreachDate, yend=-p_half), 
    color = ft_cols$yellow, size = 0.3
  ) +
  geom_text( # selected breach labels
    data = labs, aes(BreachDate, PwnCount/2, label = lab),
    lineheight = 0.875, size = 3.25, family = font_rc,
    hjust = c(0, 1, 1, 0), vjust = 1, nudge_x = c(25, -25, -25, 25),
    nudge_y = 0,  color = ft_cols$slate
  ) +
  geom_text( # top year labels
    data = year_cts, aes(lab_x, Inf, label = year), family = font_rc, 
    size = 4, vjust = 1, lineheight = 0.875, color = ft_cols$gray
  ) +
  geom_text( # bottom known breach count totals
    data = year_cts, aes(lab_x, -Inf, label = nlab, size = n), 
    vjust = 0, lineheight = 0.875, color = ft_cols$peach,
    family = font_rc, show.legend = FALSE
  ) +
  scale_x_date( # break on year
    name = NULL, date_breaks = "1 year", date_labels = "%Y"
  ) +
  scale_y_comma(name = NULL, limits = c(-450000000, 450000000)) + # make room for labels
  scale_size_continuous(range = c(3, 4.5)) + # tolerable font sizes 
  labs(
    title = "HIBP (Known) Breach Frequency & Size",
    subtitle = "Segment length is number of accounts; n=# known account breaches that year",
    caption = "Source: HIBP via "
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_blank())

IBM has a new set of corporate typefaces — dubbed “Plex” — and has released them with a generous open license. IBM Plex Sans is not too shabby:

(that image was grifted from a Font Squirrel preview page)

The digit glyphs are especially nice for charts and the font itself is just different enough from Helvetica that I felt it was worth including in hrbrthemes. I converted the fonts to TTF (for best compatibility with R ‘stuff’), added them to the package (so you can install them from there), cloned the other ipsum functions into _ps versions and tried to figure out the best fonts for various theme elements. Here’s the result so far:

(I really like those digits.)

I haven’t tried theme_ipsum_ps() on anything but macOS, so if you kick the tyres and encounter problems, please file an issue.

Flushing ticks’ text

Readers with keen eyes will see that the axis text labels have different hjust settings on the ends (vs the 0.5 in the inner ones). I don’t always set them that way as it is not always appropriate to do so (IMO). When I do, it’s usually been a “figure out the breaks and then hand-craft the theme() components” endeavour. Well, not anymore (at least in simple cases) thanks to the new flush_ticks().
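(For reference, the hand-crafted version of that endeavour tends to look like the following sketch; a vectorized hjust in element_text() isn’t officially supported by ggplot2, but it’s the trick this style relies on:)

# with known x breaks of 1:8, left-flush the first label,
# right-flush the last and leave the middle ones centered
theme(axis.text.x = element_text(hjust = c(0, rep(0.5, 6), 1)))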

It’s a dead-simple process but it requires knowing the breaks and you pretty much have to build the plot to do that, so the function takes your completed ggplot2 object, introspects it (post-temporary build) and sets the tick text justification in the style seen in the above image. The code below generates that image and demonstrates use of flush_ticks():

library(hrbrthemes)
library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_jitter(aes(color=class, fill=class), size=3, shape=21, alpha=1/2) +
  scale_x_continuous(expand=c(0,0), limits=c(1, 8), breaks=1:8) +
  scale_y_continuous(expand=c(0,0), limits=c(10, 50)) +
  scale_color_ipsum() +
  scale_fill_ipsum() +
  facet_wrap(~class, scales="free") +
  labs(
    title="IBM Plex Sans Test",
    subtitle="This is a subtitle to see the how it looks in IBM Plex Sans",
    caption="Source: hrbrthemes & IBM"
  ) +
  theme_ipsum_ps(grid="XY", axis="xy") +
  theme(legend.position="none") -> gg

flush_ticks(gg)

As indicated, this won’t presently work for free scales but it’s going to save me a few minutes for each plot I would have done the more manual way. But, releasing it into the wild will help catch edge cases I haven’t considered yet. I’ve added an option to flush_ticks() to generate theme(axis.text...) statements vs just return the plot, so you can include the theme() code directly vs have to rely on the function to display the plot. I may even make it an RStudio addin (though folks who would like to do that are encouraged to submit a PR).

I hope both new hrbrthemes features are useful to folks and they should be on CRAN in mid-December with an R markdown set of themes to go with the package.

As many folks know, I live in semi-rural Maine and we were hit pretty hard with a wind+rain storm Sunday to Monday. The hrbrmstr compound had no power (besides a generator) and no stable/high-bandwidth internet (Verizon LTE was heavily congested) since 0500 Monday and still does not as I write this post.

I’ve played with scraping power outage data from Central Maine Power but there’s a great Twitter account — PowerOutage_us — that has done much of the legwork for the entire country. They don’t cover everything and do not provide easily accessible historical data (likely b/c evil folks wld steal it w/o payment or credit) but they do have a site you can poke at and do provide updates via Twitter. As you’ve seen in a previous post, we can use the rtweet package to easily read Twitter data. And, the power outage tweets are regular enough to identify and parse. But raw data is so…raw.

While one could graph data just for one’s self, I decided to marry this power scraping capability with a recent idea I’ve been toying with adding to hrbrthemes or ggalt: gg_tweet(). Imagine being able to take a ggplot2 object and “plot” it to Twitter, fully conforming to Twitter’s stream or card image sizes. By conforming to these size constraints, they don’t get cropped in the timeline view (if you allow images to be previewed in-timeline). This is even more powerful if you have some helper functions for proper theme-ing (font sizes especially need to be tweaked). Enter gg_tweet().

Power Scraping

We’ll cover scraping @PowerOutage_us first, but we’ll start with all the packages we’ll need and a helper function to convert power outage estimates to numeric values:

library(httr)
library(magick)
library(rtweet)
library(stringi)
library(hrbrthemes)
library(tidyverse)

words_to_num <- function(x) {
  map_dbl(x, ~{
    val <- stri_match_first_regex(.x, "^([[:print:]]+) #PowerOutages")[,2]
    mul <- case_when(
      stri_detect_regex(val, "[Kk]") ~ 1000,
      stri_detect_regex(val, "[Mm]") ~ 1000000,
      TRUE ~ 1
    ) 
    val <- stri_replace_all_regex(val, "[^[:digit:]\\.]", "")
    as.numeric(val) * mul 
  })
}

Now, I can’t cover setting up rtweet OAuth here. The vignette and package web site do that well.

The bot tweets infrequently enough that this is really all we need (though, bump up n as you need to):

outage <- get_timeline("PowerOutage_us", n=300)

Yep, that gets the last 300 tweets from said account. It’s amazingly simple.

Now, the outage tweets for the east coast / northeast are not individually uniform but collectively they are (there’s a pattern that may change, but you can tweak this if it does):

filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>% 
  mutate(created_at = lubridate::with_tz(created_at, 'America/New_York')) %>% 
  mutate(number_out = words_to_num(text)) %>%  
  ggplot(aes(created_at, number_out)) +
  geom_segment(aes(xend=created_at, yend=0), size=5) +
  scale_x_datetime(date_labels = "%Y-%m-%d\n%H:%M", date_breaks="2 hours") +
  scale_y_comma(limits=c(0,2000000)) +
  labs(
    x=NULL, y="# Customers Without Power",
    title="Northeast Power Outages",
    subtitle="Yay! Twitter as a non-blather data source",
    caption="Data via: @PowerOutage_us <https://twitter.com/PowerOutage_us>"
  ) -> gg

That pipe chain looks for key hashtags (for my area), rejiggers the time zone, and calls the helper function to, say, convert 1.2+ Million to 1200000. Finally, it builds a mostly complete ggplot2 object (you should make the max Y limit more dynamic).
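One way to handle that last caveat (a sketch; ne_out is just an illustrative name) is to carve out the data frame first and derive the limit from it instead of hardcoding 2,000,000:

ne_out <- filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>% 
  mutate(number_out = words_to_num(text))

# then, in the plot, swap the hardcoded limits for:
# scale_y_comma(limits = c(0, 1.1 * max(ne_out$number_out, na.rm = TRUE)))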

You can plot that on your own (print gg). We’re here to tweet, so let’s go into the next section.

Magick Tweeting

@opencpu made it possible to shunt plot output to a magick device. This means we have really precise control over ggplot2 output size as well as the ability to add other graphical components to a ggplot2 plot using magick idioms. One thing we need to take into account is “retina” plots. They are — essentially — double resolution plots (72 => 144 pixels per inch). For the best looking plots we need to go retina, but that also means kicking up base plot theme font sizes a bit. Let’s build on hrbrthemes::theme_ipsum_rc() a bit and make a theme_tweet_rc():

theme_tweet_rc <- function(grid = "XY", style = c("stream", "card"), retina=TRUE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(24, 18, 16, 14, 12),
    card = c(22, 16, 14, 12, 10)
  ) -> font_sizes
  
  theme_ipsum_rc(
    grid = grid, 
    plot_title_size = font_sizes[1], 
    subtitle_size = font_sizes[2], 
    axis_title_size = font_sizes[3], 
    axis_text_size = font_sizes[4], 
    caption_size = font_sizes[5]
  )
  
}

Now, we just need a way to take a ggplot2 object and shunt it off to twitter. The following gg_tweet() function does not (now) use rtweet as I’ll likely add it to either ggalt or hrbrthemes and want to keep dependencies to a minimum. I may opt to bypass the current method later since it relies on environment variables vs an RDS file for app credential storage. Regardless, one thing I wanted to do here was provide a way to preview the image before tweeting.

So you pass in a ggplot2 object (likely adding the tweet theme to it) and a Twitter status text (there’s a TODO to check the length for 140c compliance) plus choose a style (stream or card, defaulting to stream) and decide on whether you’re cool with the “retina” default.

Unless you tell it to send the tweet it won’t, giving you a chance to preview the image before sending, just in case you want to tweak it a bit before committing it to the Twitterverse. It also returns the magick object it creates in the event you want to do something more with it:

gg_tweet <- function(g, status = "ggplot2 image", style = c("stream", "card"), 
                     retina=TRUE, send = FALSE) {
  
  style <- match.arg(tolower(style), c("stream", "card"))
  
  switch(
    style, 
    stream = c(w=1024, h=512),
    card = c(w=800, h=320)
  ) -> dims
  
  dims["res"] <- 72
  
  if (retina) dims <- dims * 2
  
  fig <- image_graph(width=dims["w"], height=dims["h"], res=dims["res"])
  print(g)
  dev.off()
  
  if (send) {
    
    message("Posting image to twitter...")
    
    tf <- tempfile(fileext = ".png")
    image_write(fig, tf, format="png")
    
    # Create an app at apps.twitter.com w/callback URL of http://127.0.0.1:1410
    # Save the app name, consumer key and secret to the following
    # Environment variables
    
    app <- oauth_app(
      appname = Sys.getenv("TWITTER_APP_NAME"),
      key = Sys.getenv("TWITTER_CONSUMER_KEY"),
      secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
    )
    
    twitter_token <- oauth1.0_token(oauth_endpoints("twitter"), app)
    
    POST(
      url = "https://api.twitter.com/1.1/statuses/update_with_media.json",
      config(token = twitter_token), 
      body = list(
        status = status,
        media = upload_file(path.expand(tf))
      )
    ) -> res
    
    warn_for_status(res)
    
    unlink(tf)
    
  }
  
  fig
  
}

Two Great Tastes That Taste Great Together

We can combine the power outage scraper & plotter with the tweeting code and just do:

gg_tweet(
  gg + theme_tweet_rc(grid="Y"),
  status = "Progress! #rtweet #gg_tweet",
  send=TRUE
)

That was, in-fact, the last power outage tweet I sent.

Next Steps

Ironically, given current levels of U.S. news and public “discourse” on Twitter and some inane machinations in my own area of domain expertise (cyber), gg_tweet() is likely one of the few ways I’ll be interacting with Twitter for a while. You can ping me on Keybase — hrbrmstr — or join the rstats Keybase team via keybase team request-access rstats if you need to poke me for anything for a while.

FIN

Kick the tyres and watch for gg_tweet() ending up in ggalt or hrbrthemes. Don’t hesitate to suggest (or code up) feature requests. This is still an idea in-progress and definitely not ready for prime time without a bit more churning. (Also, words_to_num() can be optimized; it was hastily crafted.)
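If you want a head start on that optimization, here’s one possible rewrite (an untested sketch): the {stringi} functions and case_when() are already vectorized, so the map_dbl() iteration can be dropped entirely:

words_to_num <- function(x) {
  val <- stri_match_first_regex(x, "^([[:print:]]+) #PowerOutages")[, 2]
  mul <- case_when(                       # multiplier implied by K/M suffix
    stri_detect_regex(val, "[Kk]") ~ 1000,
    stri_detect_regex(val, "[Mm]") ~ 1000000,
    TRUE ~ 1
  )
  as.numeric(stri_replace_all_regex(val, "[^[:digit:]\\.]", "")) * mul
}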