
I caught a re-tweet of this tweet by @harry_stevens:

Harry’s thread and Observable post are great on their own and both show the power and utility of Observable javascript notebooks.

However, the re-tweet (which I’m not posting because it’s daft) took a swipe at both Python & R. Now, I’m all for a good swipe at Python (mostly to ensure we never forget all those broken spacebars and tab keys that language has caused) but I’ll gladly defend it and R together when it comes to Getting Things Done, even on deadline.

Let’s walk through what one of us might have done had we been in the same scenario as Harry.

Mapping On A Deadline

So, we have to create a map of historical tornado frequency trends on deadline.

We emailed researchers and received three txt files: one holds latitudes, another longitudes, and the final one the trend values. It's gridded data.

Download that ZIP (and pretend you got three separate files in email instead of a nice ZIP), make a new RStudio project called "tornado", and put those three files in a data/ directory at the project root. Let's read them in and look at them:

library(hrbrthemes) # not 100% necessary but i like my ggplot2 theme(s) :-)
library(tidyverse)  # data wrangling & ggplot2

tibble(
  lat = scan(here::here("data/lats.txt")),
  lon = scan(here::here("data/lons.txt")),
  trend = scan(here::here("data/trends.txt"))
) -> tornado

You very likely never directly use the base::scan() function, but it’s handy here since we just have files of doubles with each value separated by whitespace. Now, let’s see what we have:

tornado
## # A tibble: 30,000 x 3
##      lat   lon trend
##    <dbl> <dbl> <dbl>
##  1 0.897 -180.     0
##  2 0.897 -179.     0
##  3 0.897 -178.     0
##  4 0.897 -176.     0
##  5 0.897 -175.     0
##  6 0.897 -174.     0
##  7 0.897 -173.     0
##  8 0.897 -172.     0
##  9 0.897 -170.     0
## 10 0.897 -169.     0
## # … with 29,990 more rows

summary(tornado)
##      lat               lon                 trend           
## Min.   : 0.8973   Min.   :-179.99808   Min.   :-0.4733610  
## 1st Qu.:22.0063   1st Qu.: -90.00066   1st Qu.: 0.0000000  
## Median :43.1154   Median :  -0.00323   Median : 0.0000000  
## Mean   :43.1154   Mean   :  -0.00323   Mean   : 0.0002756  
## 3rd Qu.:64.2245   3rd Qu.:  89.99419   3rd Qu.: 0.0000000  
## Max.   :85.3335   Max.   : 179.99161   Max.   : 0.6314569  

#+ grid-overview
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

#+ trend-overview
ggplot(tornado, aes(trend)) +
  geom_histogram() +
  scale_x_continuous(breaks = seq(-0.5, 0.5, 0.05))

Since we're looking for trends (in either direction) in just the United States, the latitude and longitude ranges will need to be shrunk down a bit (it does indeed look like globally gridded data), and we'll be able to shrink the data set a bit more since we only want to look at strongly positive or negative trends.

We don’t really need modern R/ggplot2 mapping idioms for this project (i.e. the new {sf} ecosystem), so we’ll keep it “simple” (scare quotes since that’s a loaded term) and just use the built in maps and geom_map(). First, let’s get the U.S. states and extract their bounding boxes/limits:

maps::map("state", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>% 
  fortify() %>% 
  as_tibble() -> state_map

xlim <- range(state_map$long)
ylim <- range(state_map$lat)

NOTE: I tend not to use the handy ggplot2::map_data() function since it ends up clobbering purrr::map() which I use heavily (though not in this post). I also try to use {sf} these days so this tends not to be an issue anymore anyway.
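For reference, the map_data() route yields the same structure as the fortify() pipeline above (long, lat, group, order, region, subregion columns), so this minimal sketch is equivalent:

state_map <- tibble::as_tibble(ggplot2::map_data("state")) # same columns as the fortify() route

xlim <- range(state_map$long)
ylim <- range(state_map$lat)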

Now, let’s focus in on the target area in the original paper and the Axios article:

filter(
  tornado,
  between(lon, -107, xlim[2]), between(lat, ylim[1], ylim[2]), # -107 gets us ~left-edge of TX
  ((trend < -0.07) | (trend > 0.07)) # approximates notebook selection range
) -> tornado

#+ grid-overview-2
ggplot(tornado, aes(lon, lat)) +
  geom_point(aes(color = trend))

Now we’re getting close to our final solution.

As stated in the Observable notebook and implied by the word “grid” these dots are centroids of grid rectangles. This means we really want boxes, not points. The article got all fancy but it’s not really necessary since we can use ggplot2::geom_tile() to get us said boxes:

#+ grid-overview-3
ggplot(tornado, aes(lon, lat)) +
  geom_tile(aes(fill = trend, color = trend))

Now, we just need to add in map layers, and tweak some aesthetics to make it look like a map. We’ll start naively:

#+ map-1
ggplot() +
  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +
  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = "black", size = 0.125, fill = NA
  )

Our gridded data is definitely covering the right/same areas so we just need to make this more suitable for an article. We’ll use Harry’s palette and layer in U.S. state borders, an overall country border, and approximate the title and legend aesthetics:

#+ map-final
c(
  "#023858", "#045a8d", "#0570b0", "#3690c0", "#74a9cf",
  "#a6bddb", "#d0d1e6", "#ece7f2", "#fff7fb", "#ffffff",
  "#ffffcc", "#ffeda0", "#fed976", "#feb24c", "#fd8d3c",
  "#fc4e2a", "#e31a1c", "#bd0026", "#800026"
) -> grad_cols # colors from article

ggplot() +

  # tile layer

  geom_tile(
    data = tornado,
    aes(lon, lat, fill = trend, color = trend)
  ) +

  # state borders

  geom_map(
    data = state_map, map = state_map,
    aes(long, lat, map_id = region),
    color = ft_cols$slate, size = 0.125, fill = NA
  ) +

  # usa border

  borders("usa", colour = "black", size = 0.5) +

  # color scales

  scale_colour_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +
  scale_fill_gradientn(
    colours = grad_cols,
    labels = c("Fewer", rep("", 4), "More"),
    name = "Change in tornado frequency, 1979-2017"
  ) +

  # make it Albers-ish and ensure we can fit the borders in 

  coord_map(
    projection = "polyconic",
    xlim = scales::expand_range(range(tornado$lon), add = 2),
    ylim = scales::expand_range(range(tornado$lat), add = 2)
  ) +

  # tweak legend aesthetics

  guides(
    colour = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    ),
    fill = guide_colourbar(
      title.position = "top", title.hjust = 0.5
    )
  ) +
  labs(
    x = NULL, y = NULL
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "top") +
  theme(legend.title = element_text(size = 16, hjust = 0.5)) +
  theme(legend.key.width = unit(4, "lines")) +
  theme(legend.key.height = unit(0.5, "lines"))

FIN

I went through some extra steps for folks new to R, but the overall approach was at the very least as expedient as the Observable one and — despite the claims by the quite daft retweet — this is no less "shareable" or "reusable" than the Observable notebook. You can clone the repo (https://git.sr.ht/~hrbrmstr/tornado) and reuse this work immediately.

If you take a stab at an alternate approach — especially if you do use {sf} — definitely blog about it and drop a link here or on Twitter.

I’m using GitUgh links here b/c the issue was submitted there. Those not wishing to be surveilled by Microsoft can find the macOS QuickLook plugin project and {rdatainfo} project in SourceHut and GitLab (~hrbrmstr and hrbrmstr accounts respectively).

I hadn't touched QuickLookR or {rdatainfo} at all since 2016, as they were really just proof-of-concept examples. Yet the suggestion to have it handle R Markdown (Rmd) files felt useful, so I updated {rdatainfo} to better handle data loading for rds, rdata, and rda file extensions and made a small update to the macOS QuickLookR QuickLook extension project to treat Rmd files as text files, which can be previewed and then edited with the default Finder editor you (or your apps) have set for Rmd files.

The {rdatainfo} package is only needed if you need/want R data file preview support (i.e. it’s not necessary for R markdown files). Just unzip the plugin release and put it into ~/Library/QuickLook. Here are examples for the four file types (the example code under saveRDS() and save() was used to generate those data files and the R markdown file is the default one):

[file icons]

[Rmd preview]

[rds preview]

[rdata preview]
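For illustration, saveRDS()/save() calls along these lines (the file names and objects here are hypothetical, not the exact ones behind the screenshots) are all it takes to produce files the plugin will preview:

saveRDS(mtcars, "~/Desktop/mtcars.rds")                  # single-object rds

save(mtcars, iris, file = "~/Desktop/cars-flowers.rda")  # multi-object rda

airq <- airquality
save(airq, file = "~/Desktop/airquality.RData")          # rdata/RData works the same way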

FIN

This is my first Xcode app build under macOS 10.14 so definitely file issues if you’re having trouble installing or compiling (there are some new shared library “gotchas” that I don’t think apply to this Xcode project but may).

I’ve mentioned {htmlunit} in passing before, but did not put any code in the blog post. Since I just updated {htmlunitjars} to the latest and greatest version, now might be a good time to do a quick demo of it.

The {htmlunit}/{htmlunitjars} packages make the functionality of the HtmlUnit Java library available to R. The TLDR on HtmlUnit is that it can help you scrape a site that uses javascript to create DOM elements. Normally, you'd have to use Selenium/{RSelenium}, Splash/{splashr} or Chrome/{decapitated} to try to work with sites that generate the content you need with javascript. Those are fairly big external dependencies that you need to trudge around with you, especially if all you need is a quick way of getting dynamic content. While {htmlunit} does have an {rJava} dependency, I haven't had any issues getting Java working with R on Windows, Ubuntu/Debian or macOS in a very long while—even on freshly minted systems—so that should not be a show stopper for folks (Java+R guaranteed ease of installation is still far from perfect, though).

To demonstrate the capabilities of {htmlunit} we’ll work with a site that’s dedicated to practicing web scraping—toscrape.com—and, specifically, the javascript generated sandbox site. It looks like this:

Now bring up both the "view source" version of the page in your browser and the developer tools "elements" panel and you'll see that the content is in the javascript right there on the site, but the source has no <div> elements because they're generated dynamically after the page loads.

The critical difference between those two views is one reason I consider the use of tools like "Selector Gadget" to be more harmful than helpful. You're really better off learning the basics of HTML and dynamic pages than relying on that crutch (for scraping) as it'll definitely come back to bite you some day.

Let’s try to grab that first page of quotes. Note that to run all the code you’ll need to install both {htmlunitjars} and {htmlunit} which can be done via: install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is", type="source").

First, we’ll try just plain ol’ {rvest}:

library(rvest)

pg <- read_html("http://quotes.toscrape.com/js/")

html_nodes(pg, "div.quote")
## {xml_nodeset (0)}

Getting no content back is to be expected since no javascript is executed. Now, we’ll use {htmlunit} to see if we can get to the actual content:

library(htmlunit)
library(rvest)
library(purrr)
library(tibble)

js_pg <- hu_read_html("http://quotes.toscrape.com/js/")

html_nodes(js_pg, "div.quote")
## {xml_nodeset (10)}
##  [1] <div class="quote">\r\n        <span class="text">\r\n          “The world as we h ...
##  [2] <div class="quote">\r\n        <span class="text">\r\n          “It is our choices ...
##  [3] <div class="quote">\r\n        <span class="text">\r\n          “There are only tw ...
##  [4] <div class="quote">\r\n        <span class="text">\r\n          “The person, be it ...
##  [5] <div class="quote">\r\n        <span class="text">\r\n          “Imperfection is b ...
##  [6] <div class="quote">\r\n        <span class="text">\r\n          “Try not to become ...
##  [7] <div class="quote">\r\n        <span class="text">\r\n          “It is better to b ...
##  [8] <div class="quote">\r\n        <span class="text">\r\n          “I have not failed ...
##  [9] <div class="quote">\r\n        <span class="text">\r\n          “A woman is like a ...
## [10] <div class="quote">\r\n        <span class="text">\r\n          “A day without sun ...

I loaded up {purrr} and {tibble} for a reason so let’s use them to make a nice data frame from the content:

tibble(
  quote = html_nodes(js_pg, "div.quote > span.text") %>% html_text(trim=TRUE),
  author = html_nodes(js_pg, "div.quote > span > small.author") %>% html_text(trim=TRUE),
  tags = html_nodes(js_pg, "div.quote") %>% 
    map(~html_nodes(.x, "div.tags > a.tag") %>% html_text(trim=TRUE))
)
## # A tibble: 10 x 3
##    quote                                                            author         tags   
##    <chr>                                                            <chr>          <list> 
##  1 “The world as we have created it is a process of our thinking. … Albert Einste… <chr […
##  2 “It is our choices, Harry, that show what we truly are, far mor… J.K. Rowling   <chr […
##  3 “There are only two ways to live your life. One is as though no… Albert Einste… <chr […
##  4 “The person, be it gentleman or lady, who has not pleasure in a… Jane Austen    <chr […
##  5 “Imperfection is beauty, madness is genius and it's better to b… Marilyn Monroe <chr […
##  6 “Try not to become a man of success. Rather become a man of val… Albert Einste… <chr […
##  7 “It is better to be hated for what you are than to be loved for… André Gide     <chr […
##  8 “I have not failed. I've just found 10,000 ways that won't work… Thomas A. Edi… <chr […
##  9 “A woman is like a tea bag; you never know how strong it is unt… Eleanor Roose… <chr […
## 10 “A day without sunshine is like, you know, night.”               Steve Martin   <chr […

To be fair, we didn’t really need {htmlunit} for this site. The javascript data comes along with the page and it’s in a decent form so we could also use {V8}:

library(V8)
library(stringi)

ctx <- v8()

html_node(pg, xpath=".//script[contains(., 'data')]") %>%  # target the <script> tag with the data
  html_text() %>% # get the text of the tag body
  stri_replace_all_regex("for \\(var[[:print:][:space:]]*", "", multiline=TRUE) %>% # delete everything after the `var data=` content
  ctx$eval() # pass it to V8

ctx$get("data") %>% # get the data from V8
  as_tibble() %>%  # tibbles rock
  janitor::clean_names() # the names do not so make them better
## # A tibble: 10 x 3
##    tags    author$name   $goodreads_link        $slug     text                            
##    <list>  <chr>         <chr>                  <chr>     <chr>                           
##  1 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “The world as we have created i…
##  2 <chr [… J.K. Rowling  /author/show/1077326.… J-K-Rowl… “It is our choices, Harry, that…
##  3 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “There are only two ways to liv…
##  4 <chr [… Jane Austen   /author/show/1265.Jan… Jane-Aus… “The person, be it gentleman or…
##  5 <chr [… Marilyn Monr… /author/show/82952.Ma… Marilyn-… “Imperfection is beauty, madnes…
##  6 <chr [… Albert Einst… /author/show/9810.Alb… Albert-E… “Try not to become a man of suc…
##  7 <chr [… André Gide    /author/show/7617.And… Andre-Gi… “It is better to be hated for w…
##  8 <chr [… Thomas A. Ed… /author/show/3091287.… Thomas-A… “I have not failed. I've just f…
##  9 <chr [… Eleanor Roos… /author/show/44566.El… Eleanor-… “A woman is like a tea bag; you…
## 10 <chr [… Steve Martin  /author/show/7103.Ste… Steve-Ma… “A day without sunshine is like…

But, the {htmlunit} code is (IMO) a bit more straightforward and is designed to work on sites that use post-load resource fetching as well as those that use inline javascript (like this one).

FIN

While {htmlunit} is great, it won’t work on super complex sites as it’s not trying to be a 100% complete browser implementation. It works amazingly well on a ton of sites, though, so give it a try the next time you need to scrape dynamic content. The package also contains a mini-DSL if you need to perform more complex page scraping tasks as well.

You can find both {htmlunit} and {htmlunitjars} at:

UPDATE 2019-04-17 — The example at the bottom which shows that the, er, randomly chosen site has the offending <meta> tag present is an old result. As of this update timestamp, that robots noindex tag is not on the site. Since the presence status of that tag is in flux, it will continue to be monitored.


Say your organization has done something pretty terrible. Terrible enough that you really didn't want to acknowledge it initially but eventually blogged about it. You haven't added a blog post since, so that entry sits at the top of your blog index page, which Google can and will keep indexing since it's been linked to from this site, which has a high rating in their massive internal database.

If you wanted to help ensure nobody finds that original page, there are lots of ways to do that.

First, you could add a Disallow entry in your robots.txt for it. Ironically, some organizations don’t go that route but do try to prevent Google (et al) from indexing their terms of use and privacy policy, which might suggest they don’t want to have a historical record that folks could compare changes to, and perhaps are even planning changes (might be good if more than just me saves off some copies of that now).

Now, robots.txt modifications are fairly straightforward. And, they are also super easy to check.
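For instance, a quick check from the R console (swap in whichever site you care about):

robots <- readLines("https://www.datacamp.com/robots.txt", warn = FALSE)
robots[grepl("^(User-[Aa]gent|Disallow):", robots)] # show the crawler rules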

So, what if you wanted to hide your offense from Google (et al) and not make it obvious in your robots.txt? For that, you can use a special <meta> tag in the header of your site.

This is an example of what that looks like:

[screenshot: datacamp]

but that may be hard to see, so let’s look at it up close:

<meta name="robots" content="noindex" class="next-head" />
<title class="next-head">A note to our community (article) - DataCamp</title>
<link rel="canonical" href="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" />
<meta property="og:url" content="https://www.datacamp.com/community/blog/note-to-our-community" class="next-head" />

That initial <meta> tag will generally be respected by all search engines.

And, if you want to really be sneaky, you can add a special X-Robots-Tag: noindex HTTP header to your web server for any page you want to have no permanent record of and sneak past even more eyes.

Unfortunately, some absolute novices who did know how to do the <meta> tag trick aren’t bright enough to do the sneakier version and get caught. Here’s an example of a site that doesn’t use the super stealthy header approach:

[screenshot: datacamp]

FIN

So, if you’re going to be childish and evil, now you know what you really should do to try to keep things out of public view.

Also, if you’re one of the folks who likes to see justice be done, you now know where to check and can use this R snippet to do so whenever you like. Just substitute the randomly chosen site/page below for one that you want to monitor.

library(httr)
library(xml2)

httr::GET(
  url = "https://www.datacamp.com/community/blog/note-to-our-community"
) -> res

data.frame(
  name = names(res$all_headers[[1]]$headers), # if there is more than one set (i.e. redirects) you'll need to iterate
  value = unlist(res$all_headers[[1]]$headers, use.names = FALSE)
) -> hdrs

hdrs[grepl("robots", hdrs[["name"]], ignore.case = TRUE), ] # the header, if present, is "X-Robots-Tag"
## [1] name  value
## <0 rows> (or 0-length row.names)

httr::content(res) %>% 
  xml_find_all(".//meta[@name='robots']")
## {xml_nodeset (1)}
## [1] <meta name="robots" content="noindex" class="next-head">\n

readLines("https://www.datacamp.com/robots.txt")
## [1] "User-Agent: *"                                                              
## [2] "Disallow: /users/auth/linkedin/callback"                                    
## [3] "Disallow: /terms-of-use"                                                    
## [4] "Disallow: /privacy-policy"                                                  
## [5] "Disallow: /create/how"                                                      
## [6] "Sitemap: http://assets.datacamp.com/sitemaps/main/production/sitemap.xml.gz"

Thank you for reading to the end of this note to our community.

Like more posts than I care to admit, this one starts innocently enough with a tweet by @gshotwell:

Since I use at least 4 different d[b]plyr backends every week, this same question surfaces in my own noggin on occasion and I couldn’t resist going all Columbo on this mystery.

I should note that if you only really care about the backends that come with dbplyr @paleolimbot has you covered with this post, which also shows you the translated SQL!

Executing The Plan

There are at least 24 separate backends for dbplyr. Most folks won’t need more than one if their databases all have a decent ODBC or JDBC driver. To be able to use dplyr idioms with databases there needs to be a way to translate R code (e.g. function calls) into SQL. A ton of functions are pre-mapped in dbplyr already and most backend implementations start by relying on these defaults. Furthermore, since SQL is not nearly as “standard” across installations as one might think, some common tasks — such as string manipulation — have a default noop translation.
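If you want a concrete feel for what "translation" means here, {dbplyr} ships simulated connections you can poke at without a live database. A minimal sketch (the exact SQL output depends on your {dbplyr} version):

library(dbplyr)

translate_sql(substr(name, 1, 3), con = simulate_dbi())      # the default translation
translate_sql(paste0(name, "!"), con = simulate_mysql())     # a backend-specific variant
translate_sql(paste0(name, "!"), con = simulate_postgres())  # and another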

If you do have to switch across backends with any frequency, knowing which backend provides support for which functions might be nice, but there hasn't been a reference for this until Dewey & I accepted Gordon's challenge. What makes this a "challenge" is that you first have to figure out what packages provide a d[b]plyr backend interface, then figure out what SQL translations they offer (they don't necessarily have to inherit from the ones provided by dbplyr and may add other ones to account for SQL clauses that aren't in functional form). So the first step was just to look through CRAN for packages that import dbplyr and add in some I knew were on GitHub:

library(stringi)
library(hrbrthemes)
library(tidyverse)

# All the pkgs from the home CRAN mirror that import 'dbplyr'
c(
  "arkdb", "bigrquery", "childesr", "chunked", "civis", "corrr", "cytominer", "dbplot",
  "dbplyr", "dexter", "dexterMST", "dlookr", "dplyr", "dplyr.teradata", "etl",
  "healthcareai", "hydrolinks", "implyr", "infuser", "ipumsr", "macleish", "mdsr",
  "mlbgameday", "modeldb", "MonetDBLite", "mudata2", "parsemsf", "pivot", "pleiades",
  "pool", "poplite", "RClickhouse", "replyr", "RPresto", "sergeant", "sparklyr",
  "sqlscore", "srvyr", "taxizedb", "valr", "wordbankr", "metis.tidy"
) -> pkgs

I ended up doing install.packages(pkgs), which was easy since I have a home CRAN mirror and use macOS (so binary package installs).

The presence of a dbplyr import does not mean a package implements a backend, so we have to load their namespaces and see if they have the core "tell" (i.e. they implement sql_translate_env()):

(map_df(pkgs, ~{
  tibble(
    pkg = .x,
    trans = loadNamespace(.x) %>%
      names() %>%
      keep(stri_detect_fixed, "sql_translate")
  )
}) -> xdf)
## # A tibble: 28 x 2
##    pkg       trans
##    <chr>     <chr>
##  1 bigrquery sql_translate_env.BigQueryConnection
##  2 civis     sql_translate_env.CivisConnection
##  3 dbplyr    sql_translate_env.ACCESS
##  4 dbplyr    sql_translate_env.Oracle
##  5 dbplyr    sql_translate_env.SQLiteConnection
##  6 dbplyr    sql_translate_env.Impala
##  7 dbplyr    sql_translate_env.OdbcConnection
##  8 dbplyr    sql_translate_env.MySQLConnection
##  9 dbplyr    sql_translate_env.PqConnection
## 10 dbplyr    sql_translate_env.PostgreSQLConnection
## # … with 18 more rows

Now we know the types of connections that package has SQL translation support for. But, we’re looking for the actual functions they provide. To discover that, we’re going to make dummy classed connection objects and get the translations they offer.

However, some may take the defaults from dbplyr and not override them so we also need to test if they use the sql_not_supported() noop, which we can do by seeing if the function body has a call to stop() in it. We’re also going to ignore maths operators along the way:

(filter(xdf, stri_detect_fixed(trans, ".")) %>%
  filter(trans != "sql_translate_env.NULL") %>% # ignore NULL
  filter(trans != "sql_translate_env.Pool") %>% # ignore db connection pooling 
  filter(trans != "sql_translate_env.PrestoConnection") %>% # this one errored out
  mutate(ƒ = map(trans, ~{

    # get the sql translate functions
    con <- NA
    cls <- stri_replace_first_fixed(.x, "sql_translate_env.", "")
    class(con) <- cls

    env <- sql_translate_env(con)

    # but ^^ rly isn't a nice, tidy object, it's a list of environments
    # with functions in it so we have to iterate through it to extract
    # the function names.

    map_df(env, ~{

      part <- .x
      fs <- names(part)

      # but it's not just good enough to do that b/c a given function name
      # might just implement the "sql_not_supported()" pass through. So we have
      # to actually look to see if the function body has a "stop()" call in it
      # and ignore it if it does.

      map_df(fs, ~{
        tibble(ƒ = .x, src = paste0(as.character(body(part[[.x]])), collapse = "; ")) %>% # this gets the body of the function
          filter(!stri_detect_fixed(src, "stop(")) %>%
          filter(stri_detect_regex(ƒ, "[[:alpha:]]")) %>% # and we rly don't care about maths
          select(-src)
      })
    })
  })) %>%
  unnest(ƒ) %>%
  mutate(trans = stri_replace_first_fixed(trans, "sql_translate_env.", "")) -> xdf)
## # A tibble: 1,318 x 3
##    pkg       trans              ƒ
##    <chr>     <chr>              <chr>
##  1 bigrquery BigQueryConnection median
##  2 bigrquery BigQueryConnection gsub
##  3 bigrquery BigQueryConnection as.logical
##  4 bigrquery BigQueryConnection is.null
##  5 bigrquery BigQueryConnection case_when
##  6 bigrquery BigQueryConnection is.na
##  7 bigrquery BigQueryConnection if_else
##  8 bigrquery BigQueryConnection str_replace_all
##  9 bigrquery BigQueryConnection as.integer
## 10 bigrquery BigQueryConnection as.character
## # … with 1,308 more rows

The rest is all just ggplot2 basics:

 mutate(xdf, db = glue::glue("{pkg}\n{trans}")) %>% # make something useful to display for the DB/conn
  mutate(n = 1) %>% # heatmap block on
  complete(db, ƒ) %>% # complete the heatmap
  arrange(ƒ) %>%
  mutate(ƒ = factor(ƒ, levels=rev(unique(ƒ)))) %>% # arrange the Y axis in the proper order
  ggplot(aes(db, ƒ)) +
  geom_tile(aes(fill = n), color="#2b2b2b", size=0.125, show.legend=FALSE) +
  scale_x_discrete(expand=c(0,0.1), position = "top") +
  scale_fill_continuous(na.value="white") +
  labs(
    x = NULL, y = NULL,
    title = "SQL Function Support In Known d[b]plyr Backends"
  ) +
  theme_ipsum_ps(grid="", axis_text_size = 9) + # you'll need to use the dev version of hrbrthemes for this function; just sub out a diff theme if you already have hrbrthemes loaded
  theme(axis.text.y = element_text(family = "mono", size = 7))

Which makes:

(WP wouldn't make the featured image linkable so I had to stick it in again to enable the link so folks can make it full size, which is absolutely necessary to see it).

FIN

If you do play with the above, don’t forget to go one more step and incorporate Dewey’s actual SQL mapping to see just how unstandardized the SQL standard is.

Contiguous code for the above is over at SourceHut.

Researching “the internet” (i.e. $DAYJOB) means having to deal with a ton of “unique” (I’m being kind) data formats. This is ultimately a tale of how I performed full-text searches across one of them.

It all started off innocently enough. This past week I needed to be able to do full-text searches across metadata about who is using which parts of the internet. Normally I don't need to do that at scale and can just go to RIPE's excellent resource and manage to find what I need on the first page. However, this time I needed all the resultant info and noticed an interesting foible on that full text search interface. To reproduce it, enter something like "domino's" (for the record, I'm not researching Domino's Pizza — nor would I ever consume it — but a Twitter ad happened to fly by for Domino's and I just typed it for kicks) into the field and page around, keeping an eye on the results. I think they still use Solr for indexing/searching and aren't passing in all they need to keep session context or something. Anyway, suffice it to say it was fairly useless (I filed a bug report, so I'm not just complaining, and I wish more sites had the same easy error-report filing capability the RIPE folks do).

If it were just searching for precise data in one field, that’s not really an issue since we have ALL THE WHOIS IP THINGS in Parquet. But:

  • I really hate giving Amazon money (even if it’s $WORK money) for Athena queries
  • Full text search across all columns is not one of Parquet’s strengths
  • This is a third bullet b/c I feel compelled to have a minimum of three points in bullet lists likely thanks to an overbearing middle-school English teacher

Since I have a modest analytics server setup at home, I figured I'd take the opportunity to re-brush-up on either Elasticsearch or Couchbase since both are pretty great at free text searching JSON data. Except…this isn't JSON data. It's records formatted like this:

#
# The contents of this file are subject to 
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#

as-block:       AS7 - AS7
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

as-block:       AS28 - AS28
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

The "keys" (the colon-ified line prefixes) vary and there are other record types (which I don't need) that have other prefixes in them, plus those #-prefixed comments are not necessarily only at the top. But, after judicious use of stringi::stri_enc_toutf8(), stringi::stri_split_regex() and some vectorized record targeting (see the sketch after the sample records below), they're pretty easily converted to lovely ndjson data like this (random selection further in the conversion):

{"descr":"Reseau Teleinformatique de l'Education Nationale Educational and research network for Luxembourg","admin_c":"DUMY-RIPE","as_set":"AS-RESTENA","members":"AS2602, AS42909, AS51966, AS49624","mnt_by":"AS2602-MNT","notify":"noc@restena.lu","tech_c":"DUMY-RIPE"}
{"descr":"CWIX ASes announced to EBONE","admin_c":"DUMY-RIPE","as_set":"AS-TMPEBONECWIX","members":"AS3727, AS4445, AS4610, AS4624, AS4637, AS4654, AS4655, AS4656, AS4659 AS4681, AS4696, AS4714, AS4849, AS5089, AS5090, AS5532, AS5551, AS5559 AS5655, AS6081, AS6255, AS6292, AS6618, AS6639","mnt_by":"EBONE-MNT","notify":"staff@ebone.net","tech_c":"DUMY-RIPE"}
{"descr":"ASs accepted by DFN from the University of Cologne","admin_c":"DUMY-RIPE","as_set":"AS-DFNFROMCOLOGNE","members":"AS5520 AS6733","mnt_by":"DFN-MNT","tech_c":"DUMY-RIPE"}
{"descr":"NetMatters UK","admin_c":"DUMY-RIPE","as_set":"AS-NETMATTERS","members":"AS6765 AS3344","mnt_by":"AS8407-MNT","tech_c":"DUMY-RIPE"}

I went with Couchbase since it handles ndjson import by default and — as you know since you read the comparison in the aforelinked article — it can easily index all fields by default without you having to do virtually anything. Plus, Couchbase has been around long enough that it generally installs without pain and has a fairly decent web admin panel. Here’s a snapshot of the final import:

and here’s the config for the “all” full text index:

{
  "type": "fulltext-index",
  "name": "all",
  "uuid": "481bc7ed642dddfb",
  "sourceType": "couchbase",
  "sourceName": "ripe",
  "sourceUUID": "3ffbbe0c0923f233ffe0fc96c652262d",
  "planParams": {
    "maxPartitionsPerPIndex": 171
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": true
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch",
      "kvStoreName": ""
    }
  },
  "sourceParams": {}
}

You Said This Is A Post With R Code

Very true! We’ll get to that in a minute.

Going with Couchbase introduced a different problem: there’s almost no R support for Couchbase. Sure, Couchbase has a gnarly, two-year old, raw httr::-prefixed bit of a tutorial post but that’s not really as cool as if there were a library(couchbase). I mean, you can check GitUgh or CRAN or a more general search yourself if you’d like but it’s going to come up bupkis.

If you were expecting a big reveal, right now, that I’ve got a feature-packed, full R Couchbase package ready to roll…you didn’t actually read the title of the post. What I do have is a set of functions that — given server/connection metadata, a bucket, a full text index, and a query — will return all matching documents (I still do not like that term for “record”) for said set of parameters:

# function code is in: https://paste.sr.ht/~hrbrmstr/051f5d5400644952a3ad2cf8664b84e2cbb9ac6b

cb_fts("domino's", "all", "ripe")
## # A tibble: 120 x 9
##    admin_c   country descr                      inetnum                  mnt_by      netname  status    tech_c  notify         
##    <chr>     <chr>   <chr>                      <chr>                    <chr>       <chr>    <chr>     <chr>   <chr>          
##  1 DUMY-RIPE FR      OPEN IP DOMINO'S PIZZA     79.141.8.44 - 79.141.8.… ALPHALINK-… OPEN-IP  ASSIGNED… DUMY-R… NA             
##  2 DUMY-RIPE NL      Domino's Pizza TILBURG     62.21.176.160 - 62.21.1… AS286-MNT   OTS2634… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  3 DUMY-RIPE NL      Domino's Pizza EINDHOVEN   62.132.252.168 - 62.132… AS286-MNT   OTS2270… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  4 DUMY-RIPE NL      Domino's Pizza SPYKENISSE  194.123.233.232 - 194.1… AS286-MNT   OTS69259 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  5 DUMY-RIPE NL      Domino's AMSTERDAM         37.74.38.188 - 37.74.38… AS286-MNT   OTS6103… ASSIGNED… DUMY-R… kpn-ip-office@…
##  6 DUMY-RIPE NL      Domino's Pizza VOORSCHOTEN 92.66.116.136 - 92.66.1… AS286-MNT   OTS1914… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  7 DUMY-RIPE NL      Domino's Pizza Doetinchem… 212.241.42.136 - 212.24… AS286-MNT   OTS2301… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  8 DUMY-RIPE NL      Domino's Pizza AMSTERDAM   194.120.45.224 - 194.12… AS286-MNT   OTS82906 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  9 DUMY-RIPE NL      Domino's Pizza [Woerden] … 62.41.228.80 - 62.41.22… AS286-MNT   OTS2024… ASSIGNED… DUMY-R… ip-reg@kpn.net 
## 10 DUMY-RIPE NL      Domino's Pizza GRONINGEN   188.203.128.0 - 188.203… AS286-MNT   OTS3767… ASSIGNED… DUMY-R… kpn-ip-office@…
## # … with 110 more rows

It’s not fancy.

It meets the needs of a narrow use case.

It’s not in a standalone package (which is triggering my R code OCD something fierce).

But, it's seriously fast, got me back to "work mode" with a minimum of hassle, and now there's some google-able Couchbase R code that isn't just bare httr calls, which may help someone else who's on a quest for how to work with Couchbase in R.

The first function, cb_fts(), uses the /api/index/{index-name}/query API endpoint to paginate through results of the full text search and retrieve all matching document id keys, then calls the second function, cb_get_records_from_keys(), which uses the /query/service API endpoint, issues a SELECT * FROM {bucket} USE KEYS {keys} query with all the found document (record) key ids, and returns the result set. Nothing fancier than that.
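Stripped down, those two calls look roughly like this (a sketch with assumed localhost ports and credentials, and no pagination or error handling; the real functions live in the paste linked above):

library(httr)
library(jsonlite)

fts_host  <- "http://localhost:8094" # full text search service
n1ql_host <- "http://localhost:8093" # query service

# full text search: collect the matching document id keys
res <- POST(
  url = sprintf("%s/api/index/%s/query", fts_host, "all"),
  authenticate("user", "pass"), content_type_json(),
  body = toJSON(list(query = list(query = "domino's"), size = 100L, from = 0L), auto_unbox = TRUE)
)
ids <- vapply(content(res)$hits, `[[`, character(1), "id")

# then fetch the full records for those keys
res <- POST(
  url = sprintf("%s/query/service", n1ql_host),
  authenticate("user", "pass"), encode = "form",
  body = list(statement = sprintf("SELECT * FROM ripe USE KEYS %s", toJSON(ids)))
)
str(content(res)$results, 1)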

FIN

While I do not have these functions in a standalone, Couchbase-focused package I do have them in the package associated with this particular project. If you do know of a Couchbase R package (please don’t link to JDBC/ODBC drivers as I’m not going to buy) please link to them in the comments.

If you have other strategies for how to deal with these “un-packages”, please blog about it and post a link as well! I’m curious how others balance the package/not-a-package/un-package tension, especially when you may need to depend on a series of functions across projects.

@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.

T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (ref any in a comment if you do know of some) so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately named tdigest:

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest S3 method for stats::quantile()).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")

will install from source or binaries onto your system(s).

Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03)

td <- td_create()

for (i in 1:100000) {
  td_add(td, sample(100, 1), 1)
}

quantile(td)
## [1]   1.00000  25.62222  53.09883  74.75522 100.00000

More R-like Ops

Vectorisation is the name of the game in R and we can use tdigest() to work in a vectorised manner:

set.seed(2019-04-03)

x <- sample(100, 1000000, replace=TRUE)

td <- tdigest(x)

quantile(td)
## [1]   1.00000  25.91914  50.79468  74.76439 100.00000

Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It's pretty darned fast:

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr      min         lq        mean    median       uq       max neval
##     tdigest    22.81    26.6525    48.70123    53.355    63.31    151.29   100
##  r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50   100

Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.
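A quick side-by-side with the td and x objects from above makes that point (the extremes match; the middle quantiles will be close but not identical):

round(quantile(td), 2)
round(quantile(x), 2)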

FIN

This was a quick (but complete) wrapper and could use some tyre kicking. I've a mind to add serialization to the C implementation so I can then enable [de]serialization on the R side since that would (IMO) make t-digest ops more useful in an R context, especially since you can merge two different t-digests.

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whomever started the braces notation for package names (e.g. {ggplot2}): brilliant!

I saw a second post on turning htmlwidgets into interactive Twitter Player cards and felt somewhat compelled to make creating said entities a bit easier so posited the following:

I figured 40+ 💙 could not be wrong, so thus begat widgetcard:

To make this post as short as possible, the TLDR is that you just pass in an htmlwidget and some required parameters and you get back a deployable interactive Twitter Player card as an archive file and local directory. The example code is almost as short since we’re cheating and using the immensely helpful plotly package to turn a ggplot2 vis into something interactive.

First, make the vis:

library(ssh)
library(plotly)
library(ggplot2)
library(widgetcard)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() -> gg

Now, we create a local preview image for the plot we just made since we need one for the card:

preview <- gg_preview(gg)

NOTE that you can use any image you want. This function streamlines the process for plotly plots created from ggplot2 plots. There are links to image sizing guidelines in the package help files.

Now, we convert our ggplot2 object to a plotly object and create the Twitter Player card. Note that Twitter really doesn’t like standalone widgets being used as Twitter Player card links due to their heavyweight size. Therefore, card_widget() creates a non-standalone widget but bundles everything up into a single directory and deployable archive.

ggplotly(gg) %>% 
  card_widget(
    output_dir = "~/widgets/tc",
    name_prefix = "tc",
    preview_img = preview,
    html_title = "A way better title",
    card_twitter_handle = "@hrbrmstr",
    card_title = "Basic ggplot2 example",
    card_description = "This is a sample caRd demonstrating card_widget()",
    card_image_url_prefix = "https://rud.is/vis/tc/",
    card_player_url_prefix = "https://rud.is/vis/tc/",
    card_player_width = 480,
    card_player_height = 480
  ) -> arch_fil

Here’s what the resulting directory structure looks like:

tc
├── tc.html
├── tc.png
└── tc_files
    ├── crosstalk-1.0.0
    │   ├── css
    │   │   └── crosstalk.css
    │   └── js
    │       ├── crosstalk.js
    │       ├── crosstalk.js.map
    │       ├── crosstalk.min.js
    │       └── crosstalk.min.js.map
    ├── htmlwidgets-1.3
    │   └── htmlwidgets.js
    ├── jquery-1.11.3
    │   ├── jquery-AUTHORS.txt
    │   ├── jquery.js
    │   ├── jquery.min.js
    │   └── jquery.min.map
    ├── plotly-binding-4.8.0
    │   └── plotly.js
    ├── plotly-htmlwidgets-css-1.39.2
    │   └── plotly-htmlwidgets.css
    ├── plotly-main-1.39.2
    │   └── plotly-latest.min.js
    ├── pymjs-1.3.2
    │   ├── pym.v1.js
    │   └── pym.v1.min.js
    └── typedarray-0.1
        └── typedarray.min.js

(There’s also a tc.tgz at the same level as the tc directory.)

The widget is iframe’d using widgetframe and then saved out using htmlwidgets::saveWidget().
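If you wanted to do that pairing by hand it would look something like this (just a sketch of the idea, not widgetcard's actual internals; the output path is hypothetical):

library(plotly)
library(widgetframe)
library(htmlwidgets)

w  <- ggplotly(gg)
fw <- frameableWidget(w) # adds the pym.js plumbing so the widget is iframe-friendly
saveWidget(fw, file = "~/widgets/tc/tc.html", selfcontained = FALSE) # non-standalone: assets land in tc_files/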

Now, for deploying this to a web server, one could use a method like this to scp the deployable archive:

sess <- ssh_connect(Sys.getenv("SSH_HOST"))

invisible(scp_upload(
  sess, files = arch_fil, Sys.getenv("REMOTE_VIS_DIR"), verbose = FALSE
))

ssh_exec_wait(
  sess,
  command = c(
    sprintf("cd %s", Sys.getenv("REMOTE_VIS_DIR")),
    sprintf("tar -xzf %s", basename(arch_fil))
  )
)

Alternatively, you can use other workflows to transfer and expand the archive or copy output to your static blog host.

Make sure to test anything you build with Twitter’s validator before tweeting it out.

FIN

This works but is super nascent and could use some serious IRL tyre kicking and brutal feedback. Pick the least offensive social coding site you prefer and file issues & PRs at will.