Makeover Jambalaya: Beating Dumbbells into Slopegraphs Whilst Orchestrating EtherCalc

This morning, @kairyssdal tweeted out the following graphic from @axios:

If you’re doing the right thing and blocking evil social media javascript you can find the Axios story here and the graphic below:

I’m gonna say it: the chart is confusing. I grok what they were trying to do, but this is a clear example where a slopegraph would definitely be a better choice than a directional dumbbell chart. At the time I had ~5 minutes to spare so I did a quick makeover and a short howto thread. This post is an expansion on said thread and if you’re in the midst of making the decision to consider reading or moving on here’s what we’ll cover:

  • Making the choice between scraping or manual data entry
  • Quick introduction to EtherCalc
  • Why you might consider using EtherCalc for manual data entry over Excel or Google Sheets
  • Pulling data from EtherCalc into R
  • Making a slopegraph with the captured data
  • Customizing the slopegraph with the captured data to tell one or more stories

Read on if any or all of that is captures your interest.

To scrape or not to scrape

Even if I wanted to scrape the site, Axios makes it pretty clear they are kinda not very nice people since — while it doesn’t mention scraping — that ToS link does indicate that:

(a) you will not license, sell, rent, lease, transfer, assign, distribute, host, or otherwise commercially exploit the Site or any content displayed on the Site; (b) you will not modify, make derivative works of, disassemble, reverse compile or reverse engineer any part of the Site; (c) you will not access the Site in order to build a similar or competitive website, product, or service; and (d) except as expressly stated herein, no part of the Site may be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means. Unless otherwise indicated, any future release, update, or addition to the Site’s functionality will be subject to these Terms. All copyright and other proprietary notices on the Site (or on any content displayed on the Site) must be retained on all copies thereof.

(OH NO I COPIED THAT FROM THEIR SITE SO I AM ALREADY VIOLATING THEIR [unenforceable] TOS!)

There’s this thing called “Fair Use” and this makeover I’m doing is 100% covered under that. The Axios ToS and the ToS of many other sites try to prohibit such fair use and they generally lose those battles in court. I have and will be citing them as sources throughout this post and the post itself falls under “commentary and criticism”. Unlike many unethical scrapers who are just scavenging data they did no work to generate and whose work will not serve the better interest of the general community, this post is fully dedicated to sharing and education.

In reality, Axios likely has such draconian ToS due to all the horrible unethical scrapers who just want free, un-cited news content.

Anyway

Even if I could scrape they don’t embed a javascript data object nor do they load an XHR JSON data blob to make the graphic. They use an idiom of loading a base image then perform annotation via markup:

making it not worth taking the time to scrape.

That means data entry. 🙁

Using EtherCalc for fun and profit data entry

I dislike Microsoft Excel (even the modern versions of it) because it is overkill for data entry. I also dislike performing data entry in Google Sheets since that means I need to be cloud-connected. So, for small, local data entry needs I turn to EtherCalc. No internet access is required, nor is there a bloated app to run.

EtherCalc is a multiuser Google Sheets-like browser-based spreadsheet powered by javascript (both in-browser and the back-end). You can install it with:

$ npm install -g ethercalc

which assumes you have a working Node.js setup along with npm.

When you run:

$ ethercalc

you are given a URL to hit with your browser. Below is what that looks like with my data entry already complete:

It can use Redis or a local filesystem as a persistence layer and does support multiple folks editing the same document at the same time.

At this point I could just save it out manually to a CSV file and read it in the old-fashioned way, but EtherCalc has an API! So we can grab the data using {httr} calls, like this:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))
## # A tibble: 14 x 3
##    topic                actually_read say_want_covered
##    <chr>                        <dbl>            <dbl>
##  1 Health care                      7                1
##  2 Climate change                   5                2
##  3 Education                       11                3
##  4 Economics                        6                4
##  5 Science                         10                7
##  6 Technology                      14                8
##  7 Business                        13               11
##  8 National Security                1                5
##  9 Politics                         2               10
## 10 Sports                           3               14
## 11 Immigration                      4                6
## 12 Arts & entertainment             8               13
## 13 U.S. foreign policy              9                9
## 14 Religion                        12               12

where a983kmmne1i7 is the active document identifer.

Now that we have the data, it’s time to start the makeover.

Stage 1: A basic slopegraph

(If you need a primer on slopegraphs, definitely check out this resource by @visualisingdata.)

We need to make a decision as to what’s going where on the slopegraph. I’m choosing to put what respondents actually read on the left and then what they say they want covered on the right. Regardless of order, we need to do bit of data wrangling to take a first stab at the chart:

ggplot() +
  # draw the slope lines
  geom_segment(
    data = xdf,
    aes(
      x = "Actually read", y = actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    )
  ) +
  # left and right vertical bars
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  # left and right category text
  geom_text(data = xdf, aes("Actually read", actually_read, label = topic)) +
  geom_text(data = xdf, aes("Say they\nwant covered", say_want_covered, label = topic)) +
  scale_x_discrete(position = "top")

That chart isn’t winning any (good) awards any time soon. Apart from the non-aligned category labels, the categories aren’t in traditional order (rank “#1” being at the top on the left), plus we definitely need more information on the chart (title, subtitle, caption, etc.). We’ll reorder the labels and tweak some of the aesthetic problems away and switch the theme:

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered))) # get the category order right
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered) # reverse the Y axis

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered
    ),
    size = 0.25, color = "#b2b2b2"
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01
  ) +
  scale_x_discrete(position = "top") +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank())

That looks much better and I stopped there due to time constraints for the initial thread. However, the slope lines tend to be fairly hard to follow and we really should be telling a story with them. But what story do we want to focus on ?

Story time

One aesthetic element we’ll want to immediately modify regardless of story is the line color. We can use the dir column for this:

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    )
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.2
    ),
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    ),
    labels = c(
      `-1` = "Topics Readers Want Covered < Topics Read",
      `0` = "Topics Read The Same Amount As They Want Covered",
      `1` = "Topics Read < Topics Readers Want Covered"
    )
  ) +
  guides(
    size = FALSE
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text = element_blank()) +
  theme(legend.position = "bottom") +
  theme(legend.direction = "vertical")

It’s still somewhat hard to pick out stories and the legend may be useful but it’s not ideal. Let’s highlight the different slope types with color, annotate them directly, and see what emerges:

library(hrbrthemes)
library(tidyverse)

httr::GET(
  url = "http://localhost:8000/a983kmmne1i7.csv"
) -> res

(xdf <- httr::content(res))

xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered)))
xdf <- mutate(xdf, actually_read = -actually_read, say_want_covered = -say_want_covered)

arw <- arrow(length = unit(5, "pt"), type = "closed")
#   x = c(1.2, 1.8, 1.9),
# y = -c(1, 13, 14),
# xend = c(1.05, 1.7, 1.6),
# yend = -c(1.125, 13, 14)
# ),
# aes(x, y , xend=xend, yend=yend),

ggplot() +
  geom_segment(
    data = xdf,
    aes(
      "Actually read", actually_read,
      xend = "Say they\nwant covered", yend = say_want_covered,
      color = dir, size = dir
    ), show.legend = FALSE
  ) +
  geom_vline(aes(xintercept = c(1, 2)), color = "#b2b2b2") +
  geom_text(
    data = xdf,
    aes("Actually read", actually_read, label = topic),
    family = font_rc, size = 4, hjust = 1, nudge_x = -0.01, lineheight = 0.875
  ) +
  geom_text(
    data = xdf,
    aes("Say they\nwant covered", say_want_covered, label = topic),
    family = font_rc, size = 4, hjust = 0, nudge_x = 0.01, lineheight = 0.875
  ) +
  geom_curve(
    data = data.frame(), 
    aes(x = 1.2, y = -1, xend = 1.05, yend = -1.125), 
    color = ft_cols$red, arrow = arw
  ) +
  geom_segment(
    data = data.frame(), aes(x = 1.6, xend = 1.6, yend = -12.1, y = -12.9), 
    color = "#2b2b2b", arrow = arw
  ) +
  geom_curve(
    data = data.frame(), aes(x = 1.2, y = -14.1, xend = 1.1, yend = -13.6),
    curvature = -0.5, color = ft_cols$blue, arrow = arw
  ) +
  geom_text(
    data = data.frame(
      x = c(1.15, 1.6, 1.2),
      y = -c(1.2, 13, 14),
      hjust = c(0, 0.5, 0),
      vjust = c(0.5, 1, 0.5),
      lab = c(
        "Topics Readers Want Covered < Topics Read",
        "Topics Read The Same Amount\nAs They Want Covered",
        "Topics Read < Topics Readers Want Covered"
      ),
      stringsAsFactors = FALSE
    ),
    aes(x, y, hjust = hjust, vjust = vjust, label = lab),
    family = font_rc, size = 2.5, lineheight = 0.875
  ) +
  scale_x_discrete(position = "top") +
  scale_size_manual(
    values = c(
      `-1` = 0.75,
      `0` = 0.2,
      `1` = 0.2
    )
  ) +
  scale_color_manual(
    name = NULL,
    values = c(
      `-1` = ft_cols$red,
      `0` = "#2b2b2b",
      `1` = ft_cols$blue
    )
  ) +
  labs(
    x = NULL, y = NULL,
    title = "14 Topics Ranked by What Americans Read vs Want Covered",
    subtitle = "'Read' rank from Parse.ly May 2019 data.\n'Want covered' rank from Axios/SurveyMonkey poll conducted May 17-20, 2019",
    caption = "Source: Axios <https://www.axios.com/news-consumption-read-topics-56467fe6-81bd-4ae5-9173-cdff9865deda.html>\nMakeover by @hrbrmstr"
  ) +
  theme(axis.text.x = element_text(size = 12, face = "bold", color = "black")) +
  theme(axis.text.y = element_blank())

This first story indicates a potential social desirability bias in the respondents in that they claim to care more about health care, climate change, and education but really care more about more frivolous things (sports), base things (politics), and things they have almost no control over (national security).

Let’s switch the focus (only showing the modified aesthetic to avoid a code DoS):

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.2,
      `1` = 0.75
    )
  ) +

Now we get to see just how far down on the priority list some of the “desired coverage” topics really sit. At least Health care is not at the bottom, but given how much technology controls our lives it’s a bit disconcerting to see that at the bottom.

What about the categories that did not differ in rank:

  scale_size_manual(
    values = c(
      `-1` = 0.2,
      `0` = 0.75,
      `1` = 0.2
    )
  ) +

You’re guess is as good as mine why folks rated these the same (assuming the surveys had similar language).

FIN

Now that you’ve got the data (oh, right, I forgot to do that):

structure(list(topic = c("Health care", "Climate change", "Education", 
"Economics", "Science", "Technology", "Business", "National Security", 
"Politics", "Sports", "Immigration", "Arts & entertainment", 
"U.S. foreign policy", "Religion"), actually_read = c(7, 5, 11, 
6, 10, 14, 13, 1, 2, 3, 4, 8, 9, 12), say_want_covered = c(1, 
2, 3, 4, 7, 8, 11, 5, 10, 14, 6, 13, 9, 12)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L), spec = structure(list(
    cols = list(topic = structure(list(), class = c("collector_character", 
    "collector")), actually_read = structure(list(), class = c("collector_double", 
    "collector")), say_want_covered = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

and some alternate views, perhaps you have an even better way to look at it. Drop a note in the comments with any of your creations or suggestions for improvement for the final versions shown here.

Cover image from Data-Driven Security
Amazon Author Page

8 Comments Makeover Jambalaya: Beating Dumbbells into Slopegraphs Whilst Orchestrating EtherCalc

  1. Pingback: Introducing the {ethercalc} package | rud.is

  2. Ben G

    Great post, can you explain this line a little bit xdf <- mutate(xdf, dir = factor(sign(actually_read - say_want_covered))) # get the category order right?

    Reply
    1. hrbrmstr

      Glad to!

      By “category order” I really mean “slope going up”, “level”, “slope going down”.

      sign() gives you the sign of numeric vector. e.g:

      sign(c(-5, 0, 5))
      ## [1] -1  0  1
      

      So -1 translates to “slope going down”, 0 translates to “level”, and 1 translates to slope going up.

      If I had not been lazy and focused on literate programming (my bad and def sincere apologies to all readers), I could have done something like this factor operation:

      set.seed(13) # this seed happens to randomly pull a 0
      (dat <- sample(-100:100, 10))
      ##  [1] -98  91   0 -27 -95  75  49  31 -37 -20
      
      factor(
        sign(dat),
        levels = c(-1, 0, 1),
        labels = c("Slope going down", "Level", "Slope going up")
      )
      ##  [1] Slope going down Slope going up   Level           
      ##  [4] Slope going down Slope going down Slope going up  
      ##  [7] Slope going up   Slope going up   Slope going down
      ## [10] Slope going down
      ## Levels: Slope going down Level Slope going up
      

      which would have also made some of the other parts of the blog post a bit more literate as well.

      Def let me know if this needs more elaboration & #ty for checking out the blog and taking time to comment!

      Reply
  3. Pingback: Making sense of tangled analytics - Enholm Heuristics

  4. Pingback: More On Slopegraphs – Data Science Austria

  5. Pingback: Introducing the {ethercalc} package – Data Science Austria

Leave a Reply to hrbrmstr Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.