The infamous @albertocairo [blogged about](http://www.thefunctionalart.com/2016/06/propublica-visualizes-seasonality-in.html) a [nice interactive piece on German company tax avoidance](https://projects.propublica.org/graphics/dividend) by @ProPublica. Here’s a snapshot of their interactive chart:
![](https://2.bp.blogspot.com/-S-8bu1UdYWM/V1rXibnBxrI/AAAAAAAAGo0/L940SpU3DvUPX90JK82jrKQN6fWMyn2IACLcB/s1600/1prop.png)
Dr. Cairo (his PhD is in the bag as far as I’m concerned :-) posited:
>_Isn’t it weird that the chart doesn’t have a scale on the Y-axis? It’s not the first time I see this, and it makes me feel uneasy._
I jumped over to the interactive piece to see if the authors used tooltips, since viewers can get a good idea of the scale limits that way, and it _kinda sorta_ makes a missing Y-axis scale mostly OK if they compensate with said interactive notations. The interactive had no tooltips and the Y-axis was completely unlabeled.
Now, they used D3, so there _are_ built-in ways to create and add a Y-axis, so I don’t think this was an “oops…we forgot” moment. The Y values are “Short Interest Quantity”, which is the quantity of stock shares that investors have sold short but not yet covered or closed out. It’s definitely a “1%-er” term, and the authors already took time to explain some technical financial details and probably would have had to add even more text to explain this term properly (since that short definition is really not enough for most of us 99%-ers). It seems that they felt the arrowed annotations on the right-hand side of the plot made up for the lack of actual Y-axis detail.
Should we _always_ have labels on a given axis? Would knowing that the Y-axis on this chart went from 0 to 800 million have aided in decoding or grokking the overall message? Here’s another example to help frame that question. This is the seminal `ggplot2::geom_density()` demo chart:
Given that folks outside the realm of statistics/datasci really don’t grok what that Y-axis is saying, would it be _horribad_ to just leave it with a _“density”_ Y-label (sans unit marks) and then explain it in text (or talk to/around it in text but not go into detail)? Or should we keep the full annotations and spend a precious paragraph of text talking about measuring the area under a curve? (Another argument is to choose the right vis for the right audience, but that’s another post entirely.)
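For the curious, here is a minimal sketch of those two options. It assumes `ggplot2` is installed and uses its bundled `diamonds` data purely as a stand-in (not necessarily the data behind the demo chart); the object names `with_scale` and `sans_scale` are made up for illustration. It also checks, roughly, the “area under a curve” property that the Y-axis tick marks are encoding.

```r
library(ggplot2)

# Stand-in data: carat weights from ggplot2's bundled diamonds data set
base <- ggplot(diamonds, aes(carat)) + geom_density()

# Option 1: full annotations (tick marks plus the "density" axis title)
with_scale <- base + labs(y = "density")

# Option 2: keep the qualitative "density" title, drop tick labels and marks
sans_scale <- base +
  labs(y = "density") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

# What the Y values mean: the curve is scaled so the area under it is ~1.
# Approximate that area with the trapezoid rule on base R's density() output.
d <- density(diamonds$carat)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

Printing `with_scale` and `sans_scale` side by side is a quick way to judge how much the tick marks actually add for a given audience; `area` should come out very close to 1.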
To further illustrate the posit, I recently made a series of what I call “rank ordered segment plots” for a [report](https://information.rapid7.com/rs/495-KNT-277/images/rapid7-research-report-national-exposure-index-060716.pdf) that we did at @Rapid7:
There are text annotations for countries at either end of the spectrum on the X-axis but they aren’t individually labeled cuz…ewwww that’d be messy. The interactive version (coming this week over at `community.rapid7.com`) has the full table and light hover popup-annotations. But the point wasn’t to really focus on the countries as it was to depict the sad state of the ratio of unencrypted vs encrypted for a given service type within a country.
So, _should_ the ProPublica authors have tried to be more discrete w/r/t their Y-axis or is it fine the way it is? Do there _always_ need to be discrete axis annotations or is there some wiggle room? Opines are welcome in the comments since I honestly don’t think there is “one answer to rule them all” for this.
And for those that really want to see more discrete info on the ProPublica Y-axis labels, here’s a static, faceted chart (you may need to click/select/tap the chart to make it big enough to view):
### Don’t Try This At Home!
ProPublica made that data available via two CSV files and the crosswalk org translation table via their main D3 javascript file (use Developer Tools “Inspect Element” to see such things). I ended up having to use `Sys.setlocale('LC_ALL','C')` and expand the translation table a bit due to some of the mixed encodings in the data sets. Code to make the chart is below.
library(ggplot2)
library(dplyr)
library(stringi)
library(hrbrmisc)
library(scales)
library(ggalt)
library(sitools)
# mixed encodings ftw!
Sys.setlocale('LC_ALL','C')
# different names in different data sets; sigh
org_crosswalk <- read.table(text='company,trans
"Adidas AG","Adidas AG"
"Allianz SE","Allianz SE"
"BASF SE","BASF SE"
"Bayer AG","Bayer AG"
"Bayerische Motoren Werke AG","BMW AG"
"BMW AG","BMW AG"
"Beiersdorf AG","Beiersdorf AG"
"Commerzbank AG","Commerzbank AG"
"Continental AG","Continental AG"
"Daimler AG","Daimler AG"
"Deutsche Bank AG","Deutsche Bank AG"
"Deutsche Boerse AG","Deutsche Boerse AG"
"Deutsche Lufthansa AG","Deutsche Lufthansa AG"
"Deutsche Post AG","Deutsche Post AG"
"Deutsche Telekom AG","Deutsche Telekom AG"
"E.ON","E.ON"
"Fresenius Medical Care AG & Co. KGaA","Fresenius Medical Care AG"
"Fresenius Medical Care AG","Fresenius Medical Care AG"
"Fresenius SE & Co KGaA","Fresenius SE & Co KGaA"
"HeidelbergCement AG","HeidelbergCement AG"
"Henkel AG & Co. KGaA","Henkel AG & Co. KGaA"
"Infineon Technologies AG","Infineon Technologies AG"
"K+S AG","K+S AG"
"Lanxess AG","Lanxess AG"
"Linde AG","Linde AG"
"Merck KGaA","Merck KGaA"
"MŸnchener RŸckversicherungs-Gesellschaft AG","Munich RE AG"
"M�nchener R�ckversicherungs-Gesellschaft AG","Munich RE AG"
"M\x9fnchener R\x9fckversicherungs-Gesellschaft AG","Munich RE AG"
"M?nchener R?ckversicherungs-Gesellschaft AG","Munich RE AG"
"Munich RE AG","Munich RE AG"
"RWE AG","RWE AG"
"SAP SE","SAP SE"
"Siemens AG","Siemens AG"
"ThyssenKrupp AG","ThyssenKrupp AG"
"Volkswagen AG","Volkswagen AG"', stringsAsFactors=FALSE, sep=",", quote='"', header=TRUE)
# quicker/less verbose than left_join()
org_trans <- setNames(org_crosswalk$trans, org_crosswalk$company)
# get and clean both data sets, being kind to the propublica bandwidth $
rec_url <- "https://projects.propublica.org/graphics/javascripts/dividend/record_dates.csv"
rec_fil <- basename(rec_url)
if (!file.exists(rec_fil)) download.file(rec_url, rec_fil)
records <- read.csv(rec_fil, stringsAsFactors=FALSE)
records %>%
select(company=1, year=2, record_date=3) %>%
mutate(record_date=as.Date(stri_replace_all_regex(record_date,
"([[:digit:]]+)/([[:digit:]]+)+/([[:digit:]]+)$",
"20$3-$1-$2"))) %>%
mutate(company=ifelse(grepl("Gesellschaft", company), "Munich RE AG", company)) %>%
mutate(company=org_trans[company]) -> records
div_url <- "https://projects.propublica.org/graphics/javascripts/dividend/dividend.csv"
div_fil <- basename(div_url)
if (!file.exists(div_fil)) download.file(div_url, div_fil)
dividends <- read.csv(div_fil, stringsAsFactors=FALSE)
dividends %>%
select(company=1, pricing_date=2, short_int_qty=3) %>%
mutate(pricing_date=as.Date(stri_replace_all_regex(pricing_date,
"([[:digit:]]+)/([[:digit:]]+)+/([[:digit:]]+)$",
"20$3-$1-$2"))) %>%
mutate(company=ifelse(grepl("Gesellschaft", company), "Munich RE AG", company)) %>%
mutate(company=org_trans[company]) -> dividends
# sitools::f2si() doesn't work so well for this for some reason, so mk a small helper function
m_fmt <- function (x) { sprintf("%d M", as.integer(x/1000000)) }
# gotta wrap'em all
subt <- wrap_format(160)("German companies typically pay shareholders one big dividend a year. With the help of U.S. banks, international investors briefly lend their shares to German funds that don’t have to pay a dividend tax. The avoided tax – usually 15 percent of the dividend – is split by the investors and other participants in the deal. These transactions cost the German treasury about $1 billion a year. [Y-axis == short interest quantity]")
gg <- ggplot()
# draw the markers for the dividends
gg <- gg + geom_vline(data=records,
aes(xintercept=as.numeric(record_date)),
color="#b2182b", size=0.25, linetype="dotted")
# draw the time series
gg <- gg + geom_line(data=dividends,
aes(pricing_date, short_int_qty, group=company),
size=0.15)
gg <- gg + scale_x_date(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), labels=m_fmt,
limits=c(0,800000000))
gg <- gg + facet_wrap(~company, scales="free_x")
gg <- gg + labs(x="Red, dotted line == Dividend date", y=NULL,
title="Tax Avoidance Has a Heartbeat",
subtitle=subt,
caption="Source: https://projects.propublica.org/graphics/dividend")
# devtools::install_github("hrbrmstr/hrbrmisc") or roll your own
gg <- gg + theme_hrbrmstr_an(grid="XY", axis="", strip_text_size=8.5,
subtitle_size=10)
gg <- gg + theme(axis.text=element_text(size=6))
gg <- gg + theme(panel.grid.major=element_line(size=0.05))
gg <- gg + theme(panel.background=element_rect(fill="#e2e2e233",
color="#e2e2e233"))
gg <- gg + theme(panel.margin=margin(10,10,20,10))
gg <- gg + theme(plot.margin=margin(20,20,20,20))
gg <- gg + theme(axis.title.x=element_text(color="#b2182bee", size=9, hjust=1))
gg <- gg + theme(plot.caption=element_text(margin=margin(t=5)))
gg
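Two of the cleanup steps above are easier to see on toy data. The following is only a sketch: the `crosswalk` and `raw` data frames are made up for illustration (they are not the ProPublica files), and it assumes the dates arrive as `m/d/yy` strings, which is what the regex in the post implies. It shows the named-vector lookup used in place of `left_join()`, plus base R's `as.Date()` with a format string as one possible alternative to the regex rewrite.

```r
# Hypothetical mini crosswalk: two spellings that should collapse to one name
crosswalk <- data.frame(
  company = c("Bayerische Motoren Werke AG", "BMW AG", "SAP SE"),
  trans   = c("BMW AG", "BMW AG", "SAP SE"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the raw spellings, values the clean ones
lookup <- setNames(crosswalk$trans, crosswalk$company)

# Made-up raw rows with "m/d/yy" dates (an assumption about the input format)
raw <- data.frame(
  company = c("SAP SE", "Bayerische Motoren Werke AG", "Unknown GmbH"),
  date    = c("5/13/16", "5/14/15", "1/2/14"),
  stringsAsFactors = FALSE
)

# Indexing by name does the translation; unmatched names come back as NA
raw$company <- unname(lookup[raw$company])

# Alternative to the regex rewrite: let as.Date() parse the format directly
raw$date <- as.Date(raw$date, format = "%m/%d/%y")

raw
```

One design note: the lookup silently turns any company missing from the crosswalk into `NA`, so a quick `anyNA(raw$company)` after the translation is a cheap way to spot names that still need an entry.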
4 Comments
It’s not entirely uncommon in theoretical physics to have “Quantity [arbitrary units]” on a y-axis, particularly when plotting a ratio or when the plotted thing has been transformed to within an inch of its life (common) to illustrate some feature (third derivative, etc…). It’s critical that the axis itself is labelled qualitatively (I don’t count the title as acceptable via inference) but sometimes the quantitative feature is apparent without an absolute scale, e.g. peak at x, relative heights (still maintaining relative scale to some baseline), relative slopes. As usual, the answer is probably “it depends.”
Thanks for your rant.
A quick question: what is the library “hrbrmisc”: where could it be available?
Sorry for the lame comment; after digging around the internet a bit more, I found a very convenient way to install the library.
(you can erase both comments if you want)
As silly as this sounds: I don’t demand labels from those I trust, I trust very few, therefore I always demand axis labels. But outside of the scientific world I think that data are used to tell a story, which should have less stringent rules, so I’m more lenient on it, especially if it’s telling. The enormous annual temporal deviations in those graphs need not be scrutinized by my eyes if their point is that the pattern exists. And as an aside, back in the scientific world, and as a special circumstance, I recently gave a collaborator a graph with neither x nor y axes, a 4×4 panel illustrating the change of some parameters of an equation. I thought that the figure didn’t need them because it captured everything within a qualitative range of behaviours. Despite that, we decided that they should be labeled.
2 Trackbacks/Pingbacks
[…] article was first published on R – rud.is, and kindly contributed to […]