I posted a visualization of email safety status (a.k.a. DMARC) of the Fortune 500 (2017 list) the other day on Twitter and received this spiffy request from @MarkAltosaar:
Would you be willing to add the R code used to produce this to your vignette for ggchicklet? I would love to see how you arranged the factors since it is a proportion. Every time I try something like this I feel like my code becomes very complex.
— Mark Altosaar (@MarkAltosaar) September 26, 2019
There are many ways to achieve this result. I’ll show one here and walk through the process starting with the data (this is the 2018 DMARC evaluation run):
library(hrbrthemes) # CRAN or fav social coding site using hrbrmstr/pkgname
library(ggchicklet) # fav social coding site using hrbrmstr/pkgname
library(tidyverse)
f500_dmarc <- read_csv("https://rud.is/dl/f500-industry-dmarc.csv.gz", col_types = "cc")
f500_dmarc
## # A tibble: 500 x 2
##    industry               p
##    <chr>                  <chr>
##  1 Retailing              Reject
##  2 Technology             None
##  3 Health Care            Reject
##  4 Wholesalers            None
##  5 Retailing              Quarantine
##  6 Motor Vehicles & Parts None
##  7 Energy                 None
##  8 Wholesalers            None
##  9 Retailing              None
## 10 Telecommunications     Quarantine
## # … with 490 more rows
The p column is the DMARC classification for each organization (org names have been withheld to protect the irresponsible) and comes from the p=… value in the DMARC DNS TXT record field. It has a limited set of values, so let’s enumerate them and assign some colors:
dmarc_levels <- c("No DMARC", "None", "Quarantine", "Reject")
dmarc_cols <- set_names(c(ft_cols$slate, "#a6dba0", "#5aae61", "#1b7837"), dmarc_levels)
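If you're wondering how a raw TXT record could turn into one of those four labels, here's a minimal sketch. The parse_dmarc_policy() helper is hypothetical (it's not how the data file was actually built), and mapping a missing record to "No DMARC" is an assumption on my part:

```r
# Hypothetical helper (not from the original post): pull the policy tag out of
# a raw DMARC TXT record string and title-case it to match the labels above.
parse_dmarc_policy <- function(txt) {
  # Assumption: a missing or non-DMARC record gets the "No DMARC" label
  if (is.na(txt) || !startsWith(txt, "v=DMARC1")) return("No DMARC")
  # Records are semicolon-separated tag=value pairs
  tags <- trimws(strsplit(txt, ";", fixed = TRUE)[[1]])
  # Keep only the `p=` tag (this skips `sp=`, the subdomain policy)
  p_tag <- tags[startsWith(tags, "p=")]
  if (length(p_tag) == 0) return("No DMARC")
  value <- tolower(trimws(sub("^p=", "", p_tag[1])))
  paste0(toupper(substring(value, 1, 1)), substring(value, 2))
}

# Made-up example records
records <- c(
  "v=DMARC1; p=reject; rua=mailto:dmarc@example.com",
  "v=DMARC1; sp=none; p=quarantine",
  NA
)
policies <- vapply(records, parse_dmarc_policy, character(1), USE.NAMES = FALSE)
policies
```

The last record is NA to stand in for a domain with no DMARC record at all.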
We want the aggregate count of each p value within each industry, so we need to do some counting:
(dmarc_summary <- count(f500_dmarc, industry, p))
## # A tibble: 63 x 3
##    industry            p              n
##    <chr>               <chr>      <int>
##  1 Aerospace & Defense No DMARC       9
##  2 Aerospace & Defense None           3
##  3 Aerospace & Defense Quarantine     1
##  4 Apparel             No DMARC       4
##  5 Apparel             None           1
##  6 Business Services   No DMARC       9
##  7 Business Services   None           7
##  8 Business Services   Reject         4
##  9 Chemicals           No DMARC      12
## 10 Chemicals           None           2
## # … with 53 more rows
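If count() feels a bit magical, it may help to see it next to the longhand version. This is a sketch on a tiny made-up data frame (the toy rows are mine, not from the real file):

```r
library(dplyr)

# Made-up rows in the same shape as f500_dmarc
toy <- data.frame(
  industry = c("Retailing", "Retailing", "Retailing", "Energy", "Energy"),
  p        = c("Reject", "None", "None", "None", "No DMARC"),
  stringsAsFactors = FALSE
)

# The shortcut used in the post
by_count <- count(toy, industry, p)

# The longhand equivalent: group, then tally rows per group
by_hand <- toy %>%
  group_by(industry, p) %>%
  summarise(n = n(), .groups = "drop")

by_count
```

Both give one row per industry/p combination with the number of organizations in n.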
We’re also going to want to sort the industries by those with the most DMARC (sorted bars/chicklets FTW!). We’ll need a factor for that, so let’s make one:
(dmarc_summary %>%
  filter(p != "No DMARC") %>% # we don't care abt this `p` value
  count(industry, wt = n, sort = TRUE) -> industry_levels)
## # A tibble: 21 x 2
##    industry                      n
##    <chr>                     <int>
##  1 Financials                   54
##  2 Technology                   25
##  3 Health Care                  24
##  4 Retailing                    23
##  5 Wholesalers                  16
##  6 Energy                       12
##  7 Transportation               12
##  8 Business Services            11
##  9 Industrials                   8
## 10 Food, Beverages & Tobacco     6
## # … with 11 more rows
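Since the factor arranging is the bit that tends to get tangled, here's the same step in miniature, using just the first eight rows of the dmarc_summary printout above. The toy_ names are only for illustration:

```r
library(dplyr)

# First eight rows of the dmarc_summary output shown earlier
toy_summary <- data.frame(
  industry = c(rep("Aerospace & Defense", 3), rep("Apparel", 2), rep("Business Services", 3)),
  p        = c("No DMARC", "None", "Quarantine", "No DMARC", "None", "No DMARC", "None", "Reject"),
  n        = c(9L, 3L, 1L, 4L, 1L, 9L, 7L, 4L),
  stringsAsFactors = FALSE
)

# `wt = n` sums the existing counts instead of tallying rows;
# `sort = TRUE` puts the biggest total first
toy_levels <- toy_summary %>%
  filter(p != "No DMARC") %>%
  count(industry, wt = n, sort = TRUE)

# Reverse the levels so the biggest industry ends up on top after coord_flip(),
# which draws the first factor level at the bottom
industry_factor <- factor(toy_summary$industry, levels = rev(toy_levels$industry))
levels(industry_factor)
```

The levels come out smallest-first, which is what puts the largest industry at the top of the flipped chart.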
Now, we can make the chart:
dmarc_summary %>%
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>%
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>%
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p)) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_continuous(expand = c(0,0), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")
Doh! We rly want them to be 100% width. Thankfully, {ggplot2} has a position_fill() we can use instead of the default stacking position:
dmarc_summary %>%
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>%
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>%
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p), position = position_fill()) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_continuous(expand = c(0,0), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")
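For the curious, the arithmetic behind a filled position is just "divide each segment by its bar's total". Here's a base R sketch of that idea on the first eight rows of the dmarc_summary printout. It's the concept done by hand, not ggplot2's internal code:

```r
# First eight rows of the dmarc_summary output shown earlier
toy_summary <- data.frame(
  industry = c(rep("Aerospace & Defense", 3), rep("Apparel", 2), rep("Business Services", 3)),
  p        = c("No DMARC", "None", "Quarantine", "No DMARC", "None", "No DMARC", "None", "Reject"),
  n        = c(9, 3, 1, 4, 1, 9, 7, 4),
  stringsAsFactors = FALSE
)

# Share of each segment within its own industry (what a filled bar displays)
toy_summary$share <- ave(
  toy_summary$n, toy_summary$industry,
  FUN = function(x) x / sum(x)
)

# Every industry's segments now add up to 1, no matter how many orgs it has
stack_heights <- tapply(toy_summary$share, toy_summary$industry, sum)
stack_heights
```

That's why every bar ends up the same length: the raw industry sizes drop out and only the within-industry mix remains.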
Doh! We forgot to use reverse = TRUE in the call to position_fill(), and on top of that the industries are out of order. Kinda. They’re in the order we told them to be in, but that’s not right b/c we need them ordered by the in-industry percentages, not the raw counts. If each industry had the same number of organizations, there would not have been an issue. Unfortunately, the folks who make up these lists care not about our time. Let’s re-compute the industry factor by computing the percents:
(dmarc_summary %>%
  group_by(industry) %>%
  mutate(pct = n/sum(n)) %>%
  ungroup() %>%
  filter(p != "No DMARC") %>%
  count(industry, wt = pct, sort = TRUE) -> industry_levels)
## # A tibble: 21 x 2
##    industry               n
##    <chr>              <dbl>
##  1 Transportation     0.667
##  2 Technology         0.641
##  3 Wholesalers        0.615
##  4 Financials         0.614
##  5 Health Care        0.6
##  6 Business Services  0.55
##  7 Food & Drug Stores 0.5
##  8 Retailing          0.5
##  9 Industrials        0.444
## 10 Telecommunications 0.375
## # … with 11 more rows
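To see why the two orderings disagree, compare a few industries side by side. In this sketch the DMARC counts come from the first industry_levels printout, while the industry totals are my own back-computation from the printed counts and percents, so treat them as approximate:

```r
toy <- data.frame(
  industry = c("Financials", "Technology", "Transportation"),
  dmarc    = c(54, 25, 12),  # orgs with any DMARC record (from the count-based output)
  total    = c(88, 39, 18),  # assumed industry sizes, back-computed from the percents
  stringsAsFactors = FALSE
)
toy$pct <- toy$dmarc / toy$total

# Rank by raw count vs. rank by in-industry share
by_count <- toy$industry[order(-toy$dmarc)]
by_pct   <- toy$industry[order(-toy$pct)]

by_count
by_pct
```

Financials wins on count simply because it has the most organizations, but it drops to last of these three once each industry is scaled to its own size.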
Now, we can go back to using position_fill(), this time with reverse = TRUE (plus scale_y_percent() from {hrbrthemes} for the axis labels):
dmarc_summary %>%
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>%
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>%
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p), position = position_fill(reverse = TRUE)) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_percent(expand = c(0, 0.001), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")
FIN
As noted, this is one way to handle this situation. I’m not super happy with the final visualization here as it doesn’t have the counts next to the industry labels and I like to have the ordering by both count and more secure configuration (so, conditional on higher prevalence of Quarantine or Reject when there are ties). That is an exercise left to the reader 😎.
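For anyone who wants a head start on that exercise, here's one rough sketch of the data prep (not the plot). It uses the first eight rows of the dmarc_summary printout, and the column names dmarc_pct, strict_pct and label are made up for illustration. Treating Quarantine plus Reject as the tie-breaker is just one reading of "more secure":

```r
library(dplyr)

# First eight rows of the dmarc_summary output shown earlier
toy_summary <- data.frame(
  industry = c(rep("Aerospace & Defense", 3), rep("Apparel", 2), rep("Business Services", 3)),
  p        = c("No DMARC", "None", "Quarantine", "No DMARC", "None", "No DMARC", "None", "Reject"),
  n        = c(9L, 3L, 1L, 4L, 1L, 9L, 7L, 4L),
  stringsAsFactors = FALSE
)

industry_order <- toy_summary %>%
  group_by(industry) %>%
  summarise(
    total      = sum(n),                                             # orgs in the industry
    dmarc_pct  = sum(n[p != "No DMARC"]) / total,                    # share with any DMARC
    strict_pct = sum(n[p %in% c("Quarantine", "Reject")]) / total,   # share with an enforcing policy
    .groups = "drop"
  ) %>%
  arrange(desc(dmarc_pct), desc(strict_pct)) %>%  # strict_pct only matters on ties
  mutate(label = sprintf("%s (%d)", industry, total))

# Levels to feed into factor() for a flipped chart (first level draws at the bottom)
label_levels <- rev(industry_order$label)
industry_order$label
```

Joining label back onto the summary and using it in place of industry in the aes() call would put the counts next to each industry name.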
3 Comments
I really love these chicklets, they are so visually appealing! I wondered if it takes much effort to add these great aesthetics (rounded corners, white stroke …) to other positionings like in geom_rect() or geom_tile()?
#ty for checking out the blog and the kind words abt the pkg! Aye. You can use (note the triple :::) ggchicklet:::geom_rrect() for geom_rect() with a radius component and statebins:::geom_rtile() (https://git.rud.is/hrbrmstr/statebins/src/branch/master/R/geom-rtile.R) for the geom_tile() equivalent with rounded corners. I haven’t exposed them yet since I’m not sure which pkg they shld go in to be general purpose geoms.
awesome! thx!