This is a brief (and likely obvious, for some folks) post on the dplyr::case_when()
function.
Part of my work-work is dealing with data from internet scans. When we’re performing a deeper inspection of a particular internet protocol or service we try to capture as much system and service metadata as possible. Sifting through said metadata to find individual and collective insight is often a painful task given the diversity in the internet ecosystem.
One attribute we try to collect in all our service scans is operating system (OS) version. For many of our minutiae-focused researchers, it’s vital to know if a host is using “CoreOS 899.17.0” vs “CoreOS 835.9.0”. For much of the aggregation and clustering work we do, “CoreOS” is just fine.
In broad scans for any given service the OS diversity can be YUGE. There may be upwards of 10 different variations each of Windows, Red Hat, Ubuntu, Debian, et. al. present along with a smattering of very highly infrequent OS-types such as “Scientific Linux”. Plus, we can always count on probes returning many NA
values for many discrete attribute queries, including OS type+version.
There are many ways to reduce a diverse list of OS type+version strings to a reduced target set. switch()
and ifelse()
are likely go-to solutions for many of you reading this. If you are in those camps and haven’t tried dplyr::case_when()
read on!
Noise Reduction
To illustrate the utility of case_when()
, let’s walk through an example. I created a tiny excerpt of just the OS type + version info from 500 observations out of a much larger internet scan. You can find that data at https://rud.is/dl/os.txt. Let’s take a look at the OS diversity:
library(ggalt)
library(hrbrthemes)
library(tidyverse)
os <- read_lines("https://rud.is/dl/os.txt", na = "NA")
str(table(os, useNA = "always"))
## 'table' int [1:28(1d)] 2 3 1 1 1 44 3 101 1 6 ...
## - attr(*, "dimnames")=List of 1
## ..$ os: chr [1:28] "" "<unknown>" "Amazon Linux AMI 2016.03" "Amazon Linux AMI 2016.09" ...
sort(unique(os))
## [1] ""
## [2] "<unknown>"
## [3] "Amazon Linux AMI 2016.03"
## [4] "Amazon Linux AMI 2016.09"
## [5] "Arch Linux"
## [6] "CentOS Linux 7 (Core)"
## [7] "CoreOS 766.4.0"
## [8] "CoreOS 899.17.0"
## [9] "Debian GNU/Linux 7 (wheezy)"
## [10] "Debian GNU/Linux 8 (jessie)"
## [11] "Fedora 20 (Heisenbug)"
## [12] "linux"
## [13] "openSUSE Leap 42.2"
## [14] "RancherOS v0.7.0"
## [15] "Red Hat Enterprise Linux Server 7.2 (Maipo)"
## [16] "Red Hat Enterprise Linux Server 7.3 (Maipo)"
## [17] "Ubuntu 14.04.1 LTS"
## [18] "Ubuntu 14.04.2 LTS"
## [19] "Ubuntu 14.04.3 LTS"
## [20] "Ubuntu 14.04.4 LTS"
## [21] "Ubuntu 14.04.5 LTS"
## [22] "Ubuntu 15.10"
## [23] "Ubuntu 16.04.1 LTS"
## [24] "Ubuntu 16.04.2 LTS"
## [25] "Ubuntu 16.10"
## [26] "Windows Server 2016 Datacenter"
## [27] "Windows Server 2016 Standard"
There are 29 (including NA
) different strings in just a tiny excerpt. Ugh.
If we want to group all Windows results as “Windows”, all Red Hat, CentOS and Fedora results as “Fedora”, all Ubuntu and Debian results as “Debian” and all CoreOS and Amazon results as “Amazon” while keeping NA
_s_ NA
and lumping everything else as “Other” it’s super-easy with case_when()
:
ELSE <- TRUE
case_when(
grepl("Windows", os) ~ "Windows-ish",
grepl("Red Hat|CentOS|Fedora", os) ~ "Fedora-ish",
grepl("Ubuntu|Debian", os) ~ "Debian-ish",
grepl("CoreOS|Amazon", os) ~ "Amazon-ish",
is.na(os) ~ "Unknown",
ELSE ~ "Other"
) %>%
table() %>%
as_data_frame() %>%
set_names(c("os", "Node Count")) %>%
arrange(`Node Count`) %>%
mutate(os = factor(os, os)) %>%
ggplot(aes(`Node Count`, os)) +
geom_lollipop(horizontal = TRUE, size=1.5, color="#54278f") +
scale_x_comma(limits=c(0,300)) +
labs(y=NULL, title="OS Types") +
theme_ipsum_rc(grid="X")
The clever formula (~
) syntax used by case_when()
enables you to cleanly and effortlessly reduce factor/categorical levels and also lets you preserve NA
values (which I translated to “Unknown”). Since ELSE
is used in the SQL CASE
statement and dplyr::case_when()
is a riff of said SQL cousin, I like to use an assigned ELSE
to make it more visually explicit, but using TRUE
is just as good (and, perhaps, better since TRUE
can’t get namespace clobbered like the ELSE
variable can).
FIN
If you’re in sequential or nested ifelse()
Hades or are frustrated by switch()
limitations, give dplyr::case_when()
a try for your next project.
Epilogue
Not enough time earlier to add other methods, so this hint from @drob will have to suffice for now:
@hrbrmstr alternative: fuzzyjoin's regex_left_join? Advantages:
1. regexes can be in config file
2. one string can match multiple regexes pic.twitter.com/9IavUTEDMJ— David Robinson (@drob) March 10, 2017