hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

R⁶ — RStudio Server Client? Make An App For That!

RStudio is a great way to work through analyses tasks, and I suspect most folks use the “desktop” version of the product on their local workstations.

The fine folks at RStudio also make a server version (the codebase for RStudio is able to generate server or desktop and they are generally in 100% feature parity when it comes to interactive use). You only need to set it up on a compatible server and then access it via any modern web browser to get virtually the same experience you have on the desktop.

I use RStudio Server as well as RStudio Desktop and have never been comfortable mixing web browsing tasks and analysis tasks in the same browser (it’s one of the many reasons I dislike jupyter notebooks). I also keep many apps open and inevitably would try to cmd-tab (I’m on macOS) between apps to find the RStudio server one only to realize I shld have been keyboard tabbing through Chrome tabs.

Now, it’s not too hard to fire up a separate Chrome or Safari instance to get a separate server but it’d be great if there was a way to make it “feel” more like an app — just like RStudio Desktop. Well, it turns out there is a way using nativefier.

If you use the Slack standalone desktop client, the Atom text editor or a few other “modern” apps, they are pretty much just web pages wrapped in a browser shell using something like Electron. Jia Hao came up with the idea of being able to do the same thing for any web page.

To create a standalone RStudio Server client for a particular RStudio Server instance, just do the following after installing nativefier:

nativefier "https://rstudio.example.com:8787/" --name "RStudio @ Example"

Replace the URL with the one you currently use in-browser (and, please consider using SSL/TLS when connecting over the public internet) and use any name that will be identifiable to you. You get a safe, standalone application and will never have to worry about browser crashes impacting your workflow.

There are many customizations you can make to the app shell and you can even use your own icons to represent servers differently. It’s all super-simple to setup and get working.

Note that for macOS folks there has been a way pre-nativefier to do this same thing with a tool called Fluid but it uses the Apple WebKit/Safari shell vs the Chrome shell and I prefer the Chrome shell and cross-platform app-making ability.

Hopefully this quick R⁶ tip will help you corral your RStudio Server connections. And, don’t forget to join in on the R⁶ bandwagon and share your own quick tips, snippets and hints to help the broader R community level-up their #rstats work.

Simple Offline Airport Wi-Fi Tracker in R

@visualisingdata rebroadcast this tweet today:

Wireless Passwords From Airports And Lounges Around The World https://t.co/OV0WJfwj8E

— deb verhoeven (@bestqualitycrab) April 2, 2017

The Google Maps interface is a bit meh and the “formatted” data is also a bit meh but the data is useful when travelling (NOTE: always use a VPN in airports on both your mobile devices and laptops).

You don’t need their app or online access to take advantage of this data if you’re willing to do a tiny bit of work. apply

You’ll find an R-generated version of the airport Wi-Fi finder below or via this direct link.

The premise is simple:

grab the airport data KML that the Google Maps interface uses (NOTE: Visit the original URL from the Tweet occasionally to see if the KML URL changes)
make cleaner, more formatted text for the popups
toss up a leaflet map
add a searchable DT::datatable interface
render the R markdown page to HTML and keep it with you
refresh the generated HTML right before you go on your world tour

The code is directly embedded in the generated R markdown document and also in this gist so I won’t pollute this post with code blocks.

I’m hoping @bhaskar_vk will apply his mad, l33t Leaflet h@x0r $k1llz to this crude, quick hack and kick this up a notch as there is substantial room for improvement. If you give that a go as well, drop a note in the comments or on Twitter. Some ideas for improvement:

better markers (please, not those airplane ones :-)
way better text cleanup
link the table and map (I think that means using Shiny but I’m likely wrong about that)
make it better on mobile (it works on mobile but there are different design considerations to make it more usable on tiny glowing rectangles)
make a “trip planner” Shiny app, letting folks select their airports and produce a handy digital reference card for them as they move about the globe

Superclassing to R⁶

To avoid “branding” confusion with R⁴ I’m superclassing it to R⁶ and encouraging others in the R community to don the moniker and do their own small, focused posts on topics that would help the R community learn things. Feel free to use R⁶ (I’ll figure out an acronym later). Feel free to tag your posts as R⁶ (or r6) and use the moniker as you see fit.

I’ll eventually tag the 2 current “r4” posts as “r6”.

Hopefully we can link together a cadre of R⁶ posts into a semi-organized structure that all can benefit from.

R⁶ — Snow Day Facets

Back in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017 and we’re slated to get 12-18″ of snow up here in Maine and @mrshrbrmstr asked how often this — snow in May — has occurred near us.

As with all of these “R⁶ posts, expository is minimal and the focus is generally to demonstrate one small concept.

What I’ve done here (first) is make a full tidyverse update to the snowfirst code posted in the aforementioned blog post. You’ll need to clone that repo if you’re trying to work verbatim from the code below (otherwise just change file path code):

library(rprojroot)
library(stringi)
library(hrbrthemes)
library(tidyverse)

pre <- find_rstudio_root_file()

# Get and read in Maine precip ------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/state17_ME.txt.gz"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = fil,
         col_positions = fwf_widths(c(6, 4, 2, 4, rep(c(5, 1, 1, 1), 31)),
                                    col_names = c("coop_id", "year", "month", "element",
                                                  flatten_chr(map(1:31, ~paste("r_", c("v", "fm", "fq", "fs"),
                                                                               .x, sep=""))))),
         col_types = paste0("ciic", paste0(rep("iccc", 31), collapse=""), collapse=""),
         na = c("", "NA", "-", "-9999")) %>%
  gather(day, value, starts_with("r_v")) %>%
  select(-starts_with("r_")) %>%
  mutate(day = as.numeric(stri_replace_first_fixed(day, "r_v", ""))) %>%
  mutate(date = sprintf("%s-%02d-%02d", year, month, day)) -> daily_wx

# Read in stations ------------------------------------------------------------------

URL <- "http://cdiac.ornl.gov/ftp/ushcn_daily/ushcn-stations.txt"
fil <- file.path(pre, "data", basename(URL))
if (!file.exists(fil)) download.file(URL, fil)

read_fwf(file = file.path(pre, "data", "ushcn-stations.txt"),
         col_positions = fwf_widths(c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                                    col_names = c("coop_id", "latitude", "longitude",
                                                  "elevation", "state", "name",
                                                  "component_1", "component_2",
                                                  "component_3", "utc_offset")),
         col_types = "cdddcccccc") -> stations

closestStation <- function(stations, lat, lon, restrict_to = NULL) {
  if (!is.null(restrict_to)) stations <- filter(stations, state == restrict_to)
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                            (stations$longitude-lon)^2))
  stations[index,]
}

# compute total snow amounts per month ----------------------------------------------

(near_me <- closestStation(stations, 43.2672, -70.8617, restrict_to="ME"))

Now that we have the data, the short lesson here is just exposing the fact that you can get blank facets for free with ggplot2. I’m pointing this out as many folks seem to not like reading R documentation or miss things in said documentation (in fact, I had to be instructed today by @thomasp85 about a ggplot2 theme element setting that I didn’t know about and should have since I do try to keep up).

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(
    n = n / 10, # readings are in 10ths of inches
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = lubridate::month(date, TRUE, FALSE)
  ) %>%
  ggplot(aes(x=date, y=n)) +
  geom_segment(aes(xend=date, yend=0), size=0.75, color="#9ecae1") +
  scale_y_continuous(limits=c(0, 65)) +
  facet_wrap(~month_name, ncol=3, drop=FALSE, scales="free") +
  labs(x=NULL, y="inches", title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="Y", axis_text_size=8)

Without ggplot2 helping us out we would have had to do some work to have those no-value facets to show up. I also like how there are no x-axis labels since there’s no data. ggplot2::facet_wrap() has many, very granular options for customizing the appearance of facets:

facet_wrap(facets, nrow = NULL, ncol = NULL, scales = "fixed",
           shrink = TRUE, labeller = "label_value", as.table = TRUE,
           switch = NULL, drop = TRUE, dir = "h", strip.position = "top")

If you haven’t played with them, you can use this example to try them out.

Fin

Even though that visualization gets the message across, I kinda like this view a bit better:

filter(daily_wx, coop_id == near_me$coop_id, element=="SNOW", value>0) %>%
  count(year, month, wt=value) %>%
  ungroup() %>%
  mutate(n = n / 10) %>%
  complete(year, month=1:12) %>%
  mutate(
    date = as.Date(sprintf("%s-%02d-01", year, month)),
    month_name = factor(lubridate::month(date, TRUE, FALSE), levels=rev(month.name))
  ) %>% 
  ggplot(aes(year, month_name)) +
  geom_tile(aes(fill=n), color="#b2b2b2", size=0.15) +
  scale_x_continuous(expand=c(0,0.15), position="top") +
  viridis::scale_fill_viridis(name = "Total inches", na.value="white") +
  labs(x=NULL, y=NULL, title="Total snowfall in a given month by year",
       subtitle="Data for Station id 176905 — Portland (Maine) Jetport") +
  theme_ipsum_rc(grid="", axis_text_size = 10) +
  guides(fill=guide_colourbar(label.position = "top", direction = "horizontal", title.vjust = 0)) +
  theme(legend.title = element_text(size=10)) +
  theme(legend.key.height = unit(0.5, "lines")) +
  theme(legend.position = c(0.9, 1.25))

The precision is lacking in the heatmap view, but you get a quick impression of when it has/hasn’t snowed. Plus you get to use viridis ?

All the updated code in in the snowfirst repo.

Crank you your own, small code snippets or ideas to the R community. R⁶ is an open tag and perhaps we can band together to make a distributed cadre of helpful, digestible posts the R community can benefit from.

All-in on R⁶ : Progress [bars] on first post

@eddelbuettel’s idea is a good one. (it’s a quick read…jump there and come back). We’ll avoid confusion and call it R⁶ over here. Feel free to don the superclass.

I often wait for a complete example or new package announcement to blog something when a briefly explained snippet might have sufficient utility for many R users. Also, tweets are fleeting and twitter could end up on the island of misfit social media sites if it can’t generate revenue or find a giant buyer this year. Don’t get me wrong, twitter convos are fine/useful, but blogs are at least semi-permanent, especially if you let them be hoovered up by the Internet Archive (“Save Page Now” on their site or use this handy Chrome extension).

I’ll tag all R⁶ posts as “r6” if you want to auto-filter those out of your stream or just page through them.

I’ll also lead off the these micro-posts with a simple one: adding progress bars to your tidyverse purrr operations.

The purrr::map* functions enable expressive and type-safe vectorized operations. Mine are usually over a few million/billion IPv4 addresses or domain names/URLs and often involve moderately lengthy tasks so I usually add the ability to incorporate progress bars to functions I make (and, I’m trying hard to get out of the bad habit of long-ish anonymous functions in purrr calls). The following is a toy example, but it’s a working example you can run in your interactive R session now:

library(tidyverse)

arduously_long_nchar <- function(input_var, .pb=NULL) {
  
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  
  Sys.sleep(1)
  
  nchar(input_var)
  
}

pb <- progress_estimated(length(letters))

map_int(letters, arduously_long_nchar, .pb=pb)

And, yes, I did make you wait ~26 seconds (unless you were intrepid enough to reduce the amount of sleep time :-)

If you happen to forget the progress bar object (or know you don’t need one):

map_int(letters, arduously_long_nchar)

the function still works (sans progress bars).

If you happen to also mess up what you pass in to the .pb parameter or get your progress bar out of sync with your object it won’t error out on you (it can be made much safer and wrapped in another function, say — tick_off(.pb) — but this is supposed to be a small post).

Comments/feedback/your-own-progress-methods are most welcome and encouraged.

Exploring 2017 Retail Store Closings with R

A story about one of the retail chains (J.C. Penny) releasing their list of stores closing in 2017 crossed paths with my Feedly reading list today and jogged my memory that there were a number of chains closing many of their doors this year, and I wanted to see the impact that might have on various states.

I’m also doing this to add one more example of:

scraping (with content caching)
data cleaning
per-capita normalization
comparing salient information to other indicators

to the growing list of great examples out there by the extended R community. Plus, I feel compelled to try to keep up with @ma_salmon’s blogging pace.

Let’s jump right in…

library(httr)
library(rvest)
library(knitr)
library(kableExtra)
library(ggalt)
library(statebins)
library(hrbrthemes)
library(tidyverse)

options(knitr.table.format = "html")
update_geom_font_defaults(font_rc_light, size = 2.75)

“Closing” lists of four major retailers — K-Mart, Sears, Macy’s and J.C. Penny — abound (HTML formatting a list seems to be the “easy way out” story-wise for many blogs and newspapers). We can dig a bit deeper than just a plain set of lists, but first we need the data.

The Boston Globe has a nice, predictable, mostly-uniform pattern to their list-closing “stories”, so we’ll use that data. Site content can change quickly, so it makes sense to try to cache content whenever possible as we scrape it. To that end, we’ll use httr::GET vs xml2::read_html since GET preserves all of the original request and response information and read_html returns an external pointer that has no current support for serialization without extra work.

closings <- list(
  kmart = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-kmart-stores-closing-around/4kJ0YVofUWHy5QJXuPBAuM/story.html",
  sears = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-sears-stores-closing-around/yHaP6nV2C4gYw7KLhuWuFN/story.html",
  macys = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-macy-stores-closing-around/6TY8a3vy7yneKV1nYcwY7K/story.html",
    jcp = "https://www.bostonglobe.com/business/2017/03/17/the-full-list-penney-stores-closing-around/vhoHjI3k75k2pSuQt2mZpO/story.html"
)

saved_pgs <- "saved_store_urls.rds"

if (file.exists(saved_pgs)) {
  pgs <- read_rds(saved_pgs)
} else {
  pgs <- map(closings, GET)
  write_rds(pgs, saved_pgs)
}

This is what we get from that scraping effort:

map(pgs, content) %>%
  map(html_table) %>%
  walk(~glimpse(.[[1]]))
## Observations: 108
## Variables: 3
## $ X1 <chr> "300 Highway 78 E", "2003 US Hwy 280 Bypass", "3600 Wilson ...
## $ X2 <chr> "Jasper", "Phenix City", "Bakersfield", "Coalinga", "Kingsb...
## $ X3 <chr> "AL", "AL", "CA", "CA", "CA", "CA", "CO", "CO", "CT", "FL",...
## Observations: 42
## Variables: 4
## $ X1 <chr> "301 Cox Creek Pkwy", "1901 S Caraway Road", "90 Elm St; En...
## $ X2 <chr> "Florence", "Jonesboro", "Enfield", "Lake Wales", "Albany",...
## $ X3 <chr> "AL", "AR", "CT", "FL", "GA", "GA", "IN", "KS", "KY", "LA",...
## $ X4 <chr> "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y", "Y", "Y", "Y",...
## Observations: 68
## Variables: 6
## $ X1 <chr> "Mission Valley Apparel", "Paseo Nuevo", "*Laurel Plaza", "...
## $ X2 <chr> "San Diego", "Santa Barbara", "North Hollywood", "Simi Vall...
## $ X3 <chr> "CA", "CA", "CA", "CA", "FL", "FL", "FL", "FL", "FL", "GA",...
## $ X4 <int> 385000, 141000, 475000, 190000, 101000, 195000, 143000, 140...
## $ X5 <int> 1961, 1990, 1995, 2006, 1995, 2000, 1977, 1974, 2000, 1981,...
## $ X6 <int> 140, 77, 105, 105, 68, 83, 86, 73, 72, 69, 9, 57, 54, 87, 5...
## Observations: 138
## Variables: 3
## $ Mall/Shopping Center <chr> "Auburn Mall", "Tannehill Promenade", "Ga...
## $ City                 <chr> "Auburn", "Bessemer", "Gadsden", "Jasper"...
## $ State                <chr> "AL", "AL", "AL", "AL", "AR", "AR", "AZ",...

We now need to normalize the content of the lists.

map(pgs, content) %>%
  map(html_table) %>%
  map(~.[[1]]) %>%
  map_df(select, abb=3, .id = "store") -> closings

We’re ultimately just looking for city/state for this simple exercise, but one could do more precise geolocation (perhaps with rgeocodio) and combine that with local population data, job loss estimates, current unemployment levels, etc. to make a real story out of the closings vs just do the easy thing and publish a list of stores.

count(closings, abb) %>%
  left_join(data_frame(name = state.name, abb = state.abb)) %>%
  left_join(usmap::statepop, by = c("abb"="abbr")) %>%
  mutate(per_capita = (n/pop_2015) * 1000000) %>%
  select(name, n, per_capita) -> closings_by_state

(NOTE: you can get the code for the entire Rmd via RPubs or GitHub)

Compared to unemployment/underutilization

I’d have used the epidata package to get the current state unemployment data but it’s not quite current enough, so we can either use a package to get data from the Bureau of Labor Statistics or just scrape it. A quick coin-flip says we’ll scrape the data.

We’ll use the U-6 rate since that is an expanded definition of “underutilization” including “total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers” and is likely to more representative for the populations working at retail chains. I could be wrong. I only play an economist on ?. If you are an economist, please drop a note telling me where I errd in my thinking ?

pg <- read_html("https://www.bls.gov/lau/stalt16q4.htm")

html_nodes(pg, "table#alternmeas16\\:IV") %>% 
  html_table(header = TRUE, fill = TRUE) %>%
  .[[1]] %>% 
  docxtractr::assign_colnames(1) %>% 
  rename(name=State) %>% 
  as_data_frame() %>% 
  slice(2:52) %>% 
  type_convert() %>% 
  left_join(closings_by_state, by="name") %>% 
  filter(!is.na(n)) -> with_unemp

ggplot(with_unemp, aes(per_capita, `U-6`)) +
  geom_label(aes(label=name), fill="#8c96c6", color="white", size=3.5, family=font_rc) +
  scale_x_continuous(limits=c(-0.125, 6.75)) +
  labs(x="Closings per-capita (1MM)", 
       y="BLS Labor Underutilization (U-6 rate)",
       title="Per-capita store closings compared to current BLS U-6 Rate") +
  theme_ipsum_rc(grid="XY")

If I were a reporter (again, I only play one on ?), I think I’d be digging a bit deeper on the impact of these (and the half-dozen or so other) retailers closing locations in New Mexico, Nevada, West Virginia, Wyoming, (mebbe Maine, though I’m super-b ased :-), North Dakota & South Dakota. I also hope @Marketplace does a few more stories on the changing retail landscape in the U.S. over the coming months to see if there are any overt consequences to the loss of jobs and anchor stores.

If you end up tinkering with the data, drop a note in the comments if something you discover piques your interest. For those interested in potentially marrying the data up with some additional cartography, there should be enough precision the store lists to get distinct enough lat/lon paris after geocoding (I did a quick test with rgeocodio) to make some interesting map views, especially if you can find more store closing lists.

Think Twice Before Using Ohmconnect

2017-03-12 – 11:26
Posted in Cybersecurity, Gadgets, iot, Physical Security, Security Awareness
Tagged post
Comments (3)

I listen to @NPR throughout the day (on most days) and a story on Ohmconnect piqued my interest (it aired 5 days prior to this post). The TLDR on Ohmconnect is that it ostensibly helps you save energy by making you more aware of consumption and can be enabled to control various bits of IoT you have in your abode to curtail wanton power usage.

OK. So…?

Such a service requires access to (possibly many) accounts and devices to facilitate said awareness and control. Now, it’s 2017 and there’s this thing called OAuth that makes giving such access quite a bit safer than it was in the “old days” when you pretty much had to give your main username and password out to “connect” things.

It — apparently — is not 2017 wherever Ohmconnect developers reside since they ask for your credentials to every service and integration you want enabled. Don’t believe me? Take a look:

That’s just from (mostly) the non-thermostat integrations. They ask for your credentials for all services. That’s insane.

I can understand that they may need power company credentials since such industries are usually far behind the curve when it comes to internet-enablement. That doesn’t mean it’s A Good Thing to provide said credentials, but it’s a necessary evil when a service provider has no support for OAuth and you really want to use some integration to their portal.

Virtually all of the possible Ohmconnect-supported service integrations have OAuth support. Here’s a list of the ones that do/dont:

OAuth Support:

Ecobee
Honeywell
Insteon
Nest
Philips Hue
Smartthings
Venstar : has commercial support for OAuth but not easily linkable
Wink
Wiser Air

Appears to have no OAuth Support:

Lennox
Lutron
Radio Thermostat (Filtrete)
Revolv
WeMo

NOTE: The ones labeled as having no OAuth support may have either commercial OAuth support or hidden OAuth support. I’ll gladly modify the post if you leave a comment with official documentation showing they have OAuth support.

On the plus side, Ohmconnect developers now have some links they can follow to learn about OAuth and fix their woefully insecure service.

Why Are Credentials Bad?

Ohmconnect has to store your credentials for other services either in the clear or in some way that’s easy for them to reverse/decode. That means when criminals breach their servers (yes, when) they’ll get access to all the credentials you’ve entered on all those sites. Even if you’re one of the few who don’t use the same password everywhere and manage credentials in an app like @1Password it’s still both a pain to change them and you’ll be at risk during whatever the time-period is between breach and detection (which can be a very long time).

In the highly unlikely event they are doing the OAuth in the background for you (a complete violation of OAuth principles) they still take and process (and, likely store) your credentials for that transaction.

Either way, the request for and use of credentials is either (at best) a naive attempt at simplifying the user experience or (at worse) a brazen disregard for accepted norms for modern user-service integration for non-obvious reasons.

NOTE: I say “when” above as this would be a lovely target of choice for thieves given the types of data it can collect and the demographic that’s likely to use it.

What Can You Do?

Well, if you’re a current Ohmconnect you can cancel your account and change all the credentials for the services you connected. Yes, I’m being serious. If you really like their service, contact customer support and provide the above links and demand that they use OAuth instead.

You should absolutely not connect the devices/services that are on the “Appears to have no OAuth Support” list above to any third-party service if that service needs your credentials to make the connection. There’s no excuse for a cloud-based service to not support OAuth and there are plenty of choices for home/device control. Pick another brand and use it instead.

If you aren’t an Ohmconnect user, I would not sign up until they support OAuth. By defaulting to the “easy” use of username & password they are showing they really don’t take your security & privacy seriously and that means they really don’t deserve your business.

FIN

It is my firm belief that @NPR should either remove the story or issue guidance along with it in text and in audio form. They showcased this company and have all but directly encouraged listeners to use it. Such recommendations should come after much more research, especially security-focused research (they can ask for help with that from many orgs that will give them good advice for free).

In case you’re wondering, I did poke them about this on Twitter immediately after the NPR story and my initial signup attempt but they ignored said poke.

I’m also not providing any links to them given their lax security practices.

Making a Case for case_when

This is a brief (and likely obvious, for some folks) post on the dplyr::case_when() function.

Part of my work-work is dealing with data from internet scans. When we’re performing a deeper inspection of a particular internet protocol or service we try to capture as much system and service metadata as possible. Sifting through said metadata to find individual and collective insight is often a painful task given the diversity in the internet ecosystem.

One attribute we try to collect in all our service scans is operating system (OS) version. For many of our minutiae-focused researchers, it’s vital to know if a host is using “CoreOS 899.17.0” vs “CoreOS 835.9.0”. For much of the aggregation and clustering work we do, “CoreOS” is just fine.

In broad scans for any given service the OS diversity can be YUGE. There may be upwards of 10 different variations each of Windows, Red Hat, Ubuntu, Debian, et. al. present along with a smattering of very highly infrequent OS-types such as “Scientific Linux”. Plus, we can always count on probes returning many NA values for many discrete attribute queries, including OS type+version.

There are many ways to reduce a diverse list of OS type+version strings to a reduced target set. switch() and ifelse() are likely go-to solutions for many of you reading this. If you are in those camps and haven’t tried dplyr::case_when() read on!

Noise Reduction

To illustrate the utility of case_when(), let’s walk through an example. I created a tiny excerpt of just the OS type + version info from 500 observations out of a much larger internet scan. You can find that data at https://rud.is/dl/os.txt. Let’s take a look at the OS diversity:

library(ggalt)
library(hrbrthemes)
library(tidyverse)

os <- read_lines("https://rud.is/dl/os.txt", na = "NA")

str(table(os, useNA = "always"))
##  'table' int [1:28(1d)] 2 3 1 1 1 44 3 101 1 6 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ os: chr [1:28] "" "<unknown>" "Amazon Linux AMI 2016.03" "Amazon Linux AMI 2016.09" ...

sort(unique(os))
##  [1] ""                                           
##  [2] "<unknown>"                                  
##  [3] "Amazon Linux AMI 2016.03"                   
##  [4] "Amazon Linux AMI 2016.09"                   
##  [5] "Arch Linux"                                 
##  [6] "CentOS Linux 7 (Core)"                      
##  [7] "CoreOS 766.4.0"                             
##  [8] "CoreOS 899.17.0"                            
##  [9] "Debian GNU/Linux 7 (wheezy)"                
## [10] "Debian GNU/Linux 8 (jessie)"                
## [11] "Fedora 20 (Heisenbug)"                      
## [12] "linux"                                      
## [13] "openSUSE Leap 42.2"                         
## [14] "RancherOS v0.7.0"                           
## [15] "Red Hat Enterprise Linux Server 7.2 (Maipo)"
## [16] "Red Hat Enterprise Linux Server 7.3 (Maipo)"
## [17] "Ubuntu 14.04.1 LTS"                         
## [18] "Ubuntu 14.04.2 LTS"                         
## [19] "Ubuntu 14.04.3 LTS"                         
## [20] "Ubuntu 14.04.4 LTS"                         
## [21] "Ubuntu 14.04.5 LTS"                         
## [22] "Ubuntu 15.10"                               
## [23] "Ubuntu 16.04.1 LTS"                         
## [24] "Ubuntu 16.04.2 LTS"                         
## [25] "Ubuntu 16.10"                               
## [26] "Windows Server 2016 Datacenter"             
## [27] "Windows Server 2016 Standard"

There are 29 (including NA) different strings in just a tiny excerpt. Ugh.

If we want to group all Windows results as “Windows”, all Red Hat, CentOS and Fedora results as “Fedora”, all Ubuntu and Debian results as “Debian” and all CoreOS and Amazon results as “Amazon” while keeping NA_s_ NA and lumping everything else as “Other” it’s super-easy with case_when():

ELSE <- TRUE

case_when(
  grepl("Windows", os) ~ "Windows-ish",
  grepl("Red Hat|CentOS|Fedora", os) ~ "Fedora-ish",
  grepl("Ubuntu|Debian", os) ~ "Debian-ish",
  grepl("CoreOS|Amazon", os) ~ "Amazon-ish",
  is.na(os) ~ "Unknown",
  ELSE ~ "Other"
) %>%
  table() %>%
  as_data_frame() %>%
  set_names(c("os", "Node Count")) %>%
  arrange(`Node Count`) %>%
  mutate(os = factor(os, os)) %>%
  ggplot(aes(`Node Count`, os)) +
  geom_lollipop(horizontal = TRUE, size=1.5, color="#54278f") +
  scale_x_comma(limits=c(0,300)) +
  labs(y=NULL, title="OS Types") +
  theme_ipsum_rc(grid="X")

The clever formula (~) syntax used by case_when() enables you to cleanly and effortlessly reduce factor/categorical levels and also lets you preserve NA values (which I translated to “Unknown”). Since ELSE is used in the SQL CASE statement and dplyr::case_when() is a riff of said SQL cousin, I like to use an assigned ELSE to make it more visually explicit, but using TRUE is just as good (and, perhaps, better since TRUE can’t get namespace clobbered like the ELSE variable can).

FIN

If you’re in sequential or nested ifelse() Hades or are frustrated by switch() limitations, give dplyr::case_when() a try for your next project.

Epilogue

Not enough time earlier to add other methods, so this hint from @drob will have to suffice for now:

@hrbrmstr alternative: fuzzyjoin's regex_left_join? Advantages:

1. regexes can be in config file
2. one string can match multiple regexes pic.twitter.com/9IavUTEDMJ

— David Robinson (@drob) March 10, 2017