

@eddelbuettel’s idea is a good one. (it’s a quick read…jump there and come back). We’ll avoid confusion and call it R⁶ over here. Feel free to don the superclass.

I often wait for a complete example or new package announcement to blog something when a briefly explained snippet might have sufficient utility for many R users. Also, tweets are fleeting and twitter could end up on the island of misfit social media sites if it can’t generate revenue or find a giant buyer this year. Don’t get me wrong, twitter convos are fine/useful, but blogs are at least semi-permanent, especially if you let them be hoovered up by the Internet Archive (“Save Page Now” on their site or use this handy Chrome extension).

I’ll tag all R⁶ posts as “r6” if you want to auto-filter those out of your stream or just page through them.

I’ll also lead off these micro-posts with a simple one: adding progress bars to your tidyverse purrr operations.

The purrr::map* functions enable expressive and type-safe vectorized operations. Mine are usually over a few million/billion IPv4 addresses or domain names/URLs and often involve moderately lengthy tasks so I usually add the ability to incorporate progress bars to functions I make (and, I’m trying hard to get out of the bad habit of long-ish anonymous functions in purrr calls). The following is a toy example, but it’s a working example you can run in your interactive R session now:

library(tidyverse)

arduously_long_nchar <- function(input_var, .pb=NULL) {
  
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  
  Sys.sleep(1)
  
  nchar(input_var)
  
}

pb <- progress_estimated(length(letters))

map_int(letters, arduously_long_nchar, .pb=pb)

And, yes, I did make you wait ~26 seconds (unless you were intrepid enough to reduce the amount of sleep time :-)

If you happen to forget the progress bar object (or know you don’t need one):

map_int(letters, arduously_long_nchar)

the function still works (sans progress bars).

If you happen to also mess up what you pass in to the .pb parameter or get your progress bar out of sync with your object it won’t error out on you (it can be made much safer and wrapped in another function, say — tick_off(.pb) — but this is supposed to be a small post).
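For the curious, a hypothetical tick_off() (it’s not a real function, just a sketch of that wrapper idea) might look something like this:

tick_off <- function(.pb = NULL) {
  # only tick when handed a legitimate, not-yet-finished dplyr "Progress" object;
  # anything else (NULL, wrong class, already-exhausted bar) is silently ignored
  if (!is.null(.pb) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) {
    .pb$tick()$print()
  }
  invisible(NULL)
}

With that in place, arduously_long_nchar() could just call tick_off(.pb) and drop the validation logic entirely.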

Comments/feedback/your-own-progress-methods are most welcome and encouraged.

A story about one of the retail chains (J.C. Penney) releasing their list of stores closing in 2017 crossed paths with my Feedly reading list today and jogged my memory that there were a number of chains closing many of their doors this year, and I wanted to see the impact that might have on various states.

I’m also doing this to add one more example of:

  • scraping (with content caching)
  • data cleaning
  • per-capita normalization
  • comparing salient information to other indicators

to the growing list of great examples out there by the extended R community. Plus, I feel compelled to try to keep up with @ma_salmon’s blogging pace.

Let’s jump right in…

library(httr)
library(rvest)
library(knitr)
library(kableExtra)
library(ggalt)
library(statebins)
library(hrbrthemes)
library(tidyverse)

options(knitr.table.format = "html")
update_geom_font_defaults(font_rc_light, size = 2.75)

“Closing” lists of four major retailers — K-Mart, Sears, Macy’s and J.C. Penney — abound (HTML formatting a list seems to be the “easy way out” story-wise for many blogs and newspapers). We can dig a bit deeper than just a plain set of lists, but first we need the data.

The Boston Globe has a nice, predictable, mostly-uniform pattern to their list-closing “stories”, so we’ll use that data. Site content can change quickly, so it makes sense to try to cache content whenever possible as we scrape it. To that end, we’ll use httr::GET vs xml2::read_html since GET preserves all of the original request and response information and read_html returns an external pointer that has no current support for serialization without extra work.
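Here’s a quick, hedged illustration of that external-pointer problem (the exact error text may vary by xml2 version, but the reloaded object is effectively a dead handle):

doc <- xml2::read_html("https://example.com/")
readr::write_rds(doc, "doc.rds")    # saving "works"...
doc2 <- readr::read_rds("doc.rds")
rvest::html_nodes(doc2, "title")    # ...but using the reloaded doc typically fails
                                    # with an "external pointer is not valid" type error

With that in mind, on to the cached scrape: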

closings <- list(
  kmart = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-kmart-stores-closing-around/4kJ0YVofUWHy5QJXuPBAuM/story.html",
  sears = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-sears-stores-closing-around/yHaP6nV2C4gYw7KLhuWuFN/story.html",
  macys = "https://www.bostonglobe.com/metro/2017/01/05/the-full-list-macy-stores-closing-around/6TY8a3vy7yneKV1nYcwY7K/story.html",
    jcp = "https://www.bostonglobe.com/business/2017/03/17/the-full-list-penney-stores-closing-around/vhoHjI3k75k2pSuQt2mZpO/story.html"
)

saved_pgs <- "saved_store_urls.rds"

if (file.exists(saved_pgs)) {
  pgs <- read_rds(saved_pgs)
} else {
  pgs <- map(closings, GET)
  write_rds(pgs, saved_pgs)
}

This is what we get from that scraping effort:

map(pgs, content) %>%
  map(html_table) %>%
  walk(~glimpse(.[[1]]))
## Observations: 108
## Variables: 3
## $ X1 <chr> "300 Highway 78 E", "2003 US Hwy 280 Bypass", "3600 Wilson ...
## $ X2 <chr> "Jasper", "Phenix City", "Bakersfield", "Coalinga", "Kingsb...
## $ X3 <chr> "AL", "AL", "CA", "CA", "CA", "CA", "CO", "CO", "CT", "FL",...
## Observations: 42
## Variables: 4
## $ X1 <chr> "301 Cox Creek Pkwy", "1901 S Caraway Road", "90 Elm St; En...
## $ X2 <chr> "Florence", "Jonesboro", "Enfield", "Lake Wales", "Albany",...
## $ X3 <chr> "AL", "AR", "CT", "FL", "GA", "GA", "IN", "KS", "KY", "LA",...
## $ X4 <chr> "Y", "N", "N", "Y", "Y", "N", "N", "N", "Y", "Y", "Y", "Y",...
## Observations: 68
## Variables: 6
## $ X1 <chr> "Mission Valley Apparel", "Paseo Nuevo", "*Laurel Plaza", "...
## $ X2 <chr> "San Diego", "Santa Barbara", "North Hollywood", "Simi Vall...
## $ X3 <chr> "CA", "CA", "CA", "CA", "FL", "FL", "FL", "FL", "FL", "GA",...
## $ X4 <int> 385000, 141000, 475000, 190000, 101000, 195000, 143000, 140...
## $ X5 <int> 1961, 1990, 1995, 2006, 1995, 2000, 1977, 1974, 2000, 1981,...
## $ X6 <int> 140, 77, 105, 105, 68, 83, 86, 73, 72, 69, 9, 57, 54, 87, 5...
## Observations: 138
## Variables: 3
## $ Mall/Shopping Center <chr> "Auburn Mall", "Tannehill Promenade", "Ga...
## $ City                 <chr> "Auburn", "Bessemer", "Gadsden", "Jasper"...
## $ State                <chr> "AL", "AL", "AL", "AL", "AR", "AR", "AZ",...

We now need to normalize the content of the lists.

map(pgs, content) %>%
  map(html_table) %>%
  map(~.[[1]]) %>%
  map_df(select, abb=3, .id = "store") -> closings

We’re ultimately just looking for city/state for this simple exercise, but one could do more precise geolocation (perhaps with rgeocodio) and combine that with local population data, job loss estimates, current unemployment levels, etc. to make a real story out of the closings vs just do the easy thing and publish a list of stores.

count(closings, abb) %>%
  left_join(data_frame(name = state.name, abb = state.abb)) %>%
  left_join(usmap::statepop, by = c("abb"="abbr")) %>%
  mutate(per_capita = (n/pop_2015) * 1000000) %>%
  select(name, n, per_capita) -> closings_by_state

(NOTE: you can get the code for the entire Rmd via RPubs or GitHub)

Compared to unemployment/underutilization

I’d have used the epidata package to get the current state unemployment data but it’s not quite current enough, so we can either use a package to get data from the Bureau of Labor Statistics or just scrape it. A quick coin-flip says we’ll scrape the data.

We’ll use the U-6 rate since that is an expanded definition of “underutilization” including “total unemployed, plus all marginally attached workers, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all marginally attached workers” and is likely to be more representative of the populations working at retail chains. I could be wrong. I only play an economist on TV. If you are an economist, please drop a note telling me where I erred in my thinking.

pg <- read_html("https://www.bls.gov/lau/stalt16q4.htm")

html_nodes(pg, "table#alternmeas16\\:IV") %>% 
  html_table(header = TRUE, fill = TRUE) %>%
  .[[1]] %>% 
  docxtractr::assign_colnames(1) %>% 
  rename(name=State) %>% 
  as_data_frame() %>% 
  slice(2:52) %>% 
  type_convert() %>% 
  left_join(closings_by_state, by="name") %>% 
  filter(!is.na(n)) -> with_unemp

ggplot(with_unemp, aes(per_capita, `U-6`)) +
  geom_label(aes(label=name), fill="#8c96c6", color="white", size=3.5, family=font_rc) +
  scale_x_continuous(limits=c(-0.125, 6.75)) +
  labs(x="Closings per-capita (1MM)", 
       y="BLS Labor Underutilization (U-6 rate)",
       title="Per-capita store closings compared to current BLS U-6 Rate") +
  theme_ipsum_rc(grid="XY")

If I were a reporter (again, I only play one on TV), I think I’d be digging a bit deeper on the impact of these (and the half-dozen or so other) retailers closing locations in New Mexico, Nevada, West Virginia, Wyoming, (mebbe Maine, though I’m super-biased :-), North Dakota & South Dakota. I also hope @Marketplace does a few more stories on the changing retail landscape in the U.S. over the coming months to see if there are any overt consequences to the loss of jobs and anchor stores.

If you end up tinkering with the data, drop a note in the comments if something you discover piques your interest. For those interested in potentially marrying the data up with some additional cartography, there should be enough precision in the store lists to get distinct enough lat/lon pairs after geocoding (I did a quick test with rgeocodio) to make some interesting map views, especially if you can find more store closing lists.

I listen to @NPR throughout the day (on most days) and a story on Ohmconnect piqued my interest (it aired 5 days prior to this post). The TLDR on Ohmconnect is that it ostensibly helps you save energy by making you more aware of consumption and can be enabled to control various bits of IoT you have in your abode to curtail wanton power usage.

OK. So…?

Such a service requires access to (possibly many) accounts and devices to facilitate said awareness and control. Now, it’s 2017 and there’s this thing called OAuth that makes giving such access quite a bit safer than it was in the “old days” when you pretty much had to give your main username and password out to “connect” things.

It — apparently — is not 2017 wherever Ohmconnect developers reside since they ask for your credentials to every service and integration you want enabled. Don’t believe me? Take a look:

That’s just from (mostly) the non-thermostat integrations. They ask for your credentials for all services. That’s insane.

I can understand that they may need power company credentials since such industries are usually far behind the curve when it comes to internet-enablement. That doesn’t mean it’s A Good Thing to provide said credentials, but it’s a necessary evil when a service provider has no support for OAuth and you really want to use some integration to their portal.

Virtually all of the possible Ohmconnect-supported service integrations have OAuth support. Here’s a list of the ones that do/don’t:

OAuth Support:

Appears to have no OAuth Support:

  • Lennox
  • Lutron
  • Radio Thermostat (Filtrete)
  • Revolv
  • WeMo

NOTE: The ones labeled as having no OAuth support may have either commercial OAuth support or hidden OAuth support. I’ll gladly modify the post if you leave a comment with official documentation showing they have OAuth support.

On the plus side, Ohmconnect developers now have some links they can follow to learn about OAuth and fix their woefully insecure service.

Why Are Credentials Bad?

Ohmconnect has to store your credentials for other services either in the clear or in some way that’s easy for them to reverse/decode. That means when criminals breach their servers (yes, when) they’ll get access to all the credentials you’ve entered on all those sites. Even if you’re one of the few who don’t use the same password everywhere and manage credentials in an app like @1Password it’s still both a pain to change them and you’ll be at risk during whatever the time-period is between breach and detection (which can be a very long time).

In the highly unlikely event they are doing the OAuth in the background for you (a complete violation of OAuth principles) they still take and process (and, likely store) your credentials for that transaction.

Either way, the request for and use of credentials is either (at best) a naive attempt at simplifying the user experience or (at worst) a brazen disregard for accepted norms for modern user-service integration for non-obvious reasons.

NOTE: I say “when” above as this would be a lovely target of choice for thieves given the types of data it can collect and the demographic that’s likely to use it.

What Can You Do?

Well, if you’re a current Ohmconnect user you can cancel your account and change all the credentials for the services you connected. Yes, I’m being serious. If you really like their service, contact customer support and provide the above links and demand that they use OAuth instead.

You should absolutely not connect the devices/services that are on the “Appears to have no OAuth Support” list above to any third-party service if that service needs your credentials to make the connection. There’s no excuse for a cloud-based service to not support OAuth and there are plenty of choices for home/device control. Pick another brand and use it instead.

If you aren’t an Ohmconnect user, I would not sign up until they support OAuth. By defaulting to the “easy” use of username & password they are showing they really don’t take your security & privacy seriously and that means they really don’t deserve your business.

FIN

It is my firm belief that @NPR should either remove the story or issue guidance along with it in text and in audio form. They showcased this company and have all but directly encouraged listeners to use it. Such recommendations should come after much more research, especially security-focused research (they can ask for help with that from many orgs that will give them good advice for free).

In case you’re wondering, I did poke them about this on Twitter immediately after the NPR story and my initial signup attempt but they ignored said poke.

I’m also not providing any links to them given their lax security practices.

This is a brief (and likely obvious, for some folks) post on the dplyr::case_when() function.

Part of my work-work is dealing with data from internet scans. When we’re performing a deeper inspection of a particular internet protocol or service we try to capture as much system and service metadata as possible. Sifting through said metadata to find individual and collective insight is often a painful task given the diversity in the internet ecosystem.

One attribute we try to collect in all our service scans is operating system (OS) version. For many of our minutiae-focused researchers, it’s vital to know if a host is using “CoreOS 899.17.0” vs “CoreOS 835.9.0”. For much of the aggregation and clustering work we do, “CoreOS” is just fine.

In broad scans for any given service the OS diversity can be YUGE. There may be upwards of 10 different variations each of Windows, Red Hat, Ubuntu, Debian, et al. present along with a smattering of highly infrequent OS types such as “Scientific Linux”. Plus, we can always count on probes returning many NA values for many discrete attribute queries, including OS type+version.

There are many ways to reduce a diverse list of OS type+version strings to a reduced target set. switch() and ifelse() are likely go-to solutions for many of you reading this. If you are in those camps and haven’t tried dplyr::case_when() read on!

Noise Reduction

To illustrate the utility of case_when(), let’s walk through an example. I created a tiny excerpt of just the OS type + version info from 500 observations out of a much larger internet scan. You can find that data at https://rud.is/dl/os.txt. Let’s take a look at the OS diversity:

library(ggalt)
library(hrbrthemes)
library(tidyverse)

os <- read_lines("https://rud.is/dl/os.txt", na = "NA")

str(table(os, useNA = "always"))
##  'table' int [1:28(1d)] 2 3 1 1 1 44 3 101 1 6 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ os: chr [1:28] "" "<unknown>" "Amazon Linux AMI 2016.03" "Amazon Linux AMI 2016.09" ...

sort(unique(os))
##  [1] ""                                           
##  [2] "<unknown>"                                  
##  [3] "Amazon Linux AMI 2016.03"                   
##  [4] "Amazon Linux AMI 2016.09"                   
##  [5] "Arch Linux"                                 
##  [6] "CentOS Linux 7 (Core)"                      
##  [7] "CoreOS 766.4.0"                             
##  [8] "CoreOS 899.17.0"                            
##  [9] "Debian GNU/Linux 7 (wheezy)"                
## [10] "Debian GNU/Linux 8 (jessie)"                
## [11] "Fedora 20 (Heisenbug)"                      
## [12] "linux"                                      
## [13] "openSUSE Leap 42.2"                         
## [14] "RancherOS v0.7.0"                           
## [15] "Red Hat Enterprise Linux Server 7.2 (Maipo)"
## [16] "Red Hat Enterprise Linux Server 7.3 (Maipo)"
## [17] "Ubuntu 14.04.1 LTS"                         
## [18] "Ubuntu 14.04.2 LTS"                         
## [19] "Ubuntu 14.04.3 LTS"                         
## [20] "Ubuntu 14.04.4 LTS"                         
## [21] "Ubuntu 14.04.5 LTS"                         
## [22] "Ubuntu 15.10"                               
## [23] "Ubuntu 16.04.1 LTS"                         
## [24] "Ubuntu 16.04.2 LTS"                         
## [25] "Ubuntu 16.10"                               
## [26] "Windows Server 2016 Datacenter"             
## [27] "Windows Server 2016 Standard"

There are 28 (including NA) different strings in just a tiny excerpt. Ugh.

If we want to group all Windows results as “Windows”, all Red Hat, CentOS and Fedora results as “Fedora”, all Ubuntu and Debian results as “Debian” and all CoreOS and Amazon results as “Amazon” while keeping NAs as NA and lumping everything else as “Other” it’s super-easy with case_when():

ELSE <- TRUE

case_when(
  grepl("Windows", os) ~ "Windows-ish",
  grepl("Red Hat|CentOS|Fedora", os) ~ "Fedora-ish",
  grepl("Ubuntu|Debian", os) ~ "Debian-ish",
  grepl("CoreOS|Amazon", os) ~ "Amazon-ish",
  is.na(os) ~ "Unknown",
  ELSE ~ "Other"
) %>%
  table() %>%
  as_data_frame() %>%
  set_names(c("os", "Node Count")) %>%
  arrange(`Node Count`) %>%
  mutate(os = factor(os, os)) %>%
  ggplot(aes(`Node Count`, os)) +
  geom_lollipop(horizontal = TRUE, size=1.5, color="#54278f") +
  scale_x_comma(limits=c(0,300)) +
  labs(y=NULL, title="OS Types") +
  theme_ipsum_rc(grid="X")

The clever formula (~) syntax used by case_when() enables you to cleanly and effortlessly reduce factor/categorical levels and also lets you preserve NA values (which I translated to “Unknown”). Since ELSE is used in the SQL CASE statement and dplyr::case_when() is a riff of said SQL cousin, I like to use an assigned ELSE to make it more visually explicit, but using TRUE is just as good (and, perhaps, better since TRUE can’t get namespace clobbered like the ELSE variable can).

FIN

If you’re in sequential or nested ifelse() Hades or are frustrated by switch() limitations, give dplyr::case_when() a try for your next project.
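For contrast, here’s roughly what that same OS mapping looks like as nested ifelse() calls (a sketch reusing the grepl() patterns from above). It works, but it reads far worse and only gets uglier as categories grow:

os_grp <- ifelse(grepl("Windows", os), "Windows-ish",
          ifelse(grepl("Red Hat|CentOS|Fedora", os), "Fedora-ish",
          ifelse(grepl("Ubuntu|Debian", os), "Debian-ish",
          ifelse(grepl("CoreOS|Amazon", os), "Amazon-ish",
          ifelse(is.na(os), "Unknown", "Other")))))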

Epilogue

Not enough time earlier to add other methods, so this hint from @drob will have to suffice for now:

@mrshrbrmstr hinted that she would like this post by @RickWicklin translated into R for her stats class. She’s quite capable of cranking out the translation of the core component of that post — a call to chisq.test — but she wanted to show the entire post (in R) and really didn’t have time (she’s teaching a full load of classes and is department chair + a mom). I suggested that I, too, was a bit short on time which resulted in her putting out a call to the twitterverse for assistance which ultimately ended up coercing me into tackling the problem.

I won’t re-create Rick’s post or my riff of it here since you can check out the RPubs page for it and also get the source (you can get the source from the Rmd, too, but some folks like gists better).

So, why a blog post if not to present the translation?

Two reasons: I needed tidy Goodman simultaneous confidence intervals (SCIs) and Rick’s final plot was just begging to have “real” M&M’s as the point “geom”.

S[c]imple & Tidy SCIs

We’ve got options for calculating simultaneous CIs in R and I could have just used DescTools::MultinomCI except that I wanted a tibble and it returns a matrix, plus it only has three of the more common methods implemented (yes, I am the ultimate package snob). I recalled that the CoinMinD package was tailor made for working with SCIs and has many more methods implemented, but the output is just that: output print()ed to the console.

Yes, I shouted in disbelief at the glowing rectangle in front of me when I noticed that almost as loudly as you did when you read that sentence.
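For reference, here’s roughly the matrix-to-tibble juggling the DescTools route entails (a sketch with made-up counts, not Rick’s data):

library(DescTools)
library(tidyverse)

# fabricated color counts, just to show the shape of things
counts <- c(red = 61, orange = 108, yellow = 72, green = 85, blue = 118, brown = 70)

MultinomCI(counts, conf.level = 0.95, method = "goodman") %>% # returns a matrix
  as_tibble() %>%                                             # CI columns carry over
  mutate(color = names(counts)) %>%
  select(color, everything())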

The algorithms implemented in CoinMinD are just dandy and the package is coming up on its 4th birthday. So, as a present from it (via me) to the R community, I whipped together scimple which generates tidy tibbles and has a function scimple_ci() which is similar to binom::binom.confint() in that it will generate the SCIs for all the available (non-Bayesian) methods, including Goodman.

Kick the tyres (pls!) and drop issues and/or PRs as you see fit.

You can’t plot just one

Rick’s post analyzes distributions of M&M’s so I went to the official M&M’s site to grab the official colors for the ones in his data set. I casually went about making the rest of the post with standard points with a superimposed white “m” when it dawned on me that those lentils (yes, it seems the candies are called lentils, or at least their icons are) were all over the M&M’s site. After some site spelunking with Chrome Developer Tools I had the URLs for the candies in question and managed to use the nascent ggimage::geom_image() to place them on the plot:

The plot is a bit sparse as you have to get the aspect ratio just right to keep those tasty, tiny circles as circles.
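If you want to try the same trick, a minimal ggimage::geom_image() sketch looks something like this (the data and image path are placeholders, not the actual lentil URLs or Rick’s estimates):

library(ggplot2)
library(ggimage)

mm <- data.frame(
  color = c("red", "green", "blue"),
  est   = c(0.18, 0.21, 0.24),
  img   = "path/to/lentil.png"  # placeholder; swap in the real candy image URLs
)

ggplot(mm, aes(x = est, y = color)) +
  geom_image(aes(image = img), size = 0.08) + # each point is rendered as the image
  labs(x = "Estimated proportion", y = NULL)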

The new geom_image() opens up many new possibilities for R visualizations (and not all are good possibilities). I think @mrshrbrmstr’s students got a kick out of a stats-y plot having real M&M’s on it so it worked OK this time. Just be wary of using gratuitous imagery and overdoing your watermarking.

As stated earlier you can get the code and see how you can improve upon Rick’s original post and my attempt at a quick riff. If you do end up cranking something out, drop a comment here or a tweet (@hrbrmstr) to show off your creation(s)!

I’m pleased to announce the inaugural release of my hrbrthemes (0.1.0) package on CRAN.

The primary goal of said package is to provide opinionated typographical and other aesthetic defaults for ggplot2 charts.

Two core themes are included: theme_ipsum() (based on Arial Narrow) and theme_ipsum_rc() (based on Roboto Condensed).

The Roboto Condensed Google Font comes with the package along with an installer for said font (it’s an R installer, you still need to install it on your OS manually).

Other niceties include (a quick usage sketch follows this list):

  • scale_[xy]_comma() — shortcut for expand=c(0,0), labels=scales::comma
  • scale_[xy]_percent() — shortcut for expand=c(0,0), labels=scales::percent
  • scale_[color|fill]_ipsum() — discrete scale with 9 colors
  • gg_check() — pass-through spell checker for ggplot2 label elements
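Here’s a minimal sketch pulling a few of those pieces together (mtcars stands in for real data; gg_check() flags questionable spellings in the label text and passes the plot through):

library(ggplot2)
library(hrbrthemes)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point(color = "#54278f") +
  scale_y_comma() +                         # no expansion + comma-formatted labels
  labs(x = "Weight (1,000 lbs)", y = "Miles per gallon",
       title = "A quick hrbrthemes demo") +
  theme_ipsum_rc(grid = "XY") -> gg

gg_check(gg)  # spell-checks the labels, then returns the plot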

Source version is tracked on GitHub.

Critiques, bug reports and enhancement requests are most welcome as GitHub issues.

I’ve been threatening to do a series on “data science community security” for a while and had cause to issue this inaugural post today. It all started with this:

Let me begin with the following: @henrikbengtsson is an awesome member of the community. He makes great things and I trust his code and intentions. This post is not about him, it’s about raising awareness regarding security in the data science community.

I can totally see why folks would like Henrik’s tool. Package dependency management — including installing packages — is not the most beloved of R tasks, especially for new R users or those who prefer performing their respective science or statistical work vs delve deep into the innards of R. The suggestion to use:

source('http://callr.org/install#knitr')

no doubt came from a realization of how cumbersome it can be to deal with said dependency management. You can even ostensibly see what the script does since Henrik provides a link to it right on the page.

So, why the call to not use it?

For starters, if you do want to use this approach, grab his script and make a local copy of it. Read it. Try to grok what it does. Then, use it locally. It will likely be a time-/effort-saver for many R users.

My call was to not source it from the internet.

Why? To answer that I need to talk about trust.

hrbrmstr’s Hierarchy of Package Trust

When you install a package on your system you’re bringing someone else’s code into your personal work space. When you try to use said code with a library() call, R has a few mechanisms to run code on package startup. So, when you just install and load a package you’re executing real code in the context of your local user. Some of that code may be interpreted R code. Some may be calling compiled code. Some of it may be trying to execute binaries (apps) that are already on your system.

Stop and think about that for a second.
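To make that concrete, here’s a minimal, hypothetical example of the kind of hook a package can ship in its R/zzz.R; R runs it automatically when the package is attached, and nothing constrains what it does:

# hypothetical package code, shown only to illustrate the mechanism
.onAttach <- function(libname, pkgname) {
  packageStartupMessage("Thanks for loading me!")
  # nothing stops this hook from also calling system(), download.file(), etc.
}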

If you saw a USB stick outside your office with a label “Cool/Useful R Package” would you insert it into your system and install the package? (Please tell me you answered “No!” :-)

With that in mind, I have a personal “HieraRchy of Package Trust” that I try to stick by:

Tier 1

This should be a pretty obvious one, but if it’s my own code/server or my org’s code/server there’s inherent trust.

Tier 2

When you type install.packages() and rely on a known CRAN mirror, MRAN server or Bioconductor download using https you’re getting quite a bit in the exchange.

CRAN GuaRdians at least took some time to review the package. They won’t catch every potentially malicious bit and the efficacy of evaluating statistical outcomes is also left to the package user. What you’re getting — at least from the main cran.r-project.org repo and RStudio’s repos — are reviewed packages served from decently secured systems run by organizations with good intentions. Your trust in other mirror servers is up to you but there are no guarantees of security on them. I’ve evaluated the main CRAN and RStudio setups (remotely) and am comfortable with them but I also use my own personal, internal CRAN mirror for many reasons, including security.
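One small, concrete way to act on this tier is to pin your sessions (via .Rprofile or Rprofile.site) to a known, https-served mirror rather than whatever default happens to be configured:

# deliberately choose an https CRAN mirror instead of relying on a default
options(repos = c(CRAN = "https://cran.rstudio.com/"))

install.packages("knitr")  # now fetched over TLS from a mirror you picked on purpose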

Revolution-cum-Microsoft MRAN is also pretty trustworthy, especially since Microsoft has quite a bit to lose if there are security issues there.

Bioconductor also has solid package management practices in place, but I don’t use that ecosystem much (at all, really) so can’t speak too much more about it except that I’m comfortable enough with it to put it with the others at that level.

Tier 3

If I’m part of a known R cabal in private collaboration, I also trust it, but it’s still raw source and I have to scan through code to ensure the efficacy of it, so it’s a bit further down the list.

Tier 4

If I know the contributors to a public source repo, I’ll also consider trusting it, but I will still need to read through the source and doubly-so if there is compiled code involved.

Tiers 5 & 6

If the repo source is a new/out-of-the-blue contributor to the R community or hosted personally, it will be relegated to the “check back later” task list and definitely not installed without a thorough reading of the source.

NOTE

There are caveats to the list above — like CRAN R packages that download pre-compiled Windows libraries from GitHub — that I’ll go into in other posts, along with a demonstration of the perils of trust that I hope doesn’t get Hadley too upset (you’ll see why in said future post).

Also note that there is no place on said hierarchy for the random USB stick of cool/useful R code. #justdontdoit

Watering Holes

The places where folks come together to collaborate have a colloquial security name: a “watering hole”. Attackers use these known places to perform — you guessed it — “watering hole” attacks. They figure out where you go, who/what you trust and use that to do bad things. I personally don’t know of any current source-code attacks, but data scientists are being targeted in other ways by attackers. If attackers sense there is a source code soft-spot it will only be a matter of time before they begin to use that vector for attack. The next section mentions one possible attacker that you’re likely not thinking of as an “attacker”.

This isn’t FUD.

Governments, competitors and criminals know that the keys to the 21st century economy (for a while, anyway) reside in data and with those who can gather, analyze and derive insight from data. Not all of us have to worry about this, but many of us do and you should not dismiss the value of the work you’re doing, especially if you’re not performing open research. Imagine if a tiny bit of data exfiltration code managed to get on your Spark cluster or even your own laptop. This can easily happen with a tampered package (remember the incident a few years ago with usage tracking code in R scripts?).

A Bit More On https

I glossed over the https bit above, but by downloading a package over SSL/TLS you’re ensuring that the bits of code aren’t modified in transit from the server to your system and what you’re downloading is also not shown to prying eyes. That’s important since you really want to be sure you’re getting what you think you’re getting (i.e. no bits are changed) and you may be working in areas your oppressive, authoritarian government doesn’t approve of, such as protecting the environment or tracking global climate change.

The use of https also does show — at least in a limited sense — that the maintainers of the server knew enough to actually setup SSL/TLS and thought — at least for a minute or two — about security. The crazy move to “Let’s Encrypt” everything is a topic for another, non-R post, but you can use that service to get free certificates for your own sites with a pretty easy installation mechanism.

I re-mention SSL/TLS as a segue back to the original topic…

Back to the topic at hand

So, what’s so bad about:

source('http://callr.org/install#knitr')

On preview: nothing. Henrik’s a good dude and you can ostensibly see what that script is doing.

On review: much.

I won’t go into a great deal of detail, but that is a RHEL 5 server with 15 internet services enabled, ranging from FTP to mail to web serving along with two database servers (both older versions) directly exposed to the internet. The default serving mode is http and the SSL certificate it does have is not trusted by any common certificate store.

None of that was found by any super-elite security mechanism.

Point your various clients at those services on that system and you’ll get a response. To put it bluntly, that system is 100% vulnerable to attack. (How to set up a defensible system is a topic for another post.)

In other words, if said mechanism becomes a popular “watering hole” for easy installation of R packages, it’s also a pretty easy target for attackers to take surreptitious control of and then inject whatever they want, along with keeping track of what’s being installed, by whom and from which internet locale.

Plain base::source() does nothing on your end to validate the integrity of that script. It’s like using devtools::source_url() or devtools::source_gist() without the sha1 parameter (which uses a hash to validate the integrity of what you’re sourcing). Plus, it seems you cannot do:

devtools::source_url('http://callr.org/install#knitr', sha1="2c1c7fe56ea5b5127f0e709db9169691cc50544a")

since the httr call that lies beneath appears to be stripping away the #… bits. So, there’s no way to run this remotely with any level of integrity or confidentiality.

TLDR

If you like this script (it’s pretty handy) put it in a local directory and source it from there.
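A sketch of that workflow, with an integrity check thrown in (the sha1 value below is a placeholder; replace it with the hash of the copy you actually reviewed):

library(digest)

download.file("http://callr.org/install#knitr", "install.R")  # grab it once

# after reading/reviewing install.R, pin its hash so future sourcing fails
# loudly if the local copy ever changes out from under you
expected_sha1 <- "0000000000000000000000000000000000000000"   # placeholder
stopifnot(digest("install.R", algo = "sha1", file = TRUE) == expected_sha1)

source("install.R")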

Fin

I can’t promise a frequency for “security in the data science community” posts but will endeavor to crank out a few more before summer. Note also that R is not the only ecosystem with these issues, so Python, Julia, node.js and other communities should not get smug :-)

Our pursuit of open, collaborative work has left us vulnerable to bad-intentioned ne’er-do-wells and it’s important to at least be aware of the vulnerabilities in our processes, workflows and practices. I’m not saying you have to be wary of every devtools::install_github() that you do, but you are now armed with information that might help you think twice about how often you trust calls to do such things.

In the meantime, download @henrikbengtsson’s script, thank him for making a very useful tool and run it locally (provided you’re cool with potentially installing things from non-CRAN repos :-)

The kind folks over at @RStudio gave a nod to my recently CRAN-released epidata package in their January data package roundup and I thought it might be useful to give it one more showcase using the recent CRAN update to ggalt and the new hrbrthemes (github-only for now) packages.

Labor force participation rate

The U.S. labor force participation rate (LFPR) is an oft-overlooked and under- or mis-reported economic indicator. I’ll borrow the definition from Investopedia:

The participation rate is a measure of the active portion of an economy’s labor force. It refers to the number of people who are either employed or are actively looking for work. During an economic recession, many workers often get discouraged and stop looking for employment, resulting in a decrease in the participation rate.

Population age distributions and other factors are necessary to honestly interpret this statistic. Parties in power usually dismiss/ignore this statistic outright and their opponents tend to wholly embrace it for criticism (it’s an easy target if you’re naive). “Yay” partisan democracy.

Since the LFPR has nuances when looked at categorically, let’s take a look at it by attained education level to see how that particular view has changed over time (at least since the gov-quants have been tracking it).

We can easily grab this data with epidata::get_labor_force_participation_rate() (and we’ll set up some library() calls while we’re at it):

library(epidata)
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(ggalt)
library(tidyverse)
library(stringi)

part_rate <- get_labor_force_participation_rate("e")

glimpse(part_rate)
## Observations: 457
## Variables: 7
## $ date              <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-01, 1979-05-01...
## $ all               <dbl> 0.634, 0.634, 0.635, 0.636, 0.636, 0.637, 0.637, 0.637, 0.638, 0.638, 0...
## $ less_than_hs      <dbl> 0.474, 0.475, 0.475, 0.475, 0.475, 0.474, 0.474, 0.473, 0.473, 0.473, 0...
## $ high_school       <dbl> 0.690, 0.691, 0.692, 0.692, 0.693, 0.693, 0.694, 0.694, 0.695, 0.696, 0...
## $ some_college      <dbl> 0.709, 0.710, 0.711, 0.712, 0.712, 0.713, 0.712, 0.712, 0.712, 0.712, 0...
## $ bachelor's_degree <dbl> 0.771, 0.772, 0.772, 0.773, 0.772, 0.772, 0.772, 0.772, 0.772, 0.773, 0...
## $ advanced_degree   <dbl> 0.847, 0.847, 0.848, 0.848, 0.848, 0.848, 0.847, 0.847, 0.848, 0.848, 0...

One of the easiest things to do is to use ggplot2 to make a faceted line chart by attained education level. But, let’s change the labels so they are a bit easier on the eyes in the facets and switch the facet order from alphabetical to something more useful:

gather(part_rate, category, rate, -date) %>% 
  mutate(category=stri_replace_all_fixed(category, "_", " "),
         category=stri_trans_totitle(category),
         category=stri_replace_last_regex(category, "Hs$", "High School"),
         category=factor(category, levels=c("Advanced Degree", "Bachelor's Degree", "Some College", 
                                            "High School", "Less Than High School", "All")))  -> part_rate

Now, we’ll make a simple line chart, tweaking the aesthetics just a bit:

ggplot(part_rate) +
  geom_line(aes(date, rate, group=category)) +
  scale_y_percent(limits=c(0.3, 0.9)) +
  facet_wrap(~category, scales="free") +
  labs(x=paste(format(range(part_rate$date), "%Y-%b"), collapse=" to "), 
       y="Participation rate (%)", 
       title="U.S. Labor Force Participation Rate",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="XY", axis="XY")

The “All” view is interesting in that the LFPR has held fairly “steady” between 60% & 70%. Those individual and fractional percentage points actually translate to real humans, so the “minor” fluctuations do matter.

It’s also interesting to see the direct contrast between the starting historical rate and the current rate (you could also do the same with min/max rates, etc.). We can use a “dumbbell” chart to compare the 1978 value to today’s value, but we’ll need to reshape the data a bit first:

group_by(part_rate, category) %>% 
  arrange(date) %>% 
  slice(c(1, n())) %>% 
  spread(date, rate) %>% 
  ungroup() %>% 
  filter(category != "All") %>% 
  mutate(category=factor(category, levels=rev(levels(category)))) -> rate_range

filter(part_rate, category=="Advanced Degree") %>% 
  arrange(date) %>% 
  slice(c(1, n())) %>% 
  mutate(lab=lubridate::year(date)) -> lab_df

(We’ll be using the extra data frame to add labels to the chart.)

Now, we can compare the various ranges, once again tweaking aesthetics a bit:

ggplot(rate_range) +
  geom_dumbbell(aes(y=category, x=`1978-12-01`, xend=`2016-12-01`),
                size=3, color="#e3e2e1",
                colour_x = "#5b8124", colour_xend = "#bad744",
                dot_guide=TRUE, dot_guide_size=0.25) +
  geom_text(data=lab_df, aes(x=rate, y=5.25, label=lab), vjust=0) +
  scale_x_percent(limits=c(0.375, 0.9)) +
  labs(x=NULL, y=NULL,
       title=sprintf("U.S. Labor Force Participation Rate %s-Present", lab_df$lab[1]),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_ipsum_rc(grid="X")

Fin

One takeaway from both these charts is that it’s probably important to take education level into account when talking about the labor force participation rate. The get_labor_force_participation_rate() function — along with most other functions in the epidata package — also has options to factor the data by sex, race and age, so you can pull in all those views to get a more nuanced & informed understanding of this economic health indicator.