rud.is

2017-01 Authored Package Updates

The rest of the month is going to be super-hectic and it’s unlikely I’ll be able to do any more to help the push to CRAN 10K, so here’s a breakdown of CRAN and GitHub new packages & package updates that I felt were worth raising awareness on:

epidata

I mentioned this one last week but it wasn’t really a package announcement post. epidata is now on CRAN and is a package to pull data from the Economic Policy Institute (U.S. gov economic data, mostly). Their “hidden” API is well thought out and the data has been nicely curated (and seems to update monthly). It makes it super easy to do things like the following:

library(epidata)
library(tidyverse)
library(stringi)
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")

us_unemp <- get_unemployment("e")

glimpse(us_unemp)
## Observations: 456
## Variables: 7
## $ date            <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-0...
## $ all             <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.05...
## $ less_than_hs    <dbl> 0.100, 0.100, 0.099, 0.099, 0.099, 0.099, 0.09...
## $ high_school     <dbl> 0.055, 0.055, 0.054, 0.054, 0.054, 0.053, 0.05...
## $ some_college    <dbl> 0.050, 0.050, 0.050, 0.049, 0.049, 0.049, 0.04...
## $ college         <dbl> 0.032, 0.031, 0.031, 0.030, 0.030, 0.029, 0.03...
## $ advanced_degree <dbl> 0.021, 0.020, 0.020, 0.020, 0.020, 0.020, 0.02...

us_unemp %>%
  gather(level, rate, -date) %>%
  mutate(level=stri_replace_all_fixed(level, "_", " ") %>%
           stri_trans_totitle() %>%
           stri_replace_all_regex(c("Hs$"), c("High School")),
         level=factor(level, levels=unique(level))) -> unemp_by_edu

col <- ggthemes::tableau_color_pal()(10)

ggplot(unemp_by_edu, aes(date, rate, group=level)) +
  geom_line(color=col[1]) +
  scale_y_continuous(labels=scales::percent, limits =c(0, 0.2)) +
  facet_wrap(~level, scales="free") +
  labs(x=NULL, y="Unemployment rate",
       title=sprintf("U.S. Monthly Unemployment Rate by Education Level (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="XY")

us_unemp %>%
  select(date, high_school, college) %>%
  mutate(date_num=as.numeric(date)) %>%
  ggplot(aes(x=high_school, xend=college, y=date_num, yend=date_num)) +
  geom_segment(size=0.125, color=col[1]) +
  scale_x_continuous(expand=c(0,0), label=scales::percent, breaks=seq(0, 0.12, 0.02), limits=c(0, 0.125)) +
  scale_y_reverse(expand=c(0,100), label=function(x) format(as_date(x), "%Y")) +
  labs(x="Unemployment rate", y="Year ↓",
       title=sprintf("U.S. monthly unemployment rate gap (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       subtitle="Segment width shows the gap between those with a high school\ndegree and those with a college degree",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="X") +
  theme(panel.ontop=FALSE) +
  theme(panel.grid.major.x=element_line(size=0.2, color="#2b2b2b25")) +
  theme(axis.title.x=element_text(family="Arial", face="bold")) +
  theme(axis.title.y=element_text(family="Arial", face="bold", angle=0, hjust=1, margin=margin(r=-14)))

(right edge is high school, left edge is college…I’ll annotate it better next time)

censys

Censys is a search engine by one of the cybersecurity research partners we publish data to at work (free for use by all). The API is moderately decent (it’s mostly a thin shim authentication layer to pass on Google BigQuery query strings to the back-end) and the R package to interface to it censys is now on CRAN.

waffle

The seminal square pie chart package waffle has been updated on CRAN to work better with recent ggplot2 2.x changes and has some additional parameters you may want to check out.

cdcfluview

The viral package cdcfluview has had some updates on the GitHub version to add saner behaviour when specifying dates and had to be updated as the CDC hidden API switched to all https URLs (major push in .gov-land to do that to get better scores on their cyber report cards). I’ll be adding some features before the next CRAN push to enable retrieval of additional mortality data.

sergeant

If you work with Apache Drill (if you don’t, you should), the sergeant package (GitHub) will help you whip it into shape. I’ve mentioned it before on the blog but it has a nigh-complete dplyr interface now that works pretty well. It also has a direct REST API interface and RJDBC interface plus many helper utilities that help you avoid typing SQL strings to get cluster status info. Once I add the ability to create parquet files with it I’ll push it up to CRAN.

The one thing I’d like to do with this package is support any user-defined functions (UDFs in Drill-speak) folks have written. So, if you have a UDF you’ve written or use and you want it wrapped in the package, just drop an issue and I’ll layer it in. I’ll be releasing some open source cybersecurity-related UDFs via the work github in a few weeks.

zkcmd

Drill (in non-standalone mode) relies on Apache Zookeeper to keep everything in sync and it’s sometimes necessary to peek at what’s happening inside the zookeeper cluster, so sergeant has a sister package zkcmd that provides an R interface to zookeeper instances.

ggalt

Some helpful folks tweaked ggalt for better ggplot2 2.x compatibility (#ty!) and I added a new geom_cartogram() (before you ask if it makes warped shapefiles: it doesn’t) that restores the old (and what I believe to be the correct/sane/proper) behaviour of geom_map(). I need to get this on CRAN soon as it has both fixes and many new geoms folks will want to play with in a non-GitHub context.

FIN

There have been some awesome packages released by others in the past month+ and you should add R Weekly to your RSS feeds if you aren’t following it already (there are other things you should have there for R updates as well, but that’s for another blog). I’m definitely looking forward to new packages, visualizations, services and utilities that will be coming this year to the R community.

The Most Important Commodity in 2017 is Data

2017-01-04 – 17:41
Posted in Data Analysis, Data Visualization, data wrangling, ggplot, R
Tagged post
Comments (5)

Despite being in cybersecurity nigh forever (a career that quickly turns one into a determined skeptic if you’re doing your job correctly) I have often trusted various (not to be named) news sources, reports and data sources to provide honest and as-unbiased-as-possible information. The debacle in the U.S. in late 2016 has proven (to me) that we’re all on our own when it comes to validating posited truth/fact. I’m teaching two data science college courses this Spring and one motivation for teaching is helping others on the path to data literacy so they don’t have to just believe what’s tossed at them.

It’s also one of the reasons I write R packages and blog/speak (when I can).

Today, I saw a chart in a WSJ article on the impending minimum wage changes (paywalled, so do the Google hack to read it) set to take effect in 19 states.

It’s a great chart and they cite the data source as coming from the Economic Policy Institute. I found some minimum wage data there and manually went through a bit of it to get enough of a feel that the WSJ wasn’t “mis-truthing” or truth-spinning. (As an aside, the minimum wage should definitely be higher, indexed and adjusted for inflation, but that’s a discussion for another time.)

While on EPI’s site, I did notice they had a “State of Working America Data Library” @ http://www.epi.org/data/. The data there is based on U.S. government published data and I validated that EPI isn’t fabricating anything (but you should not just take my word for it and do your own math from U.S. gov sources). I also noticed that the filtering interactions were delayed a bit and posited said condition was due to the site making AJAX/XHR calls to retrieve data. So, I peeked under the covers and discovered a robust, consistent, hidden API that is now wrapped in an R package dubbed epidata.

You can get the details of what’s available via the EPI site or through package docs.

What can you do with this data?

They seem to update the data (pretty much) monthly and it’s based on U.S. gov publications, so you can very easily validate “news” reports, potentially even easier than via packages that wrap the Bureau of Labor Statistics (BLS) API. On a lark, I wanted to compare the unemployment rate vs median wage over time (mostly to test the API and make a connected scatterplot). If you didn’t add the level of detail to the aesthetics I did, the following code is pretty small to do just that:

library(tidyverse)
library(epidata)
library(ggrepel)

unemployment <- get_unemployment()
wages <- get_median_and_mean_wages()

glimpse(wages)
## Observations: 43
## Variables: 3
## $ date    <int> 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, ...
## $ median  <dbl> 16.53, 16.17, 16.05, 16.15, 16.07, 16.36, 16.15, 16.07...
## $ average <dbl> 19.05, 18.67, 18.64, 18.87, 18.77, 18.83, 19.06, 18.66...

glimpse(unemployment)
## Observations: 456
## Variables: 2
## $ date <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-...
## $ all  <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.059, 0.058, 0...

group_by(unemployment, date=as.integer(lubridate::year(date))) %>%
  summarise(rate=mean(all)) %>%
  left_join(select(wages, date, median), by="date") %>%
  filter(!is.na(median)) %>%
  arrange(date) -> df

cols <- ggthemes::tableau_color_pal()(3)

ggplot(df, aes(rate, median)) +
  geom_path(color=cols[1], arrow=arrow(type="closed", length=unit(10, "points"))) +
  geom_point() +
  geom_label_repel(aes(label=date),
                   alpha=c(1, rep((4/5), (nrow(df)-2)), 1),
                   size=c(5, rep(3, (nrow(df)-2)), 5),
                   color=c(cols[2],
                           rep("#2b2b2b", (nrow(df)-2)),
                           cols[3]),
                   family="Hind Medium") +
  scale_x_continuous(name="Unemployment Rate", expand=c(0,0.001), label=scales::percent) +
  scale_y_continuous(name="Median Wage", expand=c(0,0.25), label=scales::dollar) +
  labs(title="U.S. Unemployment Rate vs Median Wage Since 1978",
       subtitle="Wage data is in 2015 USD",
       caption="Source: EPI analysis of Current Population Survey Outgoing Rotation Group microdata") +
  hrbrmisc::theme_hrbrmstr(grid="XY")

You can build your own “State of Working America Data Library” dashboard with this data (with flexdashboard and/or shiny) and be a bit of a citizen journalist, keeping the actual media in check and staying more informed an engaged.

As usual, drop issues & feature requests in the github repo. If you make some fun things with the data, drop a comment in the blog. Unless something major comes up the package should be on CRAN by the weekend.

Removing Personal Bias From Flu Severity Estimation (a.k.a. Misery Loves Data)

The family got hit pretty hard with the flu right as the Christmas festivities started and we were all pretty much bed-ridden zombies up until today (2017-01-02). When in the throes of a very bad ILI it’s easy to imagine that you’re a victim of a severe outbreak, especially with ancillary data from others that they, too, either just had/have the flu or know others who do. Thankfully, I didn’t have to accept this emotional opine and could turn to the cdcfluview package to see just how this year is measuring up.

Influenza cases are cyclical, and that’s super-easy to see with a longer-view of the CDC national data:

library(cdcfluview)
library(tidyverse)
library(stringi)

flu <- get_flu_data("national", years=2010:2016)

mutate(flu, week=as.Date(sprintf("%s %s 1", YEAR, WEEK), format="%Y %U %u")) %>% 
  select(-`AGE 25-64`) %>% 
  gather(age_group, count, starts_with("AGE")) %>% 
  mutate(age_group=stri_replace_all_fixed(age_group, "AGE", "Ages")) %>% 
  mutate(age_group=factor(age_group, levels=c("Ages 0-4", "Ages 5-24", "Ages 25-49", "Ages 50-64", "Ages 65"))) %>%
  ggplot(aes(week, count, group=age_group)) +
  geom_line(aes(color=age_group)) +
  scale_y_continuous(label=scales::comma, limits=c(0,20000)) +
  facet_wrap(~age_group, scales="free") +
  labs(x=NULL, y="Count of reported ILI cases", 
       title="U.S. National ILI Case Counts by Age Group (2010:2011 flu season through 2016:2017)",
       caption="Source: CDC ILInet via CRAN cdcfluview pacakge") +
  ggthemes::scale_color_tableau(name=NULL) +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="none")

We can use the same data to zoom in on this season:

mutate(flu, week=as.Date(sprintf("%s %s 1", YEAR, WEEK), format="%Y %U %u")) %>% 
  select(-`AGE 25-64`) %>% 
  gather(age_group, count, starts_with("AGE")) %>% 
  mutate(age_group=stri_replace_all_fixed(age_group, "AGE", "Ages")) %>% 
  mutate(age_group=factor(age_group, levels=c("Ages 0-4", "Ages 5-24", "Ages 25-49", "Ages 50-64", "Ages 65"))) %>%
  filter(week >= as.Date("2016-07-01")) %>% 
  ggplot(aes(week, count, group=age_group)) +
  geom_line(aes(color=age_group)) +
  scale_y_continuous(label=scales::comma, limits=c(0,20000)) +
  facet_wrap(~age_group, scales="free") +
  labs(x=NULL, y="Count of reported ILI cases", 
       title="U.S. National ILI Case Counts by Age Group (2016:2017 flu season)",
       caption="Source: CDC ILInet via CRAN cdcfluview pacakge") +
  ggthemes::scale_color_tableau(name=NULL) +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="none")

So, things are trending up, but how severe is this year compared to others? While looking at the number/percentage of ILI cases is one way to understand severity, another is to look at the mortality rate. The cdcfluview package has a get_mortality_surveillance_data() function, but it’s region-based and I’m really only looking at national data in this post. A helpful individual pointed out a new CSV file at https://www.cdc.gov/flu/weekly/index.htm#MS which we can reproducibly programmatically target (so we don’t have to track filename changes by hand) with:

library(rvest)

pg <- read_html("https://www.cdc.gov/flu/weekly/index.htm#MS")
html_nodes(pg, xpath=".//a[contains(@href, 'csv') and contains(@href, 'NCHS')]") %>% 
  html_attr("href") -> mort_ref
mort_url <- sprintf("https://www.cdc.gov%s", mort_ref)

df <- readr::read_csv(mort_url)

We can, then, take a look at the current “outbreak” status (when real-world mortality events exceed the model threshold):

mutate(df, week=as.Date(sprintf("%s %s 1", Year, Week), format="%Y %U %u")) %>% 
  select(week, Expected, Threshold, `Percent of Deaths Due to Pneumonia and Influenza`) %>% 
  gather(category, percent, -week) %>% 
  mutate(percent=percent/100) %>% 
  ggplot() +
  geom_line(aes(week, percent, group=category, color=category)) +
  scale_x_date(date_labels="%Y-%U") +
  scale_y_continuous(label=scales::percent) +
  ggthemes::scale_color_tableau(name=NULL) +
  labs(x=NULL, y=NULL, title="U.S. Pneumonia & Influenza Mortality",
       subtitle="Data through week ending December 10, 2016 as of December 28, 2016",
       caption="Source: National Center for Health Statistics Mortality Surveillance System") +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="bottom")

That view is for all mortality events from both influenza and pneumonia. We can look at the counts for just influenza as well:

mutate(df, week=as.Date(sprintf("%s %s 1", Year, Week), format="%Y %U %u")) %>% 
  select(week, `Influenza Deaths`) %>% 
  ggplot() +
  geom_line(aes(week, `Influenza Deaths`), color=ggthemes::tableau_color_pal()(1)) +
  scale_x_date(date_labels="%Y-%U") +
  scale_y_continuous(label=scales::comma) +
  ggthemes::scale_color_tableau(name=NULL) +
  labs(x=NULL, y=NULL, title="U.S. Influenza Mortality (count of mortality events)",
       subtitle="Data through week ending December 10, 2016 as of December 28, 2016",
       caption="Source: National Center for Health Statistics Mortality Surveillance System") +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="bottom")

It’s encouraging that the overall combined mortality rate is trending downwards and that the mortality rate for influenza is very low. Go. Science.

I’ll be adding a function to cdcfluview to retrieve this new data set a bit later in the year.

Hopefully you’ll avoid the flu and enjoy a healthy and prosperous 2017.

North Carolina’s Neighborhood

UPDATE: I’m glad I’m not the only one who was skeptical of this project: http://andrewgelman.com/2017/01/02/constructing-expert-indices-measuring-electoral-integrity-reply-pippa-norris/

When I saw the bombastic headline “North Carolina is no longer classified as a democracy” pop up in my RSS feeds today (article link: http://www.newsobserver.com/opinion/op-ed/article122593759.html) I knew it’d help feed polarization bear that’s been getting fat on ‘Murica for the past decade. Sure enough, others picked it up and ran with it. I can’t wait to see how the opposite extreme reacts (everybody’s gotta feed the bear).

As of this post, neither site linked to the actual data, so here’s an early Christmas present: The Electoral Integrity Project Data. I’m very happy this is public data since this is the new reality for “news” intake:

Read shocking headline
See no data, bad data, cherry-picked data or poorly-analyzed data
Look for the actual data
Validate data & findings
Possibly learn even more from the data that was deliberately left out or ignored

Data literacy is even more important than it has been.

Back to the title of the post: where exactly does North Carolina fall on the newly assessed electoral integrity spectrum in the U.S.? Right here (click to zoom in):

Focusing solely on North Carolina is pretty convenient (I know there’s quite a bit of political turmoil going on down there at the moment, but that’s no excuse for cherry picking) since — frankly — there isn’t much to be proud of on that entire chart. Here’s where the ‘States fit on the global rankings (we’re in the gray box):

You can page through the table to see where our ‘States fall (we’re between Guana & Latvia…srsly). We don’t always have the nicest neighbors:

This post isn’t a commentary on North Carolina, it’s a cautionary note to be very wary of scary headlines that talk about data but don’t really show it. It’s worth pointing out that I’m taking the PEI data as it stands. I haven’t validated the efficacy of their process or checked on how “activist-y” the researchers are outside the report. It’s somewhat sad that this is a necessary next step since there’s going to be quite a bit of lying with data and even more lying about-and/or-without data over the next 4+ years on both sides (more than in the past eight combined, probably).

The PEI folks provide methodology information and data. Read/study it. They provide raw and imputed confidence intervals (note how large some of those are in the two graphs) – do the same for your research. If their practices are sound, the ‘States chart is pretty damning. I would hope that all the U.S. states would be well above 75 on the rating scale and the fact that we aren’t is a suggestion that we all have work to do right “here” at home, beginning with ceasing to feed the polarization bear.

If you do download the data, here’s the R code that generated the charts:

library(tidyverse)

# u.s. ------------------------------------------------------------------------------

eip_state <- read_tsv("~/Data/eip_dataverse_files/PEI US 2016 state-level (PEI_US_1.0) 16-12-2018.tab")

arrange(eip_state, PEIIndexi) %>%
  mutate(state=factor(state, levels=state)) -> eip_state

ggplot() +
  geom_linerange(data=eip_state, aes(state, ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b00") +
  geom_segment(data=eip_state, aes(x="North Carolina", xend="North Carolina", y=-Inf, yend=Inf), size=5, color="#cccccc", alpha=1/10) +
  geom_linerange(data=eip_state, aes(state, ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b") +
  geom_point(data=eip_state, aes(state, PEIIndexi, fill=responserate), size=2, shape=21, color="#2b2b2b", stroke=0.5) +
  scale_y_continuous(expand=c(0,0.1), limits=c(0,100)) +
  viridis::scale_fill_viridis(name="Response rate\n", label=scales::percent) +
  labs(x="Vertical lines show upper & lower bounds of the 95% confidence interval\nSource: PEI Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YXUV3W)\nNorris, Pippa; Nai, Alessandro; Grömping, Max, 2016, 'Perceptions of Electoral Integrity US 2016 (PEI_US_1.0)'\ndoi:10.7910/DVN/YXUV3W, Harvard Dataverse, V1, UNF:6:1cMrtJfvUs9uBoNewfUKqA==",
       y="PEI Index (imputed)",
       title="Perceptions of Electoral Integrity: U.S. 2016 POTUS State Ratings",
       subtitle="The PEI index is designed to provide an overall summary evaluation of expert perceptions that an election\nmeets international standards and global norms. It is generated at the individual level. Unlike the individual\nindex (PEIIndex) PEIIndexi is imputed and thus fully observed for all experts and states.") +
  hrbrmisc::theme_hrbrmstr(grid="Y", subtitle_family="Hind Light", subtitle_size=11) +
  theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=15))) +
  theme(legend.position=c(0.8, 0.1)) +
  theme(legend.title.align=1) +
  theme(legend.title=element_text(size=8)) +
  theme(legend.key.size=unit(0.5, "lines")) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width=unit(3, "lines"))

# global ----------------------------------------------------------------------------

eip_world <- read_csv("~/Data/eip_dataverse_files/PEI country-level data (PEI_4.5) 19-08-2016.csv")

arrange(eip_world, PEIIndexi) %>%
  mutate(country=factor(country, levels=country)) -> eip_world

ggplot() +
  geom_linerange(data=eip_world, aes(factor(country), ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b00") +
  geom_linerange(data=eip_world, aes(factor(country), ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b") +
  geom_point(data=eip_world, aes(country, PEIIndexi), size=2, shape=21, fill="steelblue", color="#2b2b2b", stroke=0.5) +
  scale_y_continuous(expand=c(0,0.1), limits=c(0,100)) +
  labs(x="Vertical lines show upper & lower bounds of the 95% confidence interval\nSource: PEI Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LYO57K)\nNorris, Pippa; Nai, Alessandro; Grömping, Max, 2016, 'Perceptions of Electoral Integrity (PEI-4.5)\ndoi:10.7910/DVN/LYO57K, Harvard Dataverse, V2",
       y="PEI Index (imputed)",
       title="Perceptions of Electoral Integrity: 2016 Global Ratings",
       subtitle="The PEI index is designed to provide an overall summary evaluation of expert perceptions that an election\nmeets international standards and global norms. It is generated at the individual level. Unlike the individual\nindex (PEIIndex) PEIIndexi is imputed and thus fully observed for all experts and countries") +
  hrbrmisc::theme_hrbrmstr(grid="Y", subtitle_family="Hind Light", subtitle_size=11) +
  theme(axis.text.x=element_blank()) +
  theme(axis.title.x=element_text(margin=margin(t=15)))

Pipes (%>%) Everywhere

2016-12-22 – 15:09
Posted in Alfred, Apple, Chrome, macOS, R, RStudio
Tagged post
Comments (5)

An R user asked a question regarding whether it’s possible to have the RStudio pipe (%>%) shortcut (Cmd-Shift-M) available in other macOS applications. If you’re using Alfred then you can use this workflow for said task (IIRC this requires an Alfred license which is reasonably cheap).

When you add it to Alfred you must edit it to make Cmd-Shift-M the hotkey since Alfred strips the keys on import (for good reasons). I limited the workflow to a few apps (Safari, Chrome, Sublime Text, iTerm) and I think it makes sense to limit the apps it applies to (you don’t need the operator everywhere, at least IMO).

I can’t believe I didn’t do this earlier. I use R in the terminal a bit and mis-hit Cmd-Shift-M all the time when I do (since RStudio is my primary editor for R code and muscle memory is scarily powerful). I also have to use (ugh) Jupyter notebooks on occasion and this will help there, too. If you end up modifying or using the workflow, drop a note in the comments.

sergeant : An R Boot Camp for Apache Drill

2016-12-20 – 11:01
Posted in Apache Drill, dplyr, drill, R, SQL
Tagged post
Comments (3)

I recently mentioned that I’ve been working on a development version of an Apache Drill R package called sergeant. Here’s a lifted “TLDR” on Drill:

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
Drill’s datastore-aware optimizer automatically restructures a query plan to leverage the datastore’s internal processing capabilities. In addition, Drill supports data locality, so it’s a good idea to co-locate Drill and the datastore on the same nodes.

It also supports reading formats such as:

Avro
[CTP]SV ([C]omma-, [T]ab-, [P]ipe-Separated-Values)
Parquet
Hadoop Sequence Files

It’s a bit like Spark in that you can run it on a single workstation and scale up to a YUGE cluster. It lacks the ML components of Spark, but it connects to everything without the need to define a schema up front. Said “everything” includes parquet files on local filesystems, so if you need to slice through GBs of parquet data and have a beefy enough Linux workstation (I believe Drill runs on Windows and know it runs on macOS fine, too, but that’s $$$ for a bucket of memory & disk space) you can take advantage of the optimized processing power Drill offers on a single system (while also joining the data with any other data format you can think of). You can also seamlessly move the data to a cluster and barely tweak your code to support said capacity expansion.

Why `sergeant`?

There’s already an R package on CRAN to work with Drill: DrillR. It’s S4 class-based, has a decent implementation and interfaces with the REST API. However, it sticks httr::verbose() everywhere: https://github.com/cran/DrillR/search?utf8=%E2%9C%93&q=verbose.

The sergeant package interfaces with the REST API as well, but also works with the JDBC driver (the dev version includes the driver with the package, but this will be removed for the eventual CRAN submission) and includes some other niceties around Drill options viewing and setting and some other non-SQL bits. Of note: the REST API version shows an httr progress bar for data downloading and you can wrap the calls with httr::with_verbose(…) if you really like seeing cURL messages.

The other thing sergeant has going for it is a nascent dplyr interface. Presently, this is a hack-ish wrapper around the RJDBC JDBCConnection presented by the Drill JDBC driver. While basic functionality works, I firmly believe Drill needs it’s own DBI driver (like is second-cousin Preso has) to avoid collisions withy any other JDBC connections you might have open, plus more work needs to be done under the covers to deal with quoting properly and exposing more Drill built-in SQL functions.

SQL vs `dplyr`

For some truly complex data machinations you’re going to want to work at the SQL level and I think it’s important to know SQL if you’re ever going to do data work outside JSON & CSV files just to appreciate how much gnashing of teeth dplyr saves you from. Using SQL for many light-to-medium aggregation tasks that feed data to R can feel like you’re banging rocks together to make fire when you could just be using your R precision welder. What would you rather write:

SELECT  gender ,  marital_status , COUNT(*) AS  n 
FROM  cp.`employee.json` 
GROUP BY  gender ,  marital_status

in a drill-embedded or drill-localhost SQL shell? Or:

library(RJDBC)
library(dplyr)
library(sergeant)

ds <- src_drill("localhost:31010", use_zk=FALSE)

db <- tbl(ds, "cp.`employee.json`") 

count(db, gender, marital_status) %>% collect()

(NOTE: that SQL statement is what ultimately gets sent to Drill from dplyr)

Now, dplyr tbl_df idioms don’t translate 1:1 to all other src_es, but they are much easier on the eyes and more instructive in analysis code (and, I fully admit that said statement is more opinion than fact).

`sergeant` and `dplyr`

The src_drill() function uses the JDBC Drill driver and, hence, has an RJDBC dependency. The Presto folks (a “competing” offering to Drill) wrapped a DBI interface around their REST API to facilitate the use of dplyr idioms. I’m not sold on whether I’ll continue with a lightweight DBI wrapper using RJDBC or go the RPresto route, but for now the basic functionality works and changing the back-end implementation should not break anything (much).

You’ve said “parquet” alot…

Yes. Yes, I have. Parquet is a “big data” compressed columnar storage format that is generally used in Hadoop shops. Parquet is different from ‘feather’ (‘feather’ is based on another Apache foundation project: Arrow). Arrow/feather is great for things that fit in memory. Parquet and the idioms that sit on top of it enable having large amounts data available in a cluster for processing with Hadoop / Spark / Drill / Presto (etc). Parquet is great for storing all kinds of data, including log and event data which I have to work with quite a bit and it’s great being able to prototype on a single workstation then move code to hit a production cluster. Plus, it’s super-easy to, say, convert an entire, nested directory tree of daily JSON log files into parquet with Drill:

CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS
  SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts 
  FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`;

Kick the tyres

The REST and JDBC functions are solid (I’ve been using them at work for a while) and the dplyr support has handled some preliminary production work well (though, remember, it’s not fully-baked). There are plenty of examples — including a dplyr::left_join() between parquet and JSON data — in the README and all the exposed functions have documentation.

File an issue with a feature request or bug report.

I expect to have this CRAN-able in January, 2017.

Package update: longurl 0.3.0 is hitting CRAN mirrors

The longurl package has been updated to version 0.3.0 as a result of a bug report noting that the URL expansion API it was using went pay-for-use. Since this was the second time a short URL expansion service either went belly-up or had breaking changes the package is now completely client-side-based and a very thin, highly-focused wrapper around the httr::HEAD() function.

Why `longurl`?

On the D&D alignment scale, short links are chaotic evil. [Full-disclosure: I use shortened links all the time, so the pot is definitely kettle-calling here]. Ostensibly, they are for making it easier to show memorable links on tiny, glowing rectangles or printed prose but they are mostly used to directly track you and mask other tracking parameters that the target site is using to keep tabs on you. Furthermore, short URLs are also used by those with even more malicious intent than greedy startups or mega-corporations.

In retrospect, giving a third-party API service access to URLs you are interested in expanding just exacerbated the tracking problem, but many of these third-party URL expansion services do use some temporal caching of results, so they can be a bit faster than doing this in a non-caching package (but, there’s nothing stopping you putting caching code around it if you are using it in a “production” capacity).

How does the updated package work without a URL expansion API?

By default, httr “verb” requests use the curl package and that is a wrapper for libcurl. The httr verb calls set the “please follow all HTTP status 3xx redirects that are found in responses” option (this is the libcurl CURLOPT_FOLLOWLOCATION equivalent option). There are other options that can be set to help configure minutae around how redirect following works. So, just by calling httr::HEAD(some_url) you get built-in short URL expansion (if what you passed in was a short URL or a URL with a redirect).

Take, for example, this innocent link: http://lnk.direct/zFu. We can see what goes on under the covers by passing in the verbose() option to an httr::HEAD() call:

httr::HEAD("http://lnk.direct/zFu", verbose())

## -> HEAD /zFu HTTP/1.1
## -> Host: lnk.direct
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: shorturl=4e0aql3p49rat1c8kqcrmv4gn2
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx/1.0.15
## <- Date: Sun, 18 Dec 2016 19:03:48 GMT
## <- Content-Type: text/html; charset=UTF-8
## <- Connection: keep-alive
## <- X-Powered-By: PHP/5.6.20
## <- Expires: Thu, 19 Nov 1981 08:52:00 GMT
## <- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
## <- Pragma: no-cache
## <- Location: http://ow.ly/Ko70307eKmI
## <- 
## -> HEAD /Ko70307eKmI HTTP/1.1
## -> Host: ow.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 301 Moved Permanently
## <- Content-Length: 0
## <- Location: http://bit.ly/2gZq7qG
## <- Connection: close
## <- 
## -> HEAD /2gZq7qG HTTP/1.1
## -> Host: bit.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx
## <- Date: Sun, 18 Dec 2016 19:04:36 GMT
## <- Content-Type: text/html; charset=utf-8
## <- Content-Length: 127
## <- Connection: keep-alive
## <- Cache-Control: private, max-age=90
## <- Location: http://example.com/IT_IS_A_SURPRISE
## <- 
## -> HEAD /IT_IS_A_SURPRISE HTTP/1.1
## -> Host: example.com
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: _csrf/link=g3iBgezgD_OYN0vOh8yI930E1O9ZAKLr4uHmVioxwwQ; mc=null; dmvk=5856d9e39e747; ts=475630; v1st=03AE3C5AD67E224DEA304AEB56361C9F
## -> Accept: application/json, text/xml, application/xml, */*
## -> 
## <- HTTP/1.1 200 OK
## ...
## <-

We can reduce the clutter and see that it follows multiple redirects from multiple URL shorteners:

Here’s what the output of a request to longurl::expand_urls() returns:

longurl::expand_urls("http://lnk.direct/zFu")
## # A tibble: 1 × 3
##                orig_url                        expanded_url status_code
##                   <chr>                               <chr>       <int>
## 1 http://lnk.direct/zFu http://example.com/IT_IS_A_SURPRISE         200

NOTE: the link does actually go somewhere, and somewhere not malicious, political or preachy (a rarity in general in this post-POTUS-election world of ours).

What else is different?

The longurl::expand_urls() function returns a tbl_df and now includes the HTTP status code of the final, resolved link. You can also pass in a custom HTTP referrer since many (many) sites will change behavior depending on the referrer.

What’s next?

This bug-fix release had to go out fairly quickly since the package was essentially broken. With the new foundation being built on client-side machinations, future enhancements will be to pull more features (in the machine learning sense) out of the curl or httr requests (I may switch directly to using curl if I need more granular data) and include some basic visualizations for both request trees (mostly likely using the DiagrammeR and ggplot2 packages). I may try to add a caching layer, but I believe that’s more of a situation-specific feature folks should add on their own, so I may just add a “check hook” capability that will add an extra function call to a cache checking function of your choosing.

If you have a feature request, please add it to the github repo.

Minding the zoo[keeper] with R

I’ve been drafting a new R package — [`sergeant`](https://github.com/hrbrmstr/sergeant) — to work with [Apache Drill](http://drill.apache.org/) and have found it much easier to manage having Drill operating in a single node cluster vs `drill-embedded` mode (esp when I need to add a couple more nodes for additional capacity). That means running [Apache Zookeeper](https://zookeeper.apache.org/) and I’ve had the occasional need to ping the Zookeeper process to see various stats (especially when I add some systems to the cluster).

Yes, it’s very easy+straightforward to `nc hostname 2181` from the command line to issue commands (or use the Zookeeper CLIs) but when I’m in R I like to stay in R. To that end, I made a small reference class that makes it super-easy to query Zookeeper from within R. Said class will eventually be in a sister package to `sergeant` (and, a more detailed post on `sergeant` is forthcoming), but there may be others who want/need to poke at Zookeeper processes with [4-letter words](https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands) and this will help you stay within R. The only command not available is `stmk` since I don’t have the pressing need to set the trace mask.

After you source the self-contained reference class you connect and then issue commands as so:

zk <- zookeeper$new(host="drill.local")

zk$ruok()

zk$conf()
zk$stat()

Drop a note in the comments (since this isn’t on github, yet) with any questions/issues.

zookeeper <- setRefClass(

  Class="zookeeper",

  fields=list(
    host="character",
    port="integer",
    timeout="integer"
  ),

  methods=list(

    initialize=function(..., host="localhost", port=2181L, timeout=30L) {
      host <<- host ; port <<- port ; timeout <<- timeout; callSuper(...)
    },

    available_commands=function() {
      return(sort(c("conf", "envi", "stat", "srvr", "whcs", "wchc", "wchp", "mntr",
                    "cons", "crst", "srst", "dump", "ruok", "dirs", "isro", "gtmk")))
    },

    connect=function(host="localhost", port=2181L, timeout=30L) {
      .self$host <- host ; .self$port <- port ; .self$timeout <- timeout
    },

    conf=function() {
      res <- .self$send_cmd("conf")
      res <- stringi::stri_split_fixed(res, "=", 2, simplify=TRUE)
      as.list(setNames(res[,2], res[,1]))
    },

    envi=function() {
      res <- .self$send_cmd("envi")
      res <- stringi::stri_split_fixed(res[-1], "=", 2, simplify=TRUE)
      as.list(setNames(res[,2], res[,1]))
    },

    stat=function() {
      res <- .self$send_cmd("stat")
      version <- stri_replace_first_regex(res[1], "^Zoo.*: ", "")
      res <- res[-(1:2)]
      sep <- which(res=="")
      clients <- stri_trim(res[1:(sep-1)])
      zstats <- stringi::stri_split_fixed(res[(sep+1):length(res)], ": ", 2, simplify=TRUE)
      zstats <- as.list(setNames(zstats[,2], zstats[,1]))
      list(version=version, clients=clients, stats=zstats)
    },

    srvr=function() {
      res <- .self$send_cmd("srvr")
      zstats <- stringi::stri_split_fixed(res, ": ", 2, simplify=TRUE)
      as.list(setNames(zstats[,2], zstats[,1]))
    },

    wchs=function() {
      res <- .self$send_cmd("wchs")
      conn_path <- stri_match_first_regex(res[1], "^([[:digit:]]+) connections watching ([[:digit:]]+) paths")
      tot_watch <- stri_match_first_regex(res[2], "Total watches:([[:digit:]]+)")
      list(connections_watching=conn_path[,2], paths=conn_path[,3], total_watches=tot_watch[,2])
    },

    wchc=function() {
      stri_trim(.self$send_cmd("wchc")) %>% discard(`==`, "") -> res
      setNames(list(res[2:length(res)]), res[1])
    },

    wchp=function() {
      .self$send_cmd("wchp") %>% stri_trim() %>% discard(`==`, "") -> res
      data.frame(
        path=qq[seq(1, length(qq), 2)],
        address=qq[seq(2, length(qq), 2)],
        stringsAsFactors=FALSE
      )
    },

    mntr=function() {
      res <- .self$send_cmd("mntr")
      res <- stringi::stri_split_fixed(res, "\t", 2, simplify=TRUE)
      as.list(setNames(res[,2], res[,1]))
    },

    cons=function() { list(clients=stri_trim(.self$send_cmd("cons") %>% discard(`==`, ""))) },
    crst=function() { message(.self$send_cmd("crst")) ; invisible() },
    srst=function() { message(.self$send_cmd("srst")) ; invisible() },
    dump=function() { paste0(.self$send_cmd("dump"), collapse="\n") },
    ruok=function() { .self$send_cmd("ruok") == "imok" },
    dirs=function() { .self$send_cmd("dirs") },
    isro=function() { .self$send_cmd("isro") },
    gtmk=function() { R.utils::intToBin(as.integer(.self$send_cmd("gtmk"))) },

    send_cmd=function(cmd) {
      require(purrr)
      require(stringi)
      require(R.utils)
      sock <- purrr::safely(socketConnection)
      con <- sock(host=.self$host, port=.self$port, blocking=TRUE, open="r+", timeout=.self$timeout)
      if (!is.null(con$result)) {
        con <- con$result
        cat(cmd, file=con)
        response <- readLines(con, warn=FALSE)
        a <- try(close(con))
        purrr::flatten_chr(stringi::stri_split_lines(response))
      } else {
        warning(sprintf("Error connecting to [%s:%s]", host, port))
      }
    }

  )

)