dplyr Archives

Lost In [SQL] Translation: Charting d[b]plyr Mapped SQL Function Support Across All Backends

Like more posts than I care to admit, this one starts innocently enough with a tweet by @gshotwell:

Is there a reference document somewhere of which dplyr commands work on various database backends? #rstats

— Gordon Shotwell (@gshotwell) April 9, 2019

Since I use at least 4 different d[b]plyr backends every week, this same question surfaces in my own noggin on occasion and I couldn’t resist going all Columbo on this mystery.

I should note that if you only really care about the backends that come with dbplyr @paleolimbot has you covered with this post, which also shows you the translated SQL!

Executing The Plan

There are at least 24 separate backends for dbplyr. Most folks won’t need more than one if their databases all have a decent ODBC or JDBC driver. To be able to use dplyr idioms with databases there needs to be a way to translate R code (e.g. function calls) into SQL. A ton of functions are pre-mapped in dbplyr already and most backend implementations start by relying on these defaults. Furthermore, since SQL is not nearly as “standard” across installations as one might think, some common tasks — such as string manipulation — have a default noop translation.
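To see what that translation step looks like in isolation, here is a minimal sketch, assuming the dbplyr package is installed. It relies on dbplyr's simulated connections (simulate_postgres() and simulate_sqlite()), so no live database is needed. The column names x and y are made up for illustration.

```r
library(dbplyr)

# Simulated connections only carry a backend class; nothing is actually connected.
pg_con     <- simulate_postgres()
sqlite_con <- simulate_sqlite()

# Translate the same R expression for two different backends.
pg_sql     <- translate_sql(paste0(x, y), con = pg_con)
sqlite_sql <- translate_sql(paste0(x, y), con = sqlite_con)

# Print both so the generated SQL can be compared side by side.
print(pg_sql)
print(sqlite_sql)
```

Passing con explicitly keeps the call unambiguous about which backend's translations are being used.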

If you do have to switch across backends with any frequency, knowing which backend provides support for which functions might be nice, but there hasn’t been a reference for this until Dewey & I accepted Gordon’s challenge. What makes this a “challenge” is that you first have to figure out what packages provide a d[b]plyr backend interface then figure out what SQL translations they offer (they don’t necessarily have to inherit from the ones provided by dbplyr and may add other ones to account for SQL clauses that aren’t in functional form). So the first step was just a look through CRAN for which packages import dbplyr and also adding in some I knew were on GitHub:

library(stringi)
library(hrbrthemes)
library(tidyverse)

# All the pkgs from the home CRAN mirror that import 'dbplyr'
c(
  "arkdb", "bigrquery", "childesr", "chunked", "civis", "corrr", "cytominer", "dbplot",
  "dbplyr", "dexter", "dexterMST", "dlookr", "dplyr", "dplyr.teradata", "etl",
  "healthcareai", "hydrolinks", "implyr", "infuser", "ipumsr", "macleish", "mdsr",
  "mlbgameday", "modeldb", "MonetDBLite", "mudata2", "parsemsf", "pivot", "pleiades",
  "pool", "poplite", "RClickhouse", "replyr", "RPresto", "sergeant", "sparklyr",
  "sqlscore", "srvyr", "taxizedb", "valr", "wordbankr", "metis.tidy"
) -> pkgs

I ended up doing install.packages(pkgs) which was easy since I have a home CRAN mirror and use macOS (so binary package installs).

The presence of a dbplyr import does not mean a package implements a backend, so we have to load their namespaces and see if they have the core “tell” (i.e. they implement a sql_translate_env() method):

(map_df(pkgs, ~{
  tibble(
    pkg = .x,
    trans = loadNamespace(.x) %>%
      names() %>%
      keep(stri_detect_fixed, "sql_translate")
  )
}) -> xdf)
## # A tibble: 28 x 2
##    pkg       trans
##    <chr>     <chr>
##  1 bigrquery sql_translate_env.BigQueryConnection
##  2 civis     sql_translate_env.CivisConnection
##  3 dbplyr    sql_translate_env.ACCESS
##  4 dbplyr    sql_translate_env.Oracle
##  5 dbplyr    sql_translate_env.SQLiteConnection
##  6 dbplyr    sql_translate_env.Impala
##  7 dbplyr    sql_translate_env.OdbcConnection
##  8 dbplyr    sql_translate_env.MySQLConnection
##  9 dbplyr    sql_translate_env.PqConnection
## 10 dbplyr    sql_translate_env.PostgreSQLConnection
## # … with 18 more rows
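The namespace scan can also be sketched with base R only. This is an illustrative stand-in rather than the code used for the post: find_in_namespace() is a hypothetical helper, and it looks in the stats package for predict. methods purely because stats ships with every R install.

```r
# Hypothetical helper: list objects in a package namespace whose names start
# with a given prefix. Packages that are not installed yield an empty result
# instead of an error.
find_in_namespace <- function(pkg, prefix) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(character(0))
  ns_names <- names(loadNamespace(pkg))
  sort(ns_names[startsWith(ns_names, prefix)])
}

# "stats" and "predict." are stand-ins; the post scans backend packages
# for names containing "sql_translate".
hits <- find_in_namespace("stats", "predict.")
head(hits)
```

Guarding with requireNamespace() means a long package list can be scanned even when a few of the packages failed to install.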

Now we know the types of connections each package has SQL translation support for. But, we’re looking for the actual functions they provide. To discover that, we’re going to make dummy classed connection objects and get the translations they offer.

However, some may take the defaults from dbplyr and not override them so we also need to test if they use the sql_not_supported() noop, which we can do by seeing if the function body has a call to stop() in it. We’re also going to ignore maths operators along the way:

(filter(xdf, stri_detect_fixed(trans, ".")) %>%
  filter(trans != "sql_translate_env.NULL") %>% # ignore NULL
  filter(trans != "sql_translate_env.Pool") %>% # ignore db connection pooling 
  filter(trans != "sql_translate_env.PrestoConnection") %>% # this one errored out
  mutate(ƒ = map(trans, ~{

    # get the sql translate functions
    con <- NA
    cls <- stri_replace_first_fixed(.x, "sql_translate_env.", "")
    class(con) <- cls

    env <- sql_translate_env(con)

    # but ^^ rly isn't a nice, tidy object, it's a list of environments
    # with functions in it so we have to iterate through it to extract
    # the function names.

    map_df(env, ~{

      part <- .x
      fs <- names(part)

      # but it's not just good enough to do that b/c a given function name
      # might just implement the "sql_not_supported()" pass through. So we have
      # to actually look to see if the function body has a "stop()" call in it
      # and ignore it if it does.

      map_df(fs, ~{
        tibble(ƒ = .x, src = paste0(as.character(body(part[[.x]])), collapse = "; ")) %>% # this gets the body of the function
          filter(!stri_detect_fixed(src, "stop(")) %>%
          filter(stri_detect_regex(ƒ, "[[:alpha:]]")) %>% # and we rly don't care about maths
          select(-src)
      })
    })
  })) %>%
  unnest(ƒ) %>%
  mutate(trans = stri_replace_first_fixed(trans, "sql_translate_env.", "")) -> xdf)
## # A tibble: 1,318 x 3
##    pkg       trans              ƒ
##    <chr>     <chr>              <chr>
##  1 bigrquery BigQueryConnection median
##  2 bigrquery BigQueryConnection gsub
##  3 bigrquery BigQueryConnection as.logical
##  4 bigrquery BigQueryConnection is.null
##  5 bigrquery BigQueryConnection case_when
##  6 bigrquery BigQueryConnection is.na
##  7 bigrquery BigQueryConnection if_else
##  8 bigrquery BigQueryConnection str_replace_all
##  9 bigrquery BigQueryConnection as.integer
## 10 bigrquery BigQueryConnection as.character
## # … with 1,308 more rows
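Two tricks carry the code above: dispatching on a dummy object that only has a class attribute, and inspecting function bodies for a stop() call. Here is a toy sketch of both in base R. Everything in it (toy_translate_env(), FakeConnection, not_supported()) is made up for illustration and is not part of dplyr or dbplyr.

```r
# A stand-in generic; the real one used in the post is sql_translate_env().
toy_translate_env <- function(con) UseMethod("toy_translate_env")

# Stub factory that mimics an "unsupported" translation by calling stop().
not_supported <- function(name) {
  force(name)
  function(...) stop(name, "() is not available in this SQL variant")
}

# Method for a made-up connection class.
toy_translate_env.FakeConnection <- function(con) {
  list(
    toupper = function(x) paste0("UPPER(", x, ")"),
    grepl   = not_supported("grepl"),
    `+`     = function(x, y) paste(x, "+", y)
  )
}

# The dummy "connection" is just NA with a class attribute.
con <- structure(NA, class = "FakeConnection")
fns <- toy_translate_env(con)

# Keep functions whose body has no stop( call and whose name contains a letter.
has_stop <- vapply(
  fns,
  function(f) any(grepl("stop(", deparse(body(f)), fixed = TRUE)),
  logical(1)
)
has_letter <- grepl("[[:alpha:]]", names(fns))
supported <- names(fns)[!has_stop & has_letter]
supported
```

Matching on the text "stop(" is a heuristic: a translation that legitimately calls stop() for argument checking would be filtered out as well.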

The rest is all just ggplot2 basics:

 mutate(xdf, db = glue::glue("{pkg}\n{trans}")) %>% # make something useful to display for the DB/conn
  mutate(n = 1) %>% # heatmap block on
  complete(db, ƒ) %>% # complete the heatmap
  arrange(ƒ) %>%
  mutate(ƒ = factor(ƒ, levels=rev(unique(ƒ)))) %>% # arrange the Y axis in the proper order
  ggplot(aes(db, ƒ)) +
  geom_tile(aes(fill = n), color="#2b2b2b", size=0.125, show.legend=FALSE) +
  scale_x_discrete(expand=c(0,0.1), position = "top") +
  scale_fill_continuous(na.value="white") +
  labs(
    x = NULL, y = NULL,
    title = "SQL Function Support In Known d[b]plyr Backends"
  ) +
  theme_ipsum_ps(grid="", axis_text_size = 9) + # you'll need to use the dev version of hrbrthemes for this function; just sub out a diff theme if you already have hrbrthemes loaded
  theme(axis.text.y = element_text(family = "mono", size = 7))

Which makes:

(WP wouldn’t make the featured image linkable so I had to stick it in again to enable the link so folks can make it full size, which is absolutely necessary to see it).

FIN

If you do play with the above, don’t forget to go one more step and incorporate Dewey’s actual SQL mapping to see just how unstandardized the SQL standard is.

Contiguous code for the above is over at SourceHut.

Apache Drill 1.15.0 + sergeant 0.8.0 = pcapng Support, Proper Column Types & Mounds of New Metadata

Apache Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores […] without having to create and manage schemas. […] It has a schema-free JSON document model similar to MongoDB and Elasticsearch; [a plethora of APIs, including] ANSI SQL, ODBC/JDBC, and HTTP[S] REST; [is] extremely user and developer friendly; [and, has a] pluggable architecture [that] enables connectivity to multiple datastores.

To ring in the new year the Drill team knocked out a new 1.15.0 release with a cadre of new functionality including:

CROSS JOIN support
New metadata tables for files and system functions
pcapng support
enhanced datetime functions
better security
and, web UI improvements

One super-helpful new feature of the REST API is that it now returns query results metadata along with the query results themselves. This means REST API endpoints finally know both column order and column type. This gave me cause to re-visit the sergeant package [GL|GH] and make some accommodations for some of these new features.

Ushering In A New Order

Drill REST API queries return a "columns" field and "metadata" field with the data itself. We can use that to force an order to the columns as well as mostly use proper types (vs JSON-parsed/guessed types). I say mostly since the package still uses jsonlite to parse the results and there’s no support for 64-bit integers in jsonlite (more on this later).
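As a rough sketch of the idea (not sergeant's actual implementation), the snippet below applies a "columns" and "metadata" pair to an already-parsed result. The resp list is a hand-built stand-in for a parsed REST response, and the type-to-converter table is an assumption covering only three Drill types.

```r
# Hand-built stand-in for a parsed response; note that the rows arrive in a
# different column order than "columns" says they should have.
resp <- list(
  columns  = c("employee_id", "full_name"),
  metadata = c("BIGINT", "VARCHAR"),
  rows     = data.frame(
    full_name   = c("Sheri Nowmer", "Derrick Whelply"),
    employee_id = c("1", "2"),
    stringsAsFactors = FALSE
  )
)

# Force the column order reported by the server.
out <- resp$rows[, resp$columns, drop = FALSE]

# Assumed mapping from Drill type names to R converters. BIGINT becomes a
# double here because base R has no 64-bit integer type.
converters <- list(BIGINT = as.numeric, INT = as.integer, VARCHAR = as.character)

for (i in seq_along(resp$columns)) {
  convert <- converters[[resp$metadata[i]]]
  if (!is.null(convert)) out[[i]] <- convert(out[[i]])  # unknown types are left alone
}

str(out)
```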

We’ll use the example Charles Givre provided in his Jira issue (DRILL-6847) since it will let me demonstrate more of that “mostly” comment and show off another new feature:

library(sergeant) # 0.8.0 branch of sergeant on gitlab or github
library(tidyverse)

con <- src_drill("localhost")

x <- tbl(con, "cp.`employee.json`")

mutate(x, employee_id = as.integer64(employee_id)) %>% 
  mutate(position_id = as.integer64(position_id)) %>% 
  select(
    employee_id, full_name, first_name, last_name, 
    position_id, position_title
  ) -> bigint_result

The above is (logically):

SELECT 
  CAST (employee_id AS BIGINT) AS employee_id,
  full_name,
  first_name, 
  last_name, 
  CAST (position_id AS BIGINT) AS position_id, 
  position_title 
FROM cp.`employee.json`

What do we get when we take a preview of the result?

bigint_result
## # Source:   lazy query [?? x 6]
## # Database: DrillConnection
##    employee_id full_name  first_name last_name position_id position_title 
##          <dbl> <chr>      <chr>      <chr>           <dbl> <chr>          
##  1           1 Sheri Now… Sheri      Nowmer              1 President      
##  2           2 Derrick W… Derrick    Whelply             2 VP Country Man…
##  3           4 Michael S… Michael    Spence              2 VP Country Man…
##  4           5 Maya Guti… Maya       Gutierrez           2 VP Country Man…
##  5           6 Roberta D… Roberta    Damstra             3 VP Information…
##  6           7 Rebecca K… Rebecca    Kanagaki            4 VP Human Resou…
##  7           8 Kim Brunn… Kim        Brunner            11 Store Manager  
##  8           9 Brenda Bl… Brenda     Blumberg           11 Store Manager  
##  9          10 Darren St… Darren     Stanz               5 VP Finance     
## 10          11 Jonathan … Jonathan   Murraiin           11 Store Manager  
## # ... with more rows

Warning message:
One or more columns are of type BIGINT. The sergeant package currently uses jsonlite::fromJSON()
to process Drill REST API result sets. Since jsonlite does not support 64-bit integers BIGINT 
columns are initially converted to numeric since that's how jsonlite::fromJSON() works. This is
problematic for many reasons, including trying to use 'dplyr' idioms with said converted 
BIGINT-to-numeric columns. It is recommended that you 'CAST' BIGINT columns to 'VARCHAR' prior to
working with them from R/'dplyr'.

If you really need BIGINT/integer64 support, consider using the R ODBC interface to Apache Drill 
with the MapR ODBC drivers.

This informational warning will only be shown once per R session and you can disable them from 
appearing by setting the 'sergeant.bigint.warnonce' option to 'FALSE' 
(i.e. options(sergeant.bigint.warnonce = FALSE)).

The first thing sergeant users will notice is proper column order (previously, columns came back in whatever order they appeared in the JSON rows[] structure). The second thing is that we didn’t get integer64s back. Instead, we got doubles plus an informational warning about why and what you can do about it. Said warning only displays once per session and can be silenced with the sergeant.bigint.warnonce option, i.e. just put:

options(sergeant.bigint.warnonce = FALSE)

in your script or ~/.Rprofile and you won’t hear from it again.

The as.integer64() we used is not from the bit64 package but an internal sergeant package function that knows how to translate said operation to, e.g. CAST( employee_id AS BIGINT ).

You can use the ODBC drivers to gain BIGINT support and there are plans for the 0.8.0 branch to eventually use rapidjsonr at the C++-level to provide direct in-package support for BIGINTs as well.
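For a concrete feel for why BIGINT-as-double is risky, here is a small base R illustration. It only relies on how IEEE 754 doubles behave; the two ID strings are made-up values.

```r
# Doubles have a 53-bit significand, so not every integer above 2^53 can be
# represented exactly.
limit <- 2^53

# Adding 1 is lost to rounding, so two distinct 64-bit values collide.
collides <- (limit + 1) == limit

# Kept as text (what a CAST to VARCHAR hands back), the values stay distinct.
ids_chr <- c("9007199254740992", "9007199254740993")
distinct_as_text <- length(unique(ids_chr))

collides
distinct_as_text
```

This is the reasoning behind the warning's advice to CAST BIGINT columns to VARCHAR when exact values matter, e.g. for IDs used in joins.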

Better Error Messages

Drill query errors that the sergeant package bubbled up through its various interfaces have not been pretty or all that useful. This has changed with the 0.8.0 branch. Let’s take a look:

tbl(con, "cp.employees.json")
## # Source:   table<cp.employees.json> [?? x 4]
## # Database: DrillConnection

Warning message:
VALIDATION ERROR: From line 2, column 6 to line 2, column 24: Object 'cp.employees.json' not found

Original Query:

  1: SELECT *
  2: FROM `cp.employees.json`
  3: LIMIT 10

Query Profile Error Link:
http://localhost:8047/profiles/079fc8cf-19c6-4c78-95a9-0b949a3ecf4c

As you can see in the above output, you now get a highly-formatted return value with the original SQL query broken into lines (with line numbers) and a full link to the Drill query profile so you can dig in to the gnarly details of complex query issues. As you work with this and find edge cases I missed for messages, drop an issue on your social-coding site of choice.
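The line-numbered query layout is easy to reproduce for your own logging. The helper below is a hypothetical sketch in base R, not the code sergeant uses internally.

```r
# Hypothetical helper: prefix each line of a query with a right-aligned
# line number, similar to the "Original Query" block in the warning above.
number_query_lines <- function(query) {
  lines <- strsplit(query, "\n", fixed = TRUE)[[1]]
  sprintf("%3d: %s", seq_along(lines), lines)
}

q <- "SELECT *\nFROM `cp.employees.json`\nLIMIT 10"
cat(number_query_lines(q), sep = "\n")
```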

SUPPORT ALL THE PCAPs!

Drill has had packet capture (PCAP) file support for a while now and 1.15.0 adds support for the more modern/rich pcapng format. To enable support for this you need to add "pcapng": {"type": "pcapng", "extensions": ["pcapng"] }, to the "formats" section of your storage plugins and also configure a workspace directory to use that as the default (the principle of which is covered here).

We’ll use one of the Wireshark example captures to demonstrate:

pcaps <- tbl(con, "dfs.caps.`*.pcapng`")

glimpse(pcaps)
## Observations: ??
## Variables: 25
## $ tcp_flags_ece_ecn_capable            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece_congestion_experienced <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_psh                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ type                                 <chr> "TCP", "TCP", "TCP", "TCP...
## $ tcp_flags_cwr                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ dst_ip                               <chr> "74.125.28.139", "10.254....
## $ src_ip                               <chr> "10.254.157.208", "74.125...
## $ tcp_flags_fin                        <int> 1, 1, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_ece                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags                            <int> 17, 17, 16, 16, 16, 0, 0,...
## $ tcp_flags_ack                        <int> 1, 1, 1, 1, 1, 0, 0, 0, 0...
## $ src_mac_address                      <chr> "00:05:9A:3C:7A:00", "00:...
## $ tcp_flags_syn                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_flags_rst                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ timestamp                            <dttm> 2015-04-14 07:19:25, 201...
## $ tcp_session                          <dbl> 8.353837e+17, 8.353837e+1...
## $ packet_data                          <chr> "\"3DU...<z...E..(J.@.......
## $ tcp_parsed_flags                     <chr> "ACK|FIN", "ACK|FIN", "AC...
## $ tcp_flags_ns                         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ src_port                             <int> 60268, 443, 60268, 58382,...
## $ packet_length                        <int> 54, 54, 54, 55, 66, 78, 7...
## $ tcp_flags_urg                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ tcp_ack                              <int> 662445631, 1496589825, 66...
## $ dst_port                             <int> 443, 60268, 443, 29216, 5...
## $ dst_mac_address                      <chr> "00:11:22:33:44:55", "00:...

count(pcaps, src_ip, dst_ip, sort=TRUE)
## # Source:     lazy query [?? x 3]
## # Database:   DrillConnection
## # Groups:     src_ip
## # Ordered by: desc(n)
##    src_ip         dst_ip             n
##    <chr>          <chr>          <dbl>
##  1 10.254.157.208 10.254.158.25    298
##  2 10.254.158.25  10.254.157.208   204
##  3 174.137.42.81  10.254.157.208    76
##  4 10.254.157.208 10.254.158.8      54
##  5 10.254.158.8   10.254.157.208    49
##  6 74.125.28.102  10.254.157.208    49
##  7 10.254.157.208 74.125.28.102     44
##  8 10.254.157.208 174.137.42.81     41
##  9 54.84.98.25    10.254.157.208    25
## 10 157.55.56.168  10.254.157.208    25
## # ... with more rows

More work appears to be planned by the Drill team to enable digging into the packet (binary) contents.

Drill Metadata As Data

Drill already provided ways to look up Drill operational information as actual tables, and with this release the Drill team has added support for even more metadata-as-data queries.

First up is finally having better access to filesystem information. Prior to 1.15.0 one could get file and path attributes as part of other queries, but now we can treat filesystems as actual data. Let’s list all the PCAPs in the above workspace:

tbl(con, "information_schema.`schemata`") %>% 
  filter(SCHEMA_NAME == "dfs.caps") %>% 
  print() %>% 
  pull(SCHEMA_NAME) -> pcap_schema
## # Source:   lazy query [?? x 9]
## # Database: DrillConnection
##   CATALOG_NAME SCHEMA_NAME SCHEMA_OWNER TYPE  IS_MUTABLE
##   <chr>        <chr>       <chr>        <chr> <chr>     
## 1 DRILL        dfs.caps    <owner>      file  NO

tbl(con, "information_schema.`files`") %>% 
  filter(schema_name == pcap_schema) %>% 
  glimpse()
## Observations: ??
## Variables: 13
## $ SCHEMA_NAME       <chr> "dfs.caps"
## $ ROOT_SCHEMA_NAME  <chr> "dfs"
## $ WORKSPACE_NAME    <chr> "caps"
## $ FILE_NAME         <chr> "dof-short-capture.pcapng"
## $ RELATIVE_PATH     <chr> "dof-short-capture.pcapng"
## $ IS_DIRECTORY      <lgl> FALSE
## $ IS_FILE           <lgl> TRUE
## $ LENGTH            <dbl> 634280
## $ OWNER             <chr> "hrbrmstr"
## $ GROUP             <chr> "staff"
## $ PERMISSION        <chr> "rw-r--r--"
## $ ACCESS_TIME       <dttm> 1969-12-31 19:00:00
## $ MODIFICATION_TIME <dttm> 2019-01-01 19:12:17

The Drill system options table now has full descriptions for the options, and Drill also provides a new table that knows about all of Drill’s functions and all your custom UDFs. drill_opts() and drill_functions() return a data frame of all this info and have an optional browse parameter which, if set to TRUE, will show a DT interactive data table for them. I find this especially handy when I forget something like regexp_like syntax (I use a lot of back-ends and many are wildly different).

FIN

Keep on the lookout for the rapidjsonr/BIGINT integration and more new features of the sergeant package. NOTE: The better error messages have been ported over to the sergeant.caffeinated package (the RJDBC interface) and the other niceties will make their way into that package soon as well.

So, make sure you’re using the 0.8.0 GL / GH, kick the tyres, and file issues where you’re most comfortable working.

May your queries all be optimized and result sets complete in the new year!

Engaging the tidyverse Clean Slate Protocol

I caught the 0.7.0 release of dplyr on my home CRAN server early Friday morning and immediately set out to install it since I’m eager to finish up my sergeant package and get it on CRAN. “Tidyverse” upgrades aren’t trivial for me as I tinker quite a bit with the tidyverse and create packages that depend on various components. The sergeant package provides — amongst other things — a dplyr back-end for Apache Drill, so it has more tidyverse tendrils than other bits of code I maintain.

macOS binaries weren’t available yet (it generally takes 24-48 hrs for that) so I did an install.packages("dplyr", type="source") and was immediately hit with gcc 7 compilation errors. This seemed odd, but switching back to clang worked fine.

I, then, proceeded to run chunks in an Rmd I’m working on and hit “Encoding” errors on mutate() calls. Not having time to debug further I reverted to 0.5.0 of dplyr and went about my day and promised the tidyverse maintainers that I’d work on a reproducible example after work.

I made R data files from the data frames that were tossing errors and extracted & tweaked a code snippet that consistently generated the error and created a rocker container on one of my linux boxes to validate that this was an error and a cross-platform one. The rocker container used a full fresh-from-source copy of the tidyverse including dplyr 0.7.0. The code worked and no error was generated, so I immediately suspected package rot on my main dev macOS box.

Now, my situation is complicated by an insanely hasty migration to macOS 10.13β1 (I refuse to use the Apple macOS catchy names anymore since the most recent one is just silly) and a move to the gcc 7 toolchain (initially prompted to both get rJava working nicely and reproduce some CRAN noted errors with some packages). Further complications were also created by many invocations of install_github() of various packages regularly overwriting bits of the tidyverse over the past few weeks since the R 3.4.0 release. In other words, the integrity of the “tidyverse” was in serious question on my system and it was time for the Clean Slate Protocol.

Rather than itemize package versions and surgically nip and tuck, I opted to use packrat to get to my desired end-state of a full-integrity tidyverse install. There are many ways to do this. Feel free to “one-up” me and show your l33t method in the comments. This one will likely be accessible to most, if not all, R users.

I started a new RStudio project in a new session and told it to use packrat. In the new project console, I did install.packages("tidyverse", type="source") and let it go for many minutes. I, then, navigated to the packrat subdirectory where the 3.4 package binaries are housed (just follow the project packrat tree down to the R version directory) and moved all 51 packages (yes, 51 O_o) to the main R library path (which you can figure out by running .libPaths() in any non-packrat-maintained project).
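Before moving anything around, it can help to see which library paths R is actually using and how much lives in each one. A minimal, read-only sketch in base R:

```r
# Library paths R searches, in priority order.
libs <- .libPaths()

# Count the installed packages found in each library path.
pkg_counts <- vapply(
  libs,
  function(lp) nrow(installed.packages(lib.loc = lp)),
  integer(1)
)

data.frame(library = libs, packages = unname(pkg_counts), row.names = NULL)
```

Run it once in a regular session and once inside the packrat project to see the two different library trees.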

After doing that, I fired up the originally failing Rmd and everything worked fine.

I don’t do the Clean Slate Protocol too often (we all get to for new R dot-releases) but it came in handy this time. If you run into errors when trying to get the new dplyr working, you may benefit from the Clean Slate Protocol as well.

If you haven’t seen the changes in 0.6.0/0.7.0 you should check them out and give it a go.

All-in on R⁶ : Progress [bars] on first post

@eddelbuettel’s idea is a good one. (it’s a quick read…jump there and come back). We’ll avoid confusion and call it R⁶ over here. Feel free to don the superclass.

I often wait for a complete example or new package announcement to blog something when a briefly explained snippet might have sufficient utility for many R users. Also, tweets are fleeting and twitter could end up on the island of misfit social media sites if it can’t generate revenue or find a giant buyer this year. Don’t get me wrong, twitter convos are fine/useful, but blogs are at least semi-permanent, especially if you let them be hoovered up by the Internet Archive (“Save Page Now” on their site or use this handy Chrome extension).

I’ll tag all R⁶ posts as “r6” if you want to auto-filter those out of your stream or just page through them.

I’ll also lead off these micro-posts with a simple one: adding progress bars to your tidyverse purrr operations.

The purrr::map* functions enable expressive and type-safe vectorized operations. Mine are usually over a few million/billion IPv4 addresses or domain names/URLs and often involve moderately lengthy tasks so I usually add the ability to incorporate progress bars to functions I make (and, I’m trying hard to get out of the bad habit of long-ish anonymous functions in purrr calls). The following is a toy example, but it’s a working example you can run in your interactive R session now:

library(tidyverse)

arduously_long_nchar <- function(input_var, .pb=NULL) {
  
  if ((!is.null(.pb)) && inherits(.pb, "Progress") && (.pb$i < .pb$n)) .pb$tick()$print()
  
  Sys.sleep(1)
  
  nchar(input_var)
  
}

pb <- progress_estimated(length(letters))

map_int(letters, arduously_long_nchar, .pb=pb)

And, yes, I did make you wait ~26 seconds (unless you were intrepid enough to reduce the amount of sleep time :-)

If you happen to forget the progress bar object (or know you don’t need one):

map_int(letters, arduously_long_nchar)

the function still works (sans progress bars).

If you happen to also mess up what you pass in to the .pb parameter or get your progress bar out of sync with your object it won’t error out on you (it can be made much safer and wrapped in another function, say — tick_off(.pb) — but this is supposed to be a small post).
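One possible shape for that wrapper is sketched below. tick_off() is the hypothetical helper name floated above, and the version here is an assumption about how it could look rather than tested production code. It advances the bar only when the object looks like a usable progress bar and otherwise does nothing.

```r
# Hypothetical helper: tick a progress bar only when it is safe to do so.
# Invisibly returns TRUE if a tick happened and FALSE otherwise.
tick_off <- function(.pb = NULL) {
  usable <- !is.null(.pb) &&
    inherits(.pb, "Progress") &&
    isTRUE(.pb$i < .pb$n)        # isTRUE() guards against missing fields
  if (usable) .pb$tick()$print()
  invisible(usable)
}

# The worker function no longer needs its own guard logic.
arduously_long_nchar <- function(input_var, .pb = NULL) {
  tick_off(.pb)
  nchar(input_var)
}

# Works without a bar, and a wrong object passed as .pb is silently ignored.
no_bar  <- vapply(letters, arduously_long_nchar, integer(1))
bad_bar <- vapply(letters, arduously_long_nchar, integer(1), .pb = "not a bar")
```

With a real bar from progress_estimated(), the call stays the same as in the post: map_int(letters, arduously_long_nchar, .pb = pb).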

Comments/feedback/your-own-progress-methods are most welcome and encouraged.

Making a Case for case_when

This is a brief (and likely obvious, for some folks) post on the dplyr::case_when() function.

Part of my work-work is dealing with data from internet scans. When we’re performing a deeper inspection of a particular internet protocol or service we try to capture as much system and service metadata as possible. Sifting through said metadata to find individual and collective insight is often a painful task given the diversity in the internet ecosystem.

One attribute we try to collect in all our service scans is operating system (OS) version. For many of our minutiae-focused researchers, it’s vital to know if a host is using “CoreOS 899.17.0” vs “CoreOS 835.9.0”. For much of the aggregation and clustering work we do, “CoreOS” is just fine.

In broad scans for any given service the OS diversity can be YUGE. There may be upwards of 10 different variations each of Windows, Red Hat, Ubuntu, Debian, et al. present along with a smattering of very infrequent OS types such as “Scientific Linux”. Plus, we can always count on probes returning NA values for many discrete attribute queries, including OS type+version.

There are many ways to collapse a diverse list of OS type+version strings into a smaller target set. switch() and ifelse() are likely go-to solutions for many of you reading this. If you are in those camps and haven’t tried dplyr::case_when(), read on!

Noise Reduction

To illustrate the utility of case_when(), let’s walk through an example. I created a tiny excerpt of just the OS type + version info from 500 observations out of a much larger internet scan. You can find that data at https://rud.is/dl/os.txt. Let’s take a look at the OS diversity:

library(ggalt)
library(hrbrthemes)
library(tidyverse)

os <- read_lines("https://rud.is/dl/os.txt", na = "NA")

str(table(os, useNA = "always"))
##  'table' int [1:28(1d)] 2 3 1 1 1 44 3 101 1 6 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ os: chr [1:28] "" "<unknown>" "Amazon Linux AMI 2016.03" "Amazon Linux AMI 2016.09" ...

sort(unique(os))
##  [1] ""                                           
##  [2] "<unknown>"                                  
##  [3] "Amazon Linux AMI 2016.03"                   
##  [4] "Amazon Linux AMI 2016.09"                   
##  [5] "Arch Linux"                                 
##  [6] "CentOS Linux 7 (Core)"                      
##  [7] "CoreOS 766.4.0"                             
##  [8] "CoreOS 899.17.0"                            
##  [9] "Debian GNU/Linux 7 (wheezy)"                
## [10] "Debian GNU/Linux 8 (jessie)"                
## [11] "Fedora 20 (Heisenbug)"                      
## [12] "linux"                                      
## [13] "openSUSE Leap 42.2"                         
## [14] "RancherOS v0.7.0"                           
## [15] "Red Hat Enterprise Linux Server 7.2 (Maipo)"
## [16] "Red Hat Enterprise Linux Server 7.3 (Maipo)"
## [17] "Ubuntu 14.04.1 LTS"                         
## [18] "Ubuntu 14.04.2 LTS"                         
## [19] "Ubuntu 14.04.3 LTS"                         
## [20] "Ubuntu 14.04.4 LTS"                         
## [21] "Ubuntu 14.04.5 LTS"                         
## [22] "Ubuntu 15.10"                               
## [23] "Ubuntu 16.04.1 LTS"                         
## [24] "Ubuntu 16.04.2 LTS"                         
## [25] "Ubuntu 16.10"                               
## [26] "Windows Server 2016 Datacenter"             
## [27] "Windows Server 2016 Standard"

There are 28 (including NA) different strings in just a tiny excerpt. Ugh.

If we want to group all Windows results as “Windows”, all Red Hat, CentOS and Fedora results as “Fedora”, all Ubuntu and Debian results as “Debian” and all CoreOS and Amazon results as “Amazon” while keeping NAs NA and lumping everything else as “Other” it’s super-easy with case_when():

ELSE <- TRUE

case_when(
  grepl("Windows", os) ~ "Windows-ish",
  grepl("Red Hat|CentOS|Fedora", os) ~ "Fedora-ish",
  grepl("Ubuntu|Debian", os) ~ "Debian-ish",
  grepl("CoreOS|Amazon", os) ~ "Amazon-ish",
  is.na(os) ~ "Unknown",
  ELSE ~ "Other"
) %>%
  table() %>%
  as_data_frame() %>%
  set_names(c("os", "Node Count")) %>%
  arrange(`Node Count`) %>%
  mutate(os = factor(os, os)) %>%
  ggplot(aes(`Node Count`, os)) +
  geom_lollipop(horizontal = TRUE, size=1.5, color="#54278f") +
  scale_x_comma(limits=c(0,300)) +
  labs(y=NULL, title="OS Types") +
  theme_ipsum_rc(grid="X")

The clever formula (~) syntax used by case_when() enables you to cleanly and effortlessly reduce factor/categorical levels and also lets you preserve NA values (which I translated to “Unknown”). Since ELSE is used in the SQL CASE statement and dplyr::case_when() is a riff on said SQL cousin, I like to use an assigned ELSE to make it more visually explicit, but using TRUE is just as good (and, perhaps, better since TRUE can’t get namespace clobbered like the ELSE variable can).
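Here is a stripped-down sketch of the TRUE fallback on a four-element vector, next to the nested ifelse() it replaces. It assumes dplyr is installed, and the os values are a made-up subset of the strings shown earlier.

```r
library(dplyr)

os <- c("Ubuntu 16.10", "Windows Server 2016 Standard", NA, "Arch Linux")

# Conditions are tried top to bottom; the first match wins and TRUE
# catches whatever is left.
with_case_when <- case_when(
  grepl("Windows", os)       ~ "Windows-ish",
  grepl("Ubuntu|Debian", os) ~ "Debian-ish",
  is.na(os)                  ~ "Unknown",
  TRUE                       ~ "Other"
)

# The same logic written as nested ifelse() calls.
with_ifelse <- ifelse(
  grepl("Windows", os), "Windows-ish",
  ifelse(grepl("Ubuntu|Debian", os), "Debian-ish",
         ifelse(is.na(os), "Unknown", "Other"))
)

identical(with_case_when, with_ifelse)
```

Both give the same answer, but adding a fifth or sixth bucket to the case_when() version is a one-line change rather than another level of nesting.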

FIN

If you’re in sequential or nested ifelse() Hades or are frustrated by switch() limitations, give dplyr::case_when() a try for your next project.

Epilogue

Not enough time earlier to add other methods, so this hint from @drob will have to suffice for now:

@hrbrmstr alternative: fuzzyjoin's regex_left_join? Advantages:

1. regexes can be in config file
2. one string can match multiple regexes pic.twitter.com/9IavUTEDMJ

— David Robinson (@drob) March 10, 2017

2017-01 Authored Package Updates

The rest of the month is going to be super-hectic and it’s unlikely I’ll be able to do any more to help the push to CRAN 10K, so here’s a breakdown of CRAN and GitHub new packages & package updates that I felt were worth raising awareness on:

epidata

I mentioned this one last week but it wasn’t really a package announcement post. epidata is now on CRAN and is a package to pull data from the Economic Policy Institute (U.S. gov economic data, mostly). Their “hidden” API is well thought out and the data has been nicely curated (and seems to update monthly). It makes it super easy to do things like the following:

library(epidata)
library(tidyverse)
library(stringi)
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")

us_unemp <- get_unemployment("e")

glimpse(us_unemp)
## Observations: 456
## Variables: 7
## $ date            <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-0...
## $ all             <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.05...
## $ less_than_hs    <dbl> 0.100, 0.100, 0.099, 0.099, 0.099, 0.099, 0.09...
## $ high_school     <dbl> 0.055, 0.055, 0.054, 0.054, 0.054, 0.053, 0.05...
## $ some_college    <dbl> 0.050, 0.050, 0.050, 0.049, 0.049, 0.049, 0.04...
## $ college         <dbl> 0.032, 0.031, 0.031, 0.030, 0.030, 0.029, 0.03...
## $ advanced_degree <dbl> 0.021, 0.020, 0.020, 0.020, 0.020, 0.020, 0.02...

us_unemp %>%
  gather(level, rate, -date) %>%
  mutate(level=stri_replace_all_fixed(level, "_", " ") %>%
           stri_trans_totitle() %>%
           stri_replace_all_regex(c("Hs$"), c("High School")),
         level=factor(level, levels=unique(level))) -> unemp_by_edu

col <- ggthemes::tableau_color_pal()(10)

ggplot(unemp_by_edu, aes(date, rate, group=level)) +
  geom_line(color=col[1]) +
  scale_y_continuous(labels=scales::percent, limits =c(0, 0.2)) +
  facet_wrap(~level, scales="free") +
  labs(x=NULL, y="Unemployment rate",
       title=sprintf("U.S. Monthly Unemployment Rate by Education Level (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="XY")

us_unemp %>%
  select(date, high_school, college) %>%
  mutate(date_num=as.numeric(date)) %>%
  ggplot(aes(x=high_school, xend=college, y=date_num, yend=date_num)) +
  geom_segment(size=0.125, color=col[1]) +
  scale_x_continuous(expand=c(0,0), labels=scales::percent, breaks=seq(0, 0.12, 0.02), limits=c(0, 0.125)) +
  scale_y_reverse(expand=c(0,100), labels=function(x) format(lubridate::as_date(x), "%Y")) + # as_date() lives in lubridate
  labs(x="Unemployment rate", y="Year ↓",
       title=sprintf("U.S. monthly unemployment rate gap (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       subtitle="Segment width shows the gap between those with a high school\ndegree and those with a college degree",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="X") +
  theme(panel.ontop=FALSE) +
  theme(panel.grid.major.x=element_line(size=0.2, color="#2b2b2b25")) +
  theme(axis.title.x=element_text(family="Arial", face="bold")) +
  theme(axis.title.y=element_text(family="Arial", face="bold", angle=0, hjust=1, margin=margin(r=-14)))

(right edge is high school, left edge is college…I’ll annotate it better next time)

censys

Censys is a search engine by one of the cybersecurity research partners we publish data to at work (free for use by all). The API is moderately decent (it’s mostly a thin shim authentication layer to pass on Google BigQuery query strings to the back-end) and the R package that interfaces with it, censys, is now on CRAN.

waffle

The seminal square pie chart package waffle has been updated on CRAN to work better with recent ggplot2 2.x changes and has some additional parameters you may want to check out.

cdcfluview

The viral package cdcfluview has had some updates on the GitHub version to add saner behaviour when specifying dates and had to be updated as the CDC hidden API switched to all https URLs (major push in .gov-land to do that to get better scores on their cyber report cards). I’ll be adding some features before the next CRAN push to enable retrieval of additional mortality data.

sergeant

If you work with Apache Drill (if you don’t, you should), the sergeant package (GitHub) will help you whip it into shape. I’ve mentioned it before on the blog but it has a nigh-complete dplyr interface now that works pretty well. It also has a direct REST API interface and RJDBC interface plus many helper utilities that help you avoid typing SQL strings to get cluster status info. Once I add the ability to create parquet files with it I’ll push it up to CRAN.

The one thing I’d like to do with this package is support any user-defined functions (UDFs in Drill-speak) folks have written. So, if you have a UDF you’ve written or use and you want it wrapped in the package, just drop an issue and I’ll layer it in. I’ll be releasing some open source cybersecurity-related UDFs via the work github in a few weeks.

zkcmd

Drill (in non-standalone mode) relies on Apache Zookeeper to keep everything in sync and it’s sometimes necessary to peek at what’s happening inside the zookeeper cluster, so sergeant has a sister package zkcmd that provides an R interface to zookeeper instances.

ggalt

Some helpful folks tweaked ggalt for better ggplot2 2.x compatibility (#ty!) and I added a new geom_cartogram() (before you ask if it makes warped shapefiles: it doesn’t) that restores the old (and what I believe to be the correct/sane/proper) behaviour of geom_map(). I need to get this on CRAN soon as it has both fixes and many new geoms folks will want to play with in a non-GitHub context.

FIN

There have been some awesome packages released by others in the past month+ and you should add R Weekly to your RSS feeds if you aren’t following it already (there are other things you should have there for R updates as well, but that’s for another blog). I’m definitely looking forward to new packages, visualizations, services and utilities that will be coming this year to the R community.

sergeant : An R Boot Camp for Apache Drill

2016-12-20 – 11:01
Posted in Apache Drill, dplyr, drill, R, SQL

I recently mentioned that I’ve been working on a development version of an Apache Drill R package called sergeant. Here’s a lifted “TLDR” on Drill:

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
Drill’s datastore-aware optimizer automatically restructures a query plan to leverage the datastore’s internal processing capabilities. In addition, Drill supports data locality, so it’s a good idea to co-locate Drill and the datastore on the same nodes.

It also supports reading formats such as:

Avro
[CTP]SV ([C]omma-, [T]ab-, [P]ipe-Separated-Values)
Parquet
Hadoop Sequence Files

It’s a bit like Spark in that you can run it on a single workstation and scale up to a YUGE cluster. It lacks the ML components of Spark, but it connects to everything without the need to define a schema up front. Said “everything” includes parquet files on local filesystems, so if you need to slice through GBs of parquet data and have a beefy enough Linux workstation (I believe Drill runs on Windows and know it runs on macOS fine, too, but that’s $$$ for a bucket of memory & disk space) you can take advantage of the optimized processing power Drill offers on a single system (while also joining the data with any other data format you can think of). You can also seamlessly move the data to a cluster and barely tweak your code to support said capacity expansion.

Why `sergeant`?

There’s already an R package on CRAN to work with Drill: DrillR. It’s S4 class-based, has a decent implementation and interfaces with the REST API. However, it sticks httr::verbose() everywhere: https://github.com/cran/DrillR/search?utf8=%E2%9C%93&q=verbose.

The sergeant package interfaces with the REST API as well, but also works with the JDBC driver (the dev version includes the driver with the package, but this will be removed for the eventual CRAN submission) and includes some other niceties around Drill options viewing and setting and some other non-SQL bits. Of note: the REST API version shows an httr progress bar for data downloading and you can wrap the calls with httr::with_verbose(…) if you really like seeing cURL messages.

The other thing sergeant has going for it is a nascent dplyr interface. Presently, this is a hack-ish wrapper around the RJDBC JDBCConnection presented by the Drill JDBC driver. While basic functionality works, I firmly believe Drill needs its own DBI driver (like its second-cousin Presto has) to avoid collisions with any other JDBC connections you might have open, plus more work needs to be done under the covers to deal with quoting properly and exposing more Drill built-in SQL functions.

SQL vs `dplyr`

For some truly complex data machinations you’re going to want to work at the SQL level and I think it’s important to know SQL if you’re ever going to do data work outside JSON & CSV files just to appreciate how much gnashing of teeth dplyr saves you from. Using SQL for many light-to-medium aggregation tasks that feed data to R can feel like you’re banging rocks together to make fire when you could just be using your R precision welder. What would you rather write:

SELECT gender, marital_status, COUNT(*) AS n
FROM cp.`employee.json`
GROUP BY gender, marital_status

in a drill-embedded or drill-localhost SQL shell? Or:

library(RJDBC)
library(dplyr)
library(sergeant)

ds <- src_drill("localhost:31010", use_zk=FALSE)

db <- tbl(ds, "cp.`employee.json`") 

count(db, gender, marital_status) %>% collect()

(NOTE: that SQL statement is what ultimately gets sent to Drill from dplyr)
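If you want to see the generated SQL without a running Drill instance, a minimal sketch using dbplyr’s simulated connection follows. This is an assumption-laden stand-in: lazy_frame() and simulate_dbi() come from dbplyr rather than sergeant, the table is a made-up placeholder, and the quoting and table name will differ from what the Drill backend emits.

```r
library(dplyr)
library(dbplyr)

# A lazy, zero-row table with the two columns used above.
# simulate_dbi() is a generic stand-in backend, not the Drill one.
employees <- lazy_frame(
  gender = character(),
  marital_status = character(),
  con = simulate_dbi()
)

# Build the query lazily and render it to SQL instead of collecting
generated_sql <- employees %>%
  count(gender, marital_status) %>%
  sql_render()

print(generated_sql)
```

On a real src, show_query() in place of sql_render() prints the statement that would be sent.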

Now, dplyr tbl_df idioms don’t translate 1:1 to all other src_es, but they are much easier on the eyes and more instructive in analysis code (and, I fully admit that said statement is more opinion than fact).

`sergeant` and `dplyr`

The src_drill() function uses the JDBC Drill driver and, hence, has an RJDBC dependency. The Presto folks (a “competing” offering to Drill) wrapped a DBI interface around their REST API to facilitate the use of dplyr idioms. I’m not sold on whether I’ll continue with a lightweight DBI wrapper using RJDBC or go the RPresto route, but for now the basic functionality works and changing the back-end implementation should not break anything (much).

You’ve said “parquet” a lot…

Yes. Yes, I have. Parquet is a “big data” compressed columnar storage format that is generally used in Hadoop shops. Parquet is different from ‘feather’ (‘feather’ is based on another Apache foundation project: Arrow). Arrow/feather is great for things that fit in memory. Parquet and the idioms that sit on top of it enable having large amounts of data available in a cluster for processing with Hadoop / Spark / Drill / Presto (etc). Parquet is great for storing all kinds of data, including log and event data which I have to work with quite a bit and it’s great being able to prototype on a single workstation then move code to hit a production cluster. Plus, it’s super-easy to, say, convert an entire, nested directory tree of daily JSON log files into parquet with Drill:

CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS
  SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts 
  FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`;
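If you need to repeat that conversion month after month, one option is to generate the statement from R. The helper below is a hypothetical sketch (json_to_parquet_sql is not part of sergeant); it only builds the SQL string, reusing the workspace names and paths from the statement above, and leaves sending it to Drill as a commented-out step that assumes a Drill instance reachable on localhost.

```r
# Hypothetical helper: build the CTAS statement for one year/month of logs.
# Workspaces and paths mirror the example above; adjust them for a real setup.
json_to_parquet_sql <- function(year, month) {
  yyyy <- sprintf("%04d", as.integer(year))
  mm   <- sprintf("%02d", as.integer(month))
  sprintf(
    paste(
      "CREATE TABLE dfs.destination.`source/%s/%s/%s_%s_source_event_logs.parquet` AS",
      "  SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts",
      "  FROM dfs.source.`/log/dir/root/%s/%s/*/event_log.json`",
      sep = "\n"
    ),
    yyyy, mm, yyyy, mm, yyyy, mm
  )
}

stmt <- json_to_parquet_sql(2016, 12)
cat(stmt, "\n")

# Sending it via the REST interface (assumes Drill is running locally):
# library(sergeant)
# drill_connection("localhost") %>% drill_query(stmt)
```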

Kick the tyres

The REST and JDBC functions are solid (I’ve been using them at work for a while) and the dplyr support has handled some preliminary production work well (though, remember, it’s not fully-baked). There are plenty of examples — including a dplyr::left_join() between parquet and JSON data — in the README and all the exposed functions have documentation.

File an issue with a feature request or bug report.

I expect to have this CRAN-able in January, 2017.