
Category Archives: R

Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a unique character value, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg

You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn’t think so. Thus begat the xattrs package.

Exploring Past Downloads

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let’s see what is hidden with the RStudio disk image:


library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse) # we'll need this later

list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"    
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"   

There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:


get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89

Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:


get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"

So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

There will eventually be a native Rust-backed property list reading package for R, but for now we can work with that binary plist data in two ways: via the read_bplist() function that comes with the xattrs package, which wraps Linux/BSD or macOS system utilities (an expensive route since it also means writing the data out to a file each time), or by turning to Python, which already has this capability. We’re going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python — don’t judge — installs and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.
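For reference, here's a minimal sketch of that priming step (the interpreter path is a placeholder; point it at whichever Python 3 has biplist installed):

# hypothetical interpreter path -- use whatever `which python3` reports on your system
Sys.setenv(RETICULATE_PYTHON = "/usr/local/bin/python3")

library(reticulate)

invisible(py_config()) # primes the embedded Python session and reports which interpreter was picked up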


biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/" 

That's much better.

Let's work with metadata for the whole directory:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
##    file            name                                  size contents   
##                                                     
##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16  
##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110 
##  3 1109.1968.pdf   com.apple.quarantine                    74  
##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16  
##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116 
##  6 1109.1968v1.pdf com.apple.quarantine                    74  
##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86  
##  8 1804.09970.pdf  com.apple.quarantine                    82  
##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16  
## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104 
## # ... with 14 more rows

count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
##   name                                     n
##                                   
## 1 com.apple.metadata:kMDItemWhereFroms     9
## 2 com.apple.quarantine                     9
## 3 com.apple.lastuseddate#PS                4
## 4 com.apple.diskimages.fsck                1
## 5 com.apple.diskimages.recentcksum         1

Now we can focus on the task at hand: recovering the URLs:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") %>% 
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% 
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>% 
  select(file, where_from) %>% 
  unnest() %>% 
  filter(!where_from == "")
## # A tibble: 15 x 2
##    file                where_from                                                       
##                                                                               
##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf                              
##  2 1109.1968.pdf       https://www.google.com/                                          
##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf                        
##  4 1109.1968v1.pdf     https://www.google.com/                                          
##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf                             
##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf           
##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf                             
##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf        
##  9 athena-ug.pdf       https://www.google.com/                                          
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz 
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/                    
## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4                      
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/             

(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.

FIN

Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
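If you want to try the tagging idea, a sketch might look like the following. I'm assuming setter/remover helpers along the lines of set_xattr(path, name, value) and rm_xattr(path, name); check the package README for the exact function names and signatures before relying on them:

library(xattrs)

f <- "~/Data/analysis-results.csv" # hypothetical file

# tag the file with some lightweight provenance metadata (assumed signature: path, name, value)
set_xattr(f, "is.rud.xattrs.source", "https://example.com/raw-data")

list_xattrs(f)

# ...and scrub an attribute you'd rather not keep around (assumed signature: path, name)
rm_xattr(f, "com.apple.quarantine")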

Regular readers will recall the “utility belt” post from back in April of this year. This is a follow-up to a request for a list of all the % infix functions in those files.

We’re going to:

  • collect up all of the sources
  • parse them
  • find all the definitions of % infix functions
  • write them to a file

We’ll start by grabbing the data from the previous post and look at it as a refresher:


library(stringi)
library(tidyverse)

utils <- read_rds(url("https://rud.is/dl/utility-belt.rds"))

utils
## # A tibble: 1,746 x 13
##    permsissions links owner     group      size month day   year_hr path   date       pkg   fil   file_src             
##  1 -rw-r--r--       0 hornik    users      1658 Jun   05    2016    AHR/R… 2016-06-05 AHR   util… "## \\int f(x)dg(x) …
##  2 -rw-r--r--       0 ligges    users     12609 Dec   13    2016    ALA4R… 2016-12-13 ALA4R util… "## some utility fun…
##  3 -rw-r--r--       0 hornik    users         0 Feb   24    2017    AWR.K… 2017-02-24 AWR.… util… ""                   
##  4 -rw-r--r--       0 ligges    users      4127 Aug   30    2017    Alpha… 2017-08-30 Alph… util… "#\n#' Assign API ke…
##  5 -rw-r--r--       0 ligges    users       121 Jan   19    2017    Amylo… 2017-01-19 Amyl… util… "make_decision <- fu…
##  6 -rw-r--r--       0 herbrandt herbrandt    52 Aug   10    2017    BANES… 2017-08-10 BANE… util… "#' @importFrom dply…
##  7 -rw-r--r--       0 ripley    users     36977 Jan   06    2015    BEQI2… 2015-01-06 BEQI2 util… "#' \tRemove Redunda…
##  8 -rw-r--r--       0 hornik    users     34198 May   10    2017    BGDat… 2017-05-10 BGDa… util… "# A more memory-eff…
##  9 -rwxr-xr-x       0 ligges    users      3676 Aug   14    2016    BGLR/… 2016-08-14 BGLR  util… "\n readBinMat=funct…
## 10 -rw-r--r--       0 ripley    users      2547 Feb   04    2015    BLCOP… 2015-02-04 BLCOP util… "###################…
## # ... with 1,736 more rows

Note that we somewhat expected the file source to potentially come in handy at a later date and also expected the need to revisit that post, so the R data file [←direct link to RDS] included a file_src column.

Now, let's find all the source files with at least one infix definition, collect them together and parse them so we can do more code spelunking:


filter(utils, stri_detect_fixed(file_src, "`%")) %>% # only find sources with infix definitions
  pull(file_src) %>%
  paste0(collapse="\n\n") %>%
  parse(text = ., keep.source=TRUE) -> infix_src

str(infix_src, 1)
## length 1364 expression(dplyr::`%>%`, `%||%` <- function(a, b) if (is.null(a)) b else a, get_pkg_path <- function(ctx) {  pkg_| __truncated__ ...
##  - attr(*, "srcref")=List of 1364
##  - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile'  
##  - attr(*, "wholeSrcref")= 'srcref' int [1:8] 1 0 15768 0 0 0 1 15768
##   ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' 

We can now take all of that lovely parsed source and tokenize it to work with the discrete elements in a very tidy manner:


infix_parsed <- tbl_df(getParseData(infix_src)) # tbl_df() is mainly for pretty printing 

infix_parsed
## # A tibble: 118,242 x 9
##    line1  col1 line2  col2    id parent token          terminal text      
##  1     1     1     1    24     1    -10 COMMENT        TRUE     #' @impor…
##  2     2     1     2    10     4    -10 COMMENT        TRUE     #' @export
##  3     3     1     3    12    10      0 expr           FALSE    ""        
##  4     3     1     3     5     7     10 SYMBOL_PACKAGE TRUE     dplyr     
##  5     3     6     3     7     8     10 NS_GET         TRUE     ::        
##  6     3     8     3    12     9     10 SYMBOL         TRUE     `%>%`     
##  7     5     1     5    49    51      0 expr           FALSE    ""        
##  8     5     1     5     6    16     18 SYMBOL         TRUE     `%||%`    
##  9     5     1     5     6    18     51 expr           FALSE    ""        
## 10     5     8     5     9    17     51 LEFT_ASSIGN    TRUE     <-        
## # ... with 118,232 more rows

We just need to find a sequence of tokens that make up a function definition, then whittle those down to ones that look like our % infix names:


pat <- c("SYMBOL", "expr", "LEFT_ASSIGN", "expr", "FUNCTION") # pattern for function definition

# find all of ^^ sequences (there's a good twitter discussion on this abt a month ago)
idx <- which(infix_parsed$token == pat[1]) # find location of match of start of seq

# look for the rest of the sequences starting at each idx position
map_lgl(idx, ~{
  all(infix_parsed$token[.x:(.x+(length(pat)-1))] == pat)
}) -> found

f_defs <- idx[found] # starting indices of all the places where functions are defined

# filter ^^ to only find infix ones
infix_defs <- f_defs[stri_detect_regex(infix_parsed$text[f_defs], "^`\\%")]

# there aren't too many, but remember we're just searching `util` functions
length(infix_defs)
## [1] 106

Now, write it out to a file so we can peruse the infix functions:


# nuke a file and fill it with the function definition
cat("", sep="", file="infix_functions.R")
walk2(
  getParseText(infix_parsed, infix_parsed$id[infix_defs]),     # extract the infix name
  getParseText(infix_parsed, infix_parsed$id[infix_defs + 3]), # extract the function definition body
  ~{
    cat(.x, " <- ", .y, "\n\n", sep="", file="infix_functions.R", append=TRUE)
  }
)

There are 106 of them so you can find the extracted ones in this gist.

Here's an overview of what you can expect to find:

# A tibble: 39 x 2
   name                 n
 1 `%||%`              47
 2 `%+%`                7
 3 `%AND%`              4
 4 `%notin%`            4
 5 `%:::%`              3
 6 `%==%`               3
 7 `%!=%`               2
 8 `%*diag%`            2
 9 `%diag*%`            2
10 `%nin%`              2
11 `%OR%`               2
12 `%::%`               1
13 `%??%`               1
14 `%.%`                1
15 `%@%`                1
16 `%&&%`               1
17 `%&%`                1
18 `%+&%`               1
19 `%++%`               1
20 `%+|%`               1
21 `%<<%`               1
22 `%>>%`               1
23 `%~~%`               1
24 `%assert_class%`     1
25 `%contains%`         1
26 `%din%`              1
27 `%fin%`              1
28 `%identical%`        1
29 `%In%`               1
30 `%inr%`              1
31 `%M%`                1
32 `%notchin%`          1
33 `%or%`               1
34 `%p%`                1
35 `%pin%`              1
36 `%R%`                1
37 `%s%`                1
38 `%sub_in%`           1
39 `%sub_nin%`          1

FIN

If any of those are useful, feel free to PR them in to https://github.com/hrbrmstr/freebase/blob/master/inst/templates/infix-helpers.R (and add yourself to the DESCRIPTION if you do).

Hopefully this provided some further inspiration to continue to use R not only as your language of choice but also as a fun data source.

I’m going to (eventually) write a full post on the package I’m mentioning in this one: osqueryr. The TLDR on osqueryr is that it is an R DBI wrapper (that has just enough glue to also be plugged into dbplyr) for osquery. The TLDR on osquery is that it “exposes an operating system as a high-performance relational database. This design allows you to write SQL-based queries efficiently and easily to explore operating systems.”

In short, osquery turns the metadata and state information of your local system (or remote system(s)) into a SQL-compliant database. It also works on Windows, Linux, BSD and macOS. This means you can query a fleet of systems with a (mostly) normalized set of tables and get aggregated results. Operations and information security staff use this to manage systems and perform incident response tasks, but you can use it to get just about anything and there are even more powerful modes of operation for osquery. But, more on all the features of osquery[r] in another post.

If you are skeptical, here’s some proof (which I need to show regardless of your skepticism state). First, a local “connection”:


library(DBI)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

head(dbListTables(con), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"          
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths" 
##  [7] "alf_services"        "app_schemes"         "apps"               
## [10] "apt_sources"

dbListFields(con, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"   
##  [4] "disk_bytes_written" "egid"               "euid"              
##  [7] "gid"                "name"               "nice"              
## [10] "on_disk"            "parent"             "path"              
## [13] "pgroup"             "pid"                "resident_size"     
## [16] "root"               "sgid"               "start_time"        
## [19] "state"              "suid"               "system_time"       
## [22] "threads"            "total_size"         "uid"               
## [25] "user_time"          "wired_size"

dbGetQuery(con, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 2 x 2
##   name     system_time
## 1 Firewall 3          
## 2 firefox  517846

then, a remote "connection":


con2 <- osqueryr::dbConnect(Osquery(), host = "hrbrmstr@osq1")

head(dbListTables(con2), 10)
##  [1] "account_policy_data" "acpi_tables"         "ad_config"          
##  [4] "alf"                 "alf_exceptions"      "alf_explicit_auths" 
##  [7] "alf_services"        "app_schemes"         "apps"               
## [10] "apt_sources"

dbListFields(con2, "processes")
##  [1] "cmdline"            "cwd"                "disk_bytes_read"   
##  [4] "disk_bytes_written" "egid"               "euid"              
##  [7] "gid"                "name"               "nice"              
## [10] "on_disk"            "parent"             "path"              
## [13] "pgroup"             "pid"                "resident_size"     
## [16] "root"               "sgid"               "start_time"        
## [19] "state"              "suid"               "system_time"       
## [22] "threads"            "total_size"         "uid"               
## [25] "user_time"          "wired_size"

dbGetQuery(con2, "SELECT name, system_time FROM processes WHERE name LIKE '%fire%'")
## # A tibble: 1 x 2
##   name    system_time
## 1 firefox 1071992

"You're talking an awful lot about the package when you said this was a post on 'standards' and 'consistency'."

True, but we needed that bit above for context. To explain what this post has to do with "standards" and "consistency" I also need to tell you a bit more about how both osquery and the osqueryr package are implemented.

You can read about osquery in-depth starting at the link at the top of this post, but the authors of the tool really wanted a consistent idiom for accessing system metadata with usable, normalized output. They chose (to use a word they didn't but one that works for an R audience) a "data frame" as the output format and picked the universal language of "data frames" -- SQL -- as the inquiry interface. So, right there are examples of both standards and consistency: using SQL vs coming up with yet-another-query-language and avoiding the chaos of the myriad of outputs from various system commands by making all results conform to a rectangular data structure.

Let's take this one-step further with a specific example. All modern operating systems have the concept of a "process" and said processes have (mostly) similar attributes. However, the commands used to get a detailed listing of those processes differ (sometimes wildly) from OS to OS. The authors of osquery came up with a set of schemas to ensure a common, rectangular output and naming conventions (note that some schemas are unique to a particular OS since some elements of operating systems have no useful counterparts on other operating systems).

The osquery authors also took consistency and standards to yet-another-level by taking advantage of a feature of SQLite called virtual tables. That enables them to have C/C++/Objective-C "glue" that gets called when a query is made so they can dispatch the intent to the proper functions or shell commands and then send all the results back -- or -- use the SQLite engine capabilities to do joining, filtering, UDF-calling, etc. to produce rich, targeted rectangular output.
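Since the SQL really is handled by SQLite, you can hand osquery a join and let the engine do the filtering. Here's a small sketch (the query itself is mine, not from the package docs) that reuses the con connection from above and previews the processes/listening_ports join we'll do with dplyr verbs further down:

dbGetQuery(
  con,
  "SELECT p.name, p.pid, l.address, l.port
     FROM processes p
     JOIN listening_ports l ON p.pid = l.pid
    WHERE l.port != ''
    LIMIT 5"
)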

By not reinventing the wheel and relying on well-accepted features like data frames, SQL and SQLite the authors could direct all their focus on solving the problem they posited.

"Um, you're talking alot about everything but R now."

We're getting to the good (i.e. "R") part now.

Because the authors didn't try to become SQL parser writer experts and relied on the standard SQL offerings of SQLite, the queries made are "real" SQL (if you've worked with more than one database engine, you know how they all implement different flavours of SQL).

Because these queries are "real" SQL, we can write an R DBI driver for it. The DBI package aims "[to define] a common interface between R and database management systems (DBMS). The interface defines a small set of classes and methods similar in spirit to Perl's DBI, Java's JDBC, Python's DB-API, and Microsoft's ODBC. It defines what operations are possible and how they are performed."

If you look at the osqueryr package source, you'll see a bunch of DBI boilerplate code (which is in the r-dbi organization example code) and only a handful of "touch points" for the actual calls to osqueryi (the command that processes SQL). No handling of anything but passing on SQL to the osqueryi engine and getting rectangular results back. By abstracting the system call details, R users can work with a familiar, consistent, standard interface and have full access to the power of osquery without firing up a terminal.

But it gets even better.

As noted above, one design aspect of osquery was to enable remote usage. Rather than come up with yet-another-daemon-and-custom-protocol, the osquery authors suggest ssh as one way of invoking the command on remote systems and getting the rectangular results back.

Because the osqueryr package used the sys package for making local system calls, there was only a tiny bit of extra effort required to switch from sys::exec_internal() to a sibling call in the ssh package -- ssh::ssh_exec_internal() -- when remote connections were specified. (Said effort could have been zero if I chose a slightly different function in sys, too.)
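Conceptually, that dispatch boils down to something like the sketch below. To be clear, this is not the osqueryr package's actual internals, just an illustration of the idea; it assumes osqueryi is on the PATH of the target machine and that its --json output mode is available:

library(sys)
library(ssh)
library(jsonlite)

# conceptual sketch of the local-vs-remote dispatch, not the osqueryr package's real code
run_osquery <- function(sql, host = NULL) {

  if (is.null(host)) {
    # local: shell out to osqueryi and ask for JSON results back
    res <- sys::exec_internal("osqueryi", c("--json", sql))
  } else {
    # remote: same command, just executed over ssh (naive quoting; fine for a sketch)
    session <- ssh::ssh_connect(host)
    on.exit(ssh::ssh_disconnect(session), add = TRUE)
    res <- ssh::ssh_exec_internal(session, command = sprintf("osqueryi --json '%s'", sql))
  }

  jsonlite::fromJSON(rawToChar(res$stdout))

}

# run_osquery("SELECT name, pid FROM processes LIMIT 5")
# run_osquery("SELECT name, pid FROM processes LIMIT 5", host = "user@somehost")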

Relying on well-accepted standards made both osqueryi and the R DBI-driver work seamlessly without much code at all and definitely without a rat's nest of nested if/else statements and custom httr functions.

But it gets even more better-er

Some folks like & grok SQL, others don't. (Humans have preferences, go figure.)

A few years ago, Hadley (do I even need to use his last name at this point in time?) came up with the idea to have a more expressive and consistent way to work with data frames. We now know this as the tidyverse but one core element of the tidyverse is dplyr, which can really level-up your data frame game (no comments about data.table, or the beauty of base R, please). Not too long after the birth of dplyr came the ability to work with remote, rectangular, SQL-based data sources with (mostly) the same idioms.

And, not too long after that, the remote dplyr interface (now, dbplyr) got up close and personal with DBI. Which ultimately means that if you make a near-fully-compliant DBI interface to a SQL back-end you can now do something like this:


library(DBI)
library(dplyr)
library(osqueryr)

con <- DBI::dbConnect(Osquery())

osqdb <- src_dbi(con)

procs <- tbl(osqdb, "processes")
listen <- tbl(osqdb, "listening_ports")

left_join(procs, listen, by="pid") %>%
  filter(port != "", protocol == "17") %>% # 17 == TCP
  distinct(name, port, address, pid)
## # Source:   lazy query [?? x 4]
## # Database: OsqueryConnection
##    address name              pid   port 
##  1 0.0.0.0 BetterTouchTool   46317 57183
##  2 0.0.0.0 Dropbox           1214  17500
##  3 0.0.0.0 SystemUIServer    429   0    
##  4 0.0.0.0 SystemUIServer    429   62240
##  5 0.0.0.0 UserEventAgent    336   0    
##  6 0.0.0.0 WiFiAgent         493   0    
##  7 0.0.0.0 WiFiProxy         725   0    
##  8 0.0.0.0 com.docker.vpnkit 732   0    
##  9 0.0.0.0 identityservicesd 354   0    
## 10 0.0.0.0 loginwindow       111   0    
## # ... with more rows

The src_dbi() call wires up everything for us because d[b]plyr can rely on DBI doing its standard & consistent job and DBI can rely on the SQLite processing crunchy goodness of osqueryi to ultimately get us a list of really dangerous (if not firewalled off) processes that are listening on all network interfaces. (Note to self: find out why the BetterTouchTool and Dropbox authors feel the need to bind to 0.0.0.0)

FIN

What did standards and consistency get us?

  • The osquery authors spent time solving a hard problem vs creating new data formats and protocols
  • Rectangular data (i.e. "data frame") provides consistency and structure which ends up causing more freedom
  • "Standard" SQL enables a consistent means to work with rectangular data
  • ssh normalizes (secure) access across systems with a consistent protocol
  • A robust, well-defined standard mechanism for working with SQL databases enabled nigh instantaneous wiring up of a whole new back-end to R
  • Common ssh and sys idioms made working with the new back-end on remote systems as easy as it is on a local system
  • Another robust, well-defined modern mechanism for working with rectangular data got wired up to this new back-end with (pretty much) one line of code because of the defined standard and expectation of consistency (and works for local and remote)

Standards and consistency are pretty darned cool.

The cloc Perl script (yes, Perl!) by Al Danial (https://github.com/AlDanial/cloc) has been one of the go-to tools for generating code metrics. Given a single file, directory tree, archive, or git repo, cloc can speedily give you metrics on the count of blank lines, comment lines, and physical lines of source code in a vast array of programming languages.

I don’t remember the full context but someone in the R community asked about this type of functionality and I had tossed together a small script-turned-package to thinly wrap the Perl cloc utility. Said package was and is unimaginatively named cloc. Thanks to some collaborative input from @ma_salmon, the package gained more features. Recently I added the ability to process R markdown (Rmd) files (i.e. only count lines in code chunks) to the main cloc Perl script and was performing some general cleanup when the idea to create some RStudio addins hit me.

cloc Basics

As noted, you can cloc just about anything. Here are some metrics for dplyr::group_by:


cloc("https://raw.githubusercontent.com/tidyverse/dplyr/master/R/group-by.r")
## # A tibble: 1 x 10
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 group… R                 1             1.    44      1.          13             1.           110               1.

and, here’s a similar set of metrics for the whole dplyr package:


cloc_cran("dplyr")
## # A tibble: 7 x 11
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 dplyr… R               148        0.454   13216 0.442          2671       0.380             3876          0.673  
## 2 dplyr… C/C++ H…        125        0.383    6687 0.223          1836       0.261              267          0.0464 
## 3 dplyr… C++              33        0.101    4724 0.158           915       0.130              336          0.0583 
## 4 dplyr… HTML             11        0.0337   3602 0.120           367       0.0522              11          0.00191
## 5 dplyr… Markdown          2        0.00613  1251 0.0418          619       0.0880               0          0.     
## 6 dplyr… Rmd               6        0.0184    421 0.0141          622       0.0884            1270          0.220  
## 7 dplyr… C                 1        0.00307    30 0.00100           7       0.000995             0          0.     
## # ... with 1 more variable: pkg 

We can also measure (in bulk) from afar, such as measuring the dplyr git repo:


cloc_git("git://github.com/tidyverse/dplyr.git")
## # A tibble: 12 x 10
##    source    language     file_count file_count_pct   loc  loc_pct blank_lines blank_line_pct comment_lines
##  1 dplyr.git HTML                108        0.236   21467 0.335           3829       0.270             1114
##  2 dplyr.git R                   156        0.341   13648 0.213           2682       0.189             3736
##  3 dplyr.git Markdown             12        0.0263  10100 0.158           3012       0.212                0
##  4 dplyr.git C/C++ Header        126        0.276    6891 0.107           1883       0.133              271
##  5 dplyr.git CSS                   2        0.00438  5684 0.0887          1009       0.0711              39
##  6 dplyr.git C++                  33        0.0722   5267 0.0821          1056       0.0744             393
##  7 dplyr.git Rmd                   7        0.0153    447 0.00697          647       0.0456            1309
##  8 dplyr.git XML                   1        0.00219   291 0.00454            0       0.                   0
##  9 dplyr.git YAML                  6        0.0131    212 0.00331           35       0.00247             12
## 10 dplyr.git JavaScript            2        0.00438    44 0.000686          10       0.000705             4
## 11 dplyr.git Bourne Shell          3        0.00656    34 0.000530          15       0.00106             10
## 12 dplyr.git C                     1        0.00219    30 0.000468           7       0.000493             0
## # ... with 1 more variable: comment_line_pct

All in on Addins

The Rmd functionality made me realize that some interactive capabilities might be handy, so I threw together three of them.

Two of them handle extraction of code chunks from Rmd documents. One uses cloc, the other uses knitr::purl() (h/t @yoniceedee). The knitr one adds in some very nice functionality if you want to preserve chunk options and have “eval=FALSE” chunks commented out.
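If you just want the chunk extraction without the addin, the underlying knitr capability the second addin builds on looks roughly like this (a sketch; the file names are placeholders and the addin layers the chunk-option/eval=FALSE handling on top):

library(knitr)

# documentation = 2 keeps the full chunk headers (and thus the chunk options) as comments
knitr::purl("analysis.Rmd", output = "analysis.R", documentation = 2)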

The final one will gather up code metrics for all the sources in an active project.

FIN

If you’d like additional features or want to contribute, give (https://github.com/hrbrmstr/cloc) a visit and drop an issue or PR.

The D.C. Universe magic hero Zatanna used spells (i.e. incantations) to battle foes and said spells were just sentences said backwards, hence the mixed up jumble in the title. But, now I’m regretting not naming the package zatanna and reversing the function names to help ensure they’re only used deliberately & carefully. You’ll see why in a bit.

Just like their ore-seeking speleological counterparts, workers in our modern-day data mines process a multitude of mineralic data formats to achieve our goals of world domination (er, finding meaning, insight & solutions to hard problems).

Two formats in particular are common occurrences in many of our $DAYJOBs: XML and JSON. The rest of this (hopefully short-ish) post is going to assume you have at least a passing familiarity with — if not full-on battle scars from working with — them.

XML and JSON are, in many ways, very similar. This similarity is on purpose since JSON was originally created to make it easier to process data in browsers and help make data more human-readable. If your $DAYJOB involves processing small or large streams of nested data, you likely prefer JSON over XML.

There are times, though, that — even if one generally works with only JSON data — one comes across a need to ingest some XML and turn it into JSON. This was the case for a question-poser on Stack Overflow this week (I won’t point-shill with a direct link but it’ll be easy to find if you are interested in this latest SODD package).

Rather than take on the potentially painful task of performing the XML to JSON transformation on their own, the OP wished for a simple incantation to transform the entirety of the incoming XML into JSON.

We’ll switch comic universes for a moment to issue a warning that all magic comes with a price. And, the cost for automatic XML<->JSON conversion can be quite high. XML has namespaces, attributes, tags and values, and requires schemas to convey data types and help validate content structure. JSON has no attributes, implicitly conveys types and is generally schema-less (though folks have bolted on that concept).

If one is going to use magic for automatic data conversion there must be rules (no, not those kind of Magic rules), otherwise how various aspects of XML become encoded into JSON (and the reverse) will generate inconsistency and may even result in significant data corruption. Generally speaking, you are always better off writing your own conversion utility vs relying on specific settings in a general conversion script/function. However, if your need is a one-off (which anyone who has been doing this type of work for a while knows is also generally never the case) you may have cause to throw caution to the wind, get your quick data fix, and move on. If that is the case, the blackmagic package may be of use to you.
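To see why those rules matter, compare the compact and non-compact encodings of even a trivial document. This is a sketch that assumes xml_to_json() will take a literal XML string as well as a URL (the parameters are the same ones used later in this post):

library(blackmagic)

doc <- '<record id="1">RESERVED</record>'

# compact mode folds attributes and text content into special keys...
blackmagic::xml_to_json(doc, compact = TRUE, ignoreDeclaration = TRUE)

# ...while non-compact mode produces a verbose element/attribute tree,
# so the "same" XML yields two structurally different JSON documents
blackmagic::xml_to_json(doc, compact = FALSE, ignoreDeclaration = TRUE)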

gnitrevnoC eht ANAI sserddA ecapS yrtsigeR ot NOSJ

One file that’s in XML that I only occasionally have to process is the IANA IPv4 Address Space Registry. If you visited that link you may have uttered “Hey! That’s not XML it’s HTML!”, to wit — I would respond — “Well, HTML is really XML anyway, but use the View Source, Luke! and see that it is indeed XML with some clever XSL style sheet processing being applied in-browser to make the gosh awful XML human readable.”.

With blackmagic we can make quick work of converting this monstrosity into JSON.

The blackmagic package itself uses even darker magic to accomplish its goals. The package is just a thin V8 wrapper around the xml-js JavaScript library. Because of this, it is recommended that you do not try to process gigabytes of XML with it as there is a round trip of data marshalling between R and the embedded V8 engine.


requireNamespace("jsonlite") # jsonlite::flatten clobbers purrr::flatten in the wrong order so I generally fully-qualify what I need
## Loading required namespace: jsonlite
library(xml2)
library(blackmagic) # devtools::install_github("hrbrmstr/blackmagic")
library(purrr)
requireNamespace("dplyr") # I'm going to fully qualify use of dplyr:data_frame() below
## Loading required namespace: dplyr

You can thank @yoniceedee for the URL processing capability in blackmagic:


source_url <- "https://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xml"

iana_json <- blackmagic::xml_to_json(source_url)

# NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console
cat(substr(iana_json, 1800, 2300))
## me":"prefix","elements":[{"type":"text","text":"000/8"}]},{"type":"element","name":"designation","elements":[{"type":"text","text":"IANA - Local Identification"}]},{"type":"element","name":"date","elements":[{"type":"text","text":"1981-09"}]},{"type":"element","name":"status","elements":[{"type":"text","text":"RESERVED"}]},{"type":"element","name":"xref","attributes":{"type":"note","data":"2"}}]},{"type":"element","name":"record","elements":[{"type":"element","name":"prefix","elements":[{"type":"

By the hoary hosts of Hoggoth, that's not very "human readable"! And, it looks super-verbose. Thankfully, Yousuf Almarzooqi knew we'd want to fine-tune the output and we can use those options to make this a bit better:


blackmagic::xml_to_json(
  doc = source_url, 
  spaces = 2,                # Number of spaces to be used for indenting XML output
  compact = FALSE,           # Whether to produce detailed object or compact object
  ignoreDeclaration = TRUE   # No declaration property will be generated.
) -> iana_json

# NOTE: cat the whole iana_json locally to see it — perhaps to file="..." vs clutter your console
cat(substr(iana_json, 3000, 3300))
## pe": "element",
##               "name": "prefix",
##               "elements": [
##                 {
##                   "type": "text",
##                   "text": "000/8"
##                 }
##               ]
##             },
##             {
##               "type": "element",
##               "name": "designation",
## 

One "plus side" for doing the mass-conversion is that we don't really need to do much processing to have it be "usable" data in R:


blackmagic::xml_to_json(
  doc = source_url, 
  compact = FALSE,        
  ignoreDeclaration = TRUE
) -> iana_json

# NOTE: consider taking some more time to explore this monstrosity than we do here
str(processed <- jsonlite::fromJSON(iana_json), 3)
## List of 1
##  $ elements:'data.frame':    3 obs. of  5 variables:
##   ..$ type       : chr [1:3] "instruction" "instruction" "element"
##   ..$ name       : chr [1:3] "xml-stylesheet" "oxygen" "registry"
##   ..$ instruction: chr [1:3] "type=\"text/xsl\" href=\"ipv4-address-space.xsl\"" "RNGSchema=\"ipv4-address-space.rng\" type=\"xml\"" NA
##   ..$ attributes :'data.frame':  3 obs. of  2 variables:
##   .. ..$ xmlns: chr [1:3] NA NA "http://www.iana.org/assignments"
##   .. ..$ id   : chr [1:3] NA NA "ipv4-address-space"
##   ..$ elements   :List of 3
##   .. ..$ : NULL
##   .. ..$ : NULL
##   .. ..$ :'data.frame':  280 obs. of  4 variables:

compact(processed$elements$elements[[3]]$elements) %>% 
  head(6) %>% 
  str(3) 
## List of 6
##  $ :'data.frame':    1 obs. of  2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "IANA IPv4 Address Space Registry"
##  $ :'data.frame':    1 obs. of  2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "Internet Protocol version 4 (IPv4) Address Space"
##  $ :'data.frame':    1 obs. of  2 variables:
##   ..$ type: chr "text"
##   ..$ text: chr "2018-04-23"
##  $ :'data.frame':    3 obs. of  4 variables:
##   ..$ type      : chr [1:3] "text" "element" "text"
##   ..$ text      : chr [1:3] "Allocations to RIRs are made in line with the Global Policy published at " NA ". \nAll other assignments require IETF Review."
##   ..$ name      : chr [1:3] NA "xref" NA
##   ..$ attributes:'data.frame':   3 obs. of  2 variables:
##   .. ..$ type: chr [1:3] NA "uri" NA
##   .. ..$ data: chr [1:3] NA "http://www.icann.org/en/resources/policy/global-addressing" NA
##  $ :'data.frame':    3 obs. of  4 variables:
##   ..$ type      : chr [1:3] "text" "element" "text"
##   ..$ text      : chr [1:3] "The allocation of Internet Protocol version 4 (IPv4) address space to various registries is listed\nhere. Origi"| __truncated__ NA " documents most of these allocations."
##   ..$ name      : chr [1:3] NA "xref" NA
##   ..$ attributes:'data.frame':   3 obs. of  2 variables:
##   .. ..$ type: chr [1:3] NA "rfc" NA
##   .. ..$ data: chr [1:3] NA "rfc1466" NA
##  $ :'data.frame':    5 obs. of  4 variables:
##   ..$ type      : chr [1:5] "element" "element" "element" "element" ...
##   ..$ name      : chr [1:5] "prefix" "designation" "date" "status" ...
##   ..$ elements  :List of 5
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ :'data.frame':  1 obs. of  2 variables:
##   .. ..$ : NULL
##   ..$ attributes:'data.frame':   5 obs. of  2 variables:
##   .. ..$ type: chr [1:5] NA NA NA NA ...
##   .. ..$ data: chr [1:5] NA NA NA NA ...

As noted previously, all magic comes with a price and we just traded XML processing for some gnarly list processing. This isn't the case for all XML files and you can try to tweak the parameters to xml_to_json() to make the output more usable (NOTE: key name transformation parameters still need to be implemented in the package), but this seems a whole lot easier (to me):


doc <- read_xml(source_url)

xml_ns_strip(doc)

dplyr::data_frame(
  
  prefix = xml_find_all(doc, ".//record/prefix") %>% xml_text(),
  
  designation = xml_find_all(doc, ".//record/designation") %>% xml_text(),
  
  date = xml_find_all(doc, ".//record/date") %>% 
    xml_text() %>% 
    sprintf("%s-01", .) %>% 
    as.Date(),
  
  whois = xml_find_all(doc, ".//record") %>% 
    map(xml_find_first, "./whois") %>% 
    map_chr(xml_text),
  
  status = xml_find_all(doc, ".//record/status") %>% xml_text()
  
)
## # A tibble: 256 x 5
##    prefix designation                     date       whois        status  
##                                                 
##  1 000/8  IANA - Local Identification     1981-09-01          RESERVED
##  2 001/8  APNIC                           2010-01-01 whois.apnic… ALLOCAT…
##  3 002/8  RIPE NCC                        2009-09-01 whois.ripe.… ALLOCAT…
##  4 003/8  Administered by ARIN            1994-05-01 whois.arin.… LEGACY  
##  5 004/8  Level 3 Parent, LLC             1992-12-01 whois.arin.… LEGACY  
##  6 005/8  RIPE NCC                        2010-11-01 whois.ripe.… ALLOCAT…
##  7 006/8  Army Information Systems Center 1994-02-01 whois.arin.… LEGACY  
##  8 007/8  Administered by ARIN            1995-04-01 whois.arin.… LEGACY  
##  9 008/8  Administered by ARIN            1992-12-01 whois.arin.… LEGACY  
## 10 009/8  Administered by ARIN            1992-08-01 whois.arin.… LEGACY  
## # ... with 246 more rows

NIF

xml_to_json() has a sibling function --- json_to_xml() for the reverse operation and you're invited to fill in the missing parameters with a PR as there is a fairly consistent and straightforward way to do that. Note that a small parameter tweak can radically change the output, which is one of the aforementioned potentially costly pitfalls of this automagic conversion.

Before using either function, seriously consider taking the time to write a dedicated, small package that exposes a function or two to perform the necessary conversions.
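As a rough illustration of that advice, the xml2-based extraction shown above could live behind one small, dedicated function (a sketch; the function name and the choice of fields to keep are mine):

library(xml2)
library(purrr)

# hypothetical helper: a dedicated IANA IPv4 registry -> data frame conversion
read_iana_ipv4_registry <- function(url = "https://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xml") {

  doc <- read_xml(url)
  xml_ns_strip(doc)

  recs <- xml_find_all(doc, ".//record")

  dplyr::data_frame(
    prefix      = map_chr(recs, ~xml_text(xml_find_first(.x, "./prefix"))),
    designation = map_chr(recs, ~xml_text(xml_find_first(.x, "./designation"))),
    status      = map_chr(recs, ~xml_text(xml_find_first(.x, "./status")))
  )

}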

The U.S. FBI Internet Crime Complaint Center was established in 2000 to receive complaints of Internet crime. They produce an annual report, have just released the 2017 edition, and I need the data from it. Since I have to wrangle it out, I thought some folks might like to play along at home, especially since it turns out I had to use both tabulizer and pdftools to accomplish my goal.

Concepts presented:

  • PDF scraping (with both tabulizer and pdftools)
  • asciiruler
  • general string manipulation
  • case_when() vs ifelse() for text cleanup
  • reformatting data for ggraph treemaps

Let’s get started!


library(stringi)
library(pdftools)
library(tabulizer)
library(igraph)
library(ggraph) # devtools::install_github("thomasp85/ggraph")
library(hrbrthemes)
library(tidyverse)

ic3_file <- "~/Data/2017-ic3-report.pdf" # change "~/Data" for your system

if (!file.exists(ic3_file)) { # don't waste anyone's bandwidth
  download.file("https://pdf.ic3.gov/2017_IC3Report.pdf", ic3_file)
}

Let's try pdftools since I like text wrangling


cat(pdftools::pdf_text(ic3_file)[[20]])
##                                                             2017 Internet Crime Report         20
## 2017 Crime Types
##                                  By Victim Count
## Crime Type                         Victims     Crime Type                            Victims
## Non-Payment/Non-Delivery           84,079      Misrepresentation                        5,437
## Personal Data Breach               30,904      Corporate Data Breach                    3,785
## Phishing/Vishing/Smishing/Pharming 25,344      Investment                               3,089
## Overpayment                        23,135      Malware/Scareware/Virus                  3,089
## No Lead Value                      20,241      Lottery/Sweepstakes                      3,012
## Identity Theft                     17,636      IPR/Copyright and                        2,644
##                                                Counterfeit
## Advanced Fee                       16,368      Ransomware                               1,783
## Harassment/Threats of Violence     16,194      Crimes Against Children                  1,300
## Employment                         15,784      Denial of Service/TDoS                   1,201
## BEC/EAC                            15,690      Civil Matter                             1,057
## Confidence Fraud/Romance           15,372      Re-shipping                              1,025
## Credit Card Fraud                  15,220      Charity                                    436
## Extortion                          14,938      Health Care Related                        406
## Other                              14,023      Gambling                                   203
## Tech Support                       10,949      Terrorism                                  177
## Real Estate/Rental                  9,645      Hacktivist                                 158
## Government Impersonation            9,149
## Descriptors*
## Social Media                       19,986      *These descriptors relate to the medium or
## Virtual Currency                    4,139      tool used to facilitate the crime, and are used
##                                                by the IC3 for tracking purposes only. They
##                                                are available only after another crime type
##                                                has been selected.

OK, I don't like text wrangling that much. How about tabulizer?


tabulizer::extract_tables(ic3_file, pages = 20)

## list()

Well, that's disappointing. Perhaps if we target the tables on the PDF pages we'll have better luck. You can find them on pages 20 and 21 if you downloaded your own copy.

I can't show the tabulizer pane (well I could if I had time to screen capture and make an animated gif) but run this to get the areas:


areas <- tabulizer::locate_areas(ic3_file, pages = 20:21)

# this is what ^^ produces for my rectangles:

list(
  c(top = 137.11911357341, left = 66.864265927978, bottom = 413.5512465374, right = 519.90581717452),
  c(top = 134.92520775623, left = 64.670360110803, bottom = 458.52631578947, right = 529.7783933518)
) -> areas

Now, see if tabulizer can do a better job. We'll start with the first page:


tab <- tabulizer::extract_tables(ic3_file, pages = 20, area = areas[1])

tab
## [[1]]
##       [,1]                                 [,2]              
##  [1,] ""                                   "By Victim Cou nt"
##  [2,] "Crime Type"                         "Victims"         
##  [3,] "Non-Payment/Non-Delivery"           "84,079"          
##  [4,] "Personal Data Breach"               "30,904"          
##  [5,] "Phishing/Vishing/Smishing/Pharming" "25,344"          
##  [6,] "Overpayment"                        "23,135"          
##  [7,] "No Lead Value"                      "20,241"          
##  [8,] "Identity Theft"                     "17,636"          
##  [9,] ""                                   ""                
## [10,] "Advanced Fee"                       "16,368"          
## [11,] "Harassment/Threats of Violence"     "16,194"          
## [12,] "Employment"                         "15,784"          
## [13,] "BEC/EAC"                            "15,690"          
## [14,] "Confidence Fraud/Romance"           "15,372"          
## [15,] "Credit Card Fraud"                  "15,220"          
## [16,] "Extortion"                          "14,938"          
## [17,] "Other"                              "14,023"          
## [18,] "Tech Support"                       "10,949"          
## [19,] "Real Estate/Rental"                 "9,645"           
## [20,] "G overnment Impersonation"          "9,149"           
## [21,] ""                                   ""                
## [22,] "Descriptors*"                       ""                
##       [,3]                      [,4]     
##  [1,] ""                        ""       
##  [2,] "Crime Type"              "Victims"
##  [3,] "Misrepresentation"       "5,437"  
##  [4,] "Corporate Data Breach"   "3,785"  
##  [5,] "Investment"              "3,089"  
##  [6,] "Malware/Scareware/Virus" "3,089"  
##  [7,] "Lottery/Sweepstakes"     "3,012"  
##  [8,] "IPR/Copyright and"       "2,644"  
##  [9,] "Counterfeit"             ""       
## [10,] "Ransomware"              "1,783"  
## [11,] "Crimes Against Children" "1,300"  
## [12,] "Denial of Service/TDoS"  "1,201"  
## [13,] "Civil Matter"            "1,057"  
## [14,] "Re-shipping"             "1,025"  
## [15,] "Charity"                 "436"    
## [16,] "Health Care Related"     "406"    
## [17,] "Gambling"                "203"    
## [18,] "Terrorism"               "177"    
## [19,] "Hacktivist"              "158"    
## [20,] ""                        ""       
## [21,] ""                        ""       
## [22,] ""                        ""

Looking good. How does it look data-frame'd?


tab <- as_data_frame(tab[[1]])

print(tab, n=50)
## # A tibble: 22 x 4
##    V1                                 V2               V3            V4   
##  1 ""                                 By Victim Cou nt ""            ""   
##  2 Crime Type                         Victims          Crime Type    Vict…
##  3 Non-Payment/Non-Delivery           84,079           Misrepresent… 5,437
##  4 Personal Data Breach               30,904           Corporate Da… 3,785
##  5 Phishing/Vishing/Smishing/Pharming 25,344           Investment    3,089
##  6 Overpayment                        23,135           Malware/Scar… 3,089
##  7 No Lead Value                      20,241           Lottery/Swee… 3,012
##  8 Identity Theft                     17,636           IPR/Copyrigh… 2,644
##  9 ""                                 ""               Counterfeit   ""   
## 10 Advanced Fee                       16,368           Ransomware    1,783
## 11 Harassment/Threats of Violence     16,194           Crimes Again… 1,300
## 12 Employment                         15,784           Denial of Se… 1,201
## 13 BEC/EAC                            15,690           Civil Matter  1,057
## 14 Confidence Fraud/Romance           15,372           Re-shipping   1,025
## 15 Credit Card Fraud                  15,220           Charity       436  
## 16 Extortion                          14,938           Health Care … 406  
## 17 Other                              14,023           Gambling      203  
## 18 Tech Support                       10,949           Terrorism     177  
## 19 Real Estate/Rental                 9,645            Hacktivist    158  
## 20 G overnment Impersonation          9,149            ""            ""   
## 21 ""                                 ""               ""            ""   
## 22 Descriptors*                       ""               ""            ""

Still pretty good. Cleaning it up is pretty simple from here. Just filter out some rows, parse some numbers, fix some chopped labels and boom - done:


tab <- filter(tab[3:21,], !V2 == "")

bind_rows(
  select(tab, crime = V1, n_victims = V2),
  select(tab, crime = V3, n_victims = V4)
) %>%
  filter(crime != "") %>%
  mutate(n_victims = readr::parse_number(n_victims)) %>%
  mutate(crime = case_when(
    stri_detect_fixed(crime, "G o") ~ "Government Impersonation",
    stri_detect_fixed(crime, "IPR/C") ~ "IPR/Copyright and Counterfeit",
    TRUE ~ crime
  )) %>%
  print(n=50) -> ic3_2017_crimes_victim_ct
## # A tibble: 33 x 2
##    crime                              n_victims
##                                      
##  1 Non-Payment/Non-Delivery              84079.
##  2 Personal Data Breach                  30904.
##  3 Phishing/Vishing/Smishing/Pharming    25344.
##  4 Overpayment                           23135.
##  5 No Lead Value                         20241.
##  6 Identity Theft                        17636.
##  7 Advanced Fee                          16368.
##  8 Harassment/Threats of Violence        16194.
##  9 Employment                            15784.
## 10 BEC/EAC                               15690.
## 11 Confidence Fraud/Romance              15372.
## 12 Credit Card Fraud                     15220.
## 13 Extortion                             14938.
## 14 Other                                 14023.
## 15 Tech Support                          10949.
## 16 Real Estate/Rental                     9645.
## 17 Government Impersonation               9149.
## 18 Misrepresentation                      5437.
## 19 Corporate Data Breach                  3785.
## 20 Investment                             3089.
## 21 Malware/Scareware/Virus                3089.
## 22 Lottery/Sweepstakes                    3012.
## 23 IPR/Copyright and Counterfeit          2644.
## 24 Ransomware                             1783.
## 25 Crimes Against Children                1300.
## 26 Denial of Service/TDoS                 1201.
## 27 Civil Matter                           1057.
## 28 Re-shipping                            1025.
## 29 Charity                                 436.
## 30 Health Care Related                     406.
## 31 Gambling                                203.
## 32 Terrorism                               177.
## 33 Hacktivist                              158.

Now, on to page 2!


tab <- tabulizer::extract_tables(ic3_file, pages = 21, area = areas[2])

tab
## [[1]]
##       [,1]                         [,2]                                
##  [1,] ""                           "By Victim Lo ss"                   
##  [2,] "Crime Type"                 "Loss  Crime Type"                  
##  [3,] "BEC/EAC"                    "$676,151,185 Misrepresentation"    
##  [4,] "Confidence Fraud/Romance"   "$211,382,989 Harassment/Threats"   
##  [5,] ""                           "of Violence"                       
##  [6,] "Non-Payment/Non-Delivery"   "$141,110,441 Government"           
##  [7,] ""                           "Impersonation"                     
##  [8,] "Investment"                 "$96,844,144 Civil Matter"          
##  [9,] "Personal Data Breach"       "$77,134,865 IPR/Copyright and"     
## [10,] ""                           "Counterfeit"                       
## [11,] "Identity Theft"             "$66,815,298 Malware/Scareware/"    
## [12,] ""                           "Virus"                             
## [13,] "Corporate Data Breach"      "$60,942,306 Ransomware"            
## [14,] "Advanced Fee"               "$57,861,324 Denial of Service/TDoS"
## [15,] "Credit Card Fraud"          "$57,207,248 Charity"               
## [16,] "Real Estate/Rental"         "$56,231,333 Health Care Related"   
## [17,] "Overpayment"                "$53,450,830 Re-Shipping"           
## [18,] "Employment"                 "$38,883,616 Gambling"              
## [19,] "Phishing/Vishing/Smishing/" "$29,703,421 Crimes Against"        
## [20,] "Pharming"                   "Children"                          
## [21,] "Other"                      "$23,853,704 Hacktivist"            
## [22,] "Lottery/Sweepstakes"        "$16,835,001 Terrorism"             
## [23,] "Extortion"                  "$15,302,792 N o Lead Value"        
## [24,] "Tech Support"               "$14,810,080"                       
## [25,] ""                           ""                                  
## [26,] ""                           ""                                  
##       [,3]          
##  [1,] ""            
##  [2,] "Loss"        
##  [3,] "$14,580,907" 
##  [4,] "$12,569,185" 
##  [5,] ""            
##  [6,] "$12,467,380" 
##  [7,] ""            
##  [8,] "$5,766,550"  
##  [9,] "$5,536,912"  
## [10,] ""            
## [11,] "$5,003,434"  
## [12,] ""            
## [13,] "$2,344,365"  
## [14,] "$1,466,195"  
## [15,] "$1,405,460"  
## [16,] "$925,849"    
## [17,] "$809,746"    
## [18,] "$598,853"    
## [19,] "$46,411"     
## [20,] ""            
## [21,] "$20,147"     
## [22,] "$18,926"     
## [23,] "$0"          
## [24,] ""            
## [25,] ""            
## [26,] "Descriptors*"

:facepalm: That's disappointing. Way too much scrambled content. So, back to pdftools!


cat(pg21 <- pdftools::pdf_text(ic3_file)[[21]])
##                                                    Internet Crime Complaint Center         21
## 2017 Crime Types Continued
##                             By Victim Loss
## Crime Type                 Loss            Crime Type                      Loss
## BEC/EAC                    $676,151,185    Misrepresentation               $14,580,907
## Confidence Fraud/Romance   $211,382,989    Harassment/Threats              $12,569,185
##                                            of Violence
## Non-Payment/Non-Delivery   $141,110,441    Government                      $12,467,380
##                                            Impersonation
## Investment                  $96,844,144    Civil Matter                      $5,766,550
## Personal Data Breach        $77,134,865    IPR/Copyright and                 $5,536,912
##                                            Counterfeit
## Identity Theft              $66,815,298    Malware/Scareware/                $5,003,434
##                                            Virus
## Corporate Data Breach       $60,942,306    Ransomware                        $2,344,365
## Advanced Fee                $57,861,324    Denial of Service/TDoS            $1,466,195
## Credit Card Fraud           $57,207,248    Charity                           $1,405,460
## Real Estate/Rental          $56,231,333    Health Care Related                 $925,849
## Overpayment                 $53,450,830    Re-Shipping                         $809,746
## Employment                  $38,883,616    Gambling                            $598,853
## Phishing/Vishing/Smishing/  $29,703,421    Crimes Against                        $46,411
## Pharming                                   Children
## Other                       $23,853,704    Hacktivist                            $20,147
## Lottery/Sweepstakes         $16,835,001    Terrorism                             $18,926
## Extortion                   $15,302,792    No Lead Value                                $0
## Tech Support                $14,810,080
##                                                                            Descriptors*
## Social Media                $56,478,483    *These descriptors relate to the medium or
## Virtual Currency            $58,391,810    tool used to facilitate the crime, and are used
##                                            by the IC3 for tracking purposes only. They
##                                            are available only after another crime type
##                                            has been selected.

This is (truthfully) not too bad: just make columns from substring ranges and do some cleanup. The asciiruler package can definitely help here since it makes it much easier to see start/stop points (I used a new editor pane and copied some lines into it).
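
If you don't have asciiruler handy, a quick base-R stand-in for eyeballing the fixed-width column boundaries might look like this (the [5] index just grabs the first data row — the BEC/EAC line — from the page text shown above):


ruler <- paste0(rep(c(1:9, 0), 8), collapse = "") # 80-character "1234567890" ruler
cat(ruler, stri_split_lines(pg21)[[1]][5], sep = "\n")

With the start/stop points identified, the extraction itself: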


stri_split_lines(pg21)[[1]] %>%
  .[-(1:4)] %>% # remove header & bits above header
  .[-(26:30)] %>% # remove trailing bits
  map_df(~{
    list(
      crime = stri_trim_both(c(stri_sub(.x, 1, 25), stri_sub(.x, 43, 73))),
      cost = stri_trim_both(c(stri_sub(.x, 27, 39), stri_sub(.x, 74))) # no length/to in the last one so it goes until eol
    )
  }) %>%
  filter(!(crime == "" | cost == "")) %>% # get rid of blank rows
  mutate(cost = suppressWarnings(readr::parse_number(cost))) %>% # we can use NAs generated to remove non-data rows
  filter(!is.na(cost)) %>%
  mutate(crime = case_when(
    stri_detect_fixed(crime, "Phish") ~ "Phishing/Vishing/Smishing/Pharming",
    stri_detect_fixed(crime, "Malware") ~ "Malware/Scareware/Virus",
    stri_detect_fixed(crime, "IPR") ~ "IPR/Copyright and Counterfeit",
    stri_detect_fixed(crime, "Harassment") ~ "Harassment/Threats of Violence",
    TRUE ~ crime
  )) %>%
  print(n=50) -> ic3_2017_crimes_cost
## # A tibble: 35 x 2
##    crime                                    cost
##    <chr>                                   <dbl>
##  1 BEC/EAC                            676151185.
##  2 Misrepresentation                   14580907.
##  3 Confidence Fraud/Romance           211382989.
##  4 Harassment/Threats of Violence      12569185.
##  5 Non-Payment/Non-Delivery           141110441.
##  6 Government                          12467380.
##  7 Investment                          96844144.
##  8 Civil Matter                         5766550.
##  9 Personal Data Breach                77134865.
## 10 IPR/Copyright and Counterfeit        5536912.
## 11 Identity Theft                      66815298.
## 12 Malware/Scareware/Virus              5003434.
## 13 Corporate Data Breach               60942306.
## 14 Ransomware                           2344365.
## 15 Advanced Fee                        57861324.
## 16 Denial of Service/TDoS               1466195.
## 17 Credit Card Fraud                   57207248.
## 18 Charity                              1405460.
## 19 Real Estate/Rental                  56231333.
## 20 Health Care Related                   925849.
## 21 Overpayment                         53450830.
## 22 Re-Shipping                           809746.
## 23 Employment                          38883616.
## 24 Gambling                              598853.
## 25 Phishing/Vishing/Smishing/Pharming  29703421.
## 26 Crimes Against                         46411.
## 27 Other                               23853704.
## 28 Hacktivist                             20147.
## 29 Lottery/Sweepstakes                 16835001.
## 30 Terrorism                              18926.
## 31 Extortion                           15302792.
## 32 No Lead Value                              0.
## 33 Tech Support                        14810080.
## 34 Social Media                        56478483.
## 35 Virtual Currency                    58391810.

Now that we have real data, we can take a look at the IC3 crimes by loss and victims.

We'll use treemaps first, then take a quick look at the relationship between counts and losses.

Just need to do some data wrangling for ggraph, starting with victims (the treemap layout wants a rooted hierarchy, which is why a synthetic ROOT node gets added below):


ic3_2017_crimes_victim_ct %>%
  mutate(crime = case_when(
    crime == "Government Impersonation" ~ "Government\nImpersonation",
    crime == "Corporate Data Breach" ~ "Corporate\nData\nBreach",
    TRUE ~ crime
  )) %>%
  mutate(crime = stri_replace_all_fixed(crime, "/", "/\n")) %>%
  mutate(grp = "ROOT") %>%
  add_row(grp = "ROOT", crime="ROOT", n_victims=0) %>%
  select(grp, crime, n_victims) %>%
  arrange(desc(n_victims)) -> gdf

select(gdf, -grp) %>%
  mutate(lab = sprintf("%s\n(%s)", crime, scales::comma(n_victims))) %>%
  mutate(lab = ifelse(n_victims > 1300, lab, "")) %>% # don't show a label when blocks are small
  mutate(lab_col = ifelse(n_victims > 40000, "#2b2b2b", "#cccccc")) -> vdf # change up colors when blocks are lighter

g <- graph_from_data_frame(gdf, vertices=vdf)

ggraph(g, "treemap", weight=n_victims) +
  geom_node_tile(aes(fill=n_victims), size=0.25) +
  geom_text(
    aes(x, y, label=lab, size=n_victims, color = I(lab_col)),
    family=font_ps, lineheight=0.875
  ) +
  scale_x_reverse(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  scale_size_continuous(trans = "sqrt", range = c(0.5, 8)) +
  labs(title = "FBI 2017 Internet Crime Report — Crimes By Victim Count") +
  ggraph::theme_graph(base_family = font_ps) +
  theme(plot.title = element_text(color="#cccccc", family = "IBMPlexSans-Bold")) +
  theme(panel.background = element_rect(fill="black")) +
  theme(plot.background = element_rect(fill="black")) +
  theme(legend.position="none")


Now, do the same for losses:

ic3_2017_crimes_cost %>%
  mutate(crime = case_when(
    crime == "Phishing/Vishing/Smishing/Pharming" ~ "Phishing/Vishing/\nSmishing/Pharming",
    crime == "Harassment/Threats of Violence" ~ "Harassment/\nThreats of Violence",
    crime == "Lottery/Sweepstakes" ~ "Lottery\nSweepstakes",
    TRUE ~ crime
  )) %>%
  filter(cost > 0) %>%
  mutate(cost = cost / 1000000) %>%
  mutate(grp = "ROOT") %>%
  add_row(grp = "ROOT", crime="ROOT", cost=0) %>%
  select(grp, crime, cost) %>%
  arrange(desc(cost)) -> gdf_cost

select(gdf_cost, -grp) %>%
  mutate(lab = sprintf("%s\n($%s M)", crime, prettyNum(cost, digits=2))) %>%
  mutate(lab = ifelse(cost > 5.6, lab, "")) %>%
  mutate(lab_col = ifelse(cost > 600, "#2b2b2b", "#cccccc")) -> vdf_cost

g_cost <- graph_from_data_frame(gdf_cost, vertices=vdf_cost)

ggraph(g_cost, "treemap", weight=cost) +
  geom_node_tile(aes(fill=cost), size=0.25) +
  geom_text(
    aes(x, y, label=lab, size=cost, color=I(lab_col)),
    family=font_ps, lineheight=0.875
  ) +
  scale_x_reverse(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  scale_size_continuous(trans = "sqrt", range = c(0.5, 8)) +
  labs(title = "FBI 2017 Internet Crime Report — Crime Loss By Category") +
  ggraph::theme_graph(base_family = font_ps) +
  theme(plot.title = element_text(color="#cccccc", family = "IBMPlexSans-Bold")) +
  theme(panel.background = element_rect(fill="black")) +
  theme(plot.background = element_rect(fill="black")) +
  theme(legend.position="none")

Let's plot victim counts vs losses to see what stands out:


left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>%
  filter(!is.na(cost)) %>%
  ggplot(aes(n_victims, cost)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) +
  scale_x_comma() +
  scale_y_continuous(labels=scales::dollar) +
  labs(
    x = "# of Victims", y = "Loss magnitude",
    title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category"
  ) +
  theme_ipsum_ps(grid="XY")

BEC == "Business email compromise" and it's definitely a major problem, but those two count/loss outliers are not helping us see the rest of the data. Let's zoom in:


left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>%
  filter(!is.na(cost)) %>%
  filter(cost < 300000000) %>%
  filter(n_victims < 40000) %>%
  ggplot(aes(n_victims, cost)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) +
  scale_x_comma() +
  scale_y_continuous(labels=scales::dollar) +
  labs(
    x = "# of Victims", y = "Loss magnitude",
    title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category",
    subtitle = "NOTE: BEC/EAC and Non-payment/Non-delivery removed"
  ) +
  theme_ipsum_ps(grid="XY")

Better, but let's zoom in a bit more:


left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>%
  filter(!is.na(cost)) %>%
  filter(cost < 50000000) %>%
  filter(n_victims < 10000) %>%
  ggplot(aes(n_victims, cost)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) +
  scale_x_comma() +
  scale_y_continuous(labels=scales::dollar) +
  labs(
    x = "# of Victims", y = "Loss magnitude",
    title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category",
    subtitle = "NOTE: Only includes losses between $0-50 M USD & victim counts <= 10,000 "
  ) +
  theme_ipsum_ps(grid="XY")

Looks like the ransomware folks have quite a bit of catching up to do (at least when it comes to crimes reported to the IC3).

Earlier today, @noamross posted a question to Twitter.

The answer was a 1:1 “file upload” curl-to-httr translation:

httr::POST(
  url = "https://file.io",
  encode = "multipart",
  body = list(file = httr::upload_file("/some/path/to/file"))
)

but I wanted to do more than that since Noam took 20 minutes out of his day this week (with no advance warning) to speak to my intro-to-stats class about his work and R.

The Twitter request was (ultimately) a question on how to use R to post content to https://file.io. They have a really simple API, and the timespan from Noam’s request to the initial commit of a fully functional package was roughly 17 minutes. The end product included the ability to post files, strings and R data (something that seemed like a good thing to add).
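
For the curious, here's a minimal sketch of what the file-posting core of such a wrapper might look like. This is not the actual package source and the function name is made up, but the response fields match what file.io returns (see the example output further down):


fi_post_file_sketch <- function(path) {
  res <- httr::POST(
    url = "https://file.io",
    encode = "multipart",
    body = list(file = httr::upload_file(path))
  )
  httr::stop_for_status(res)
  # file.io answers with JSON containing success, key, link & expiry
  httr::content(res, as = "parsed")
}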

Not too long after came a v0.1.0 release complete with tests and passing CRAN checks on all platforms.

Noam also suggested I do a screencast.

I don’t normally do screencasts, but I had some conference call time, so I recorded one folks can follow along with at home.

That’s not the best screencast in the world, but it’s very representative of the workflow I used. A great deal of the boilerplate package machinations are accomplished with this bash script.

I wasn’t happy with the hurried function name choices I made, nor was I thrilled with the package title, description, tests and basic docs, so I revamped all of those for another release. That took a while, mostly due to constantly triggering the API’s rate-limit warnings.

So, if you have a file of 5 GB or less, a character vector, or in-memory R data you’d like to ephemerally share with others, take the fileio package for a spin:

devtools::install_github("hrbrmstr/fileio")

fileio::fi_post_text("TWFrZSBzdXJlIHRvIEAgbWUgb24gVHdpdHRlciBpZiB5b3UgZGVjb2RlIHRoaXM=")
##   success    key                   link  expiry
## 1    TRUE n18ZSB https://file.io/n18ZSB 14 days

(bonus points if you can figure out what that seemingly random sequence of characters says).

I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis package, which connects R up to Amazon Athena via RJDBC. I have to deal with Java separately from R anyway, so I’m comfortable with Java, JDBC and keeping R working alongside Java. I notified the Twitterverse about the update and it started a thread.

If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc package with the ODBC driver for Athena.

I, and others, have noted that ODBC and macOS (and, for me at least, Linux) never really played well together. Given that I’m familiar with JDBC, I just gravitated towards it after trying it out with raw Java, and it worked fine in R.

Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver, installed it, and wired up an odbc + dplyr connection almost instantly:

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib", 
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AuthenticationType = "Default Credentials",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

some_tbl <- tbl(con, "elb_logs")

some_tbl
## # Source:   table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    timestamp    elbname requestip  requestport backendip backendport
##    <chr>        <chr>   <chr>            <int> <chr>           <int>
##  1 2014-09-26T… lb-demo 249.6.80.…        5123 249.6.80…        8888
##  2 2014-09-26T… lb-demo 246.22.15…        5123 248.178.…        8888
##  3 2014-09-26T… lb-demo 248.179.3…       45667 254.70.2…         443
##  4 2014-09-26T… lb-demo 243.2.127…       14496 248.178.…          80
##  5 2014-09-26T… lb-demo 247.76.18…        6887 252.0.81…        8888
##  6 2014-09-26T… lb-demo 254.110.3…       22052 248.178.…        8888
##  7 2014-09-26T… lb-demo 249.113.2…       24902 245.241.…        8888
##  8 2014-09-26T… lb-demo 246.128.7…        5123 244.202.…        8888
##  9 2014-09-26T… lb-demo 249.6.80.…       24902 255.226.…        8888
## 10 2014-09-26T… lb-demo 253.102.6…        6887 246.22.1…        8888
## # ... with more rows, and 10 more variables:
## #   requestprocessingtime <dbl>, backendprocessingtime <dbl>,
## #   clientresponsetime <dbl>, elbresponsecode <chr>,
## #   backendresponsecode <chr>, receivedbytes <S3: integer64>,
## #   sentbytes <S3: integer64>, requestverb <chr>, url <chr>,
## #   protocol <chr>

The TL;DR is that I can now use 100% dplyr idioms with Athena rather than having to bolt that support onto the RJDBC driver I made for metis. The metis package will still be around to support JDBC on systems that do have issues with ODBC, and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).
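
As a purely illustrative example (the column names come from the elb_logs preview above), the pipeline below gets translated to SQL and executed by Athena, with collect() pulling only the summarised result back into R:


some_tbl %>%
  filter(elbresponsecode == "200") %>%  # filtering happens remotely in Athena
  count(requestverb, sort = TRUE) %>%   # so does the grouping & counting
  collect()                             # only the small summary comes back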

The downside is that I’m now even more likely to run up the AWS bill ;-)

What About Drill?

I also maintain the sergeant package, which provides REST API and REST query access to Apache Drill along with a REST API DBI driver and an RJDBC interface for Drill. I remember trying to get the MapR ODBC client working with R a few years ago, so I made the package (which was also a great learning experience).

I noticed that a very recent MapR Drill ODBC driver had been released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
  ConnectionType = "Zookeeper",
  AuthenticationType = "No Authentication",
  ZKClusterID = "CLUSTERID",
  ZKQuorum = "HOST:2181",
  AdvancedProperties = paste0( # keep the property string on one logical line
    "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;",
    "TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys,INFORMATION_SCHEMA;",
    "NumberOfPrefetchBuffers=5;"
  )
) -> drill_con

(employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`")))
## # Source:   SQL [?? x 16]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
##    employee_id   full_name    first_name last_name position_id   position_title   store_id  
##    <S3: integer> <chr>        <chr>      <chr>     <S3: integer> <chr>            <S3: inte>
##  1 1             Sheri Nowmer Sheri      Nowmer    1             President        0         
##  2 2             Derrick Whe… Derrick    Whelply   2             VP Country Mana… 0         
##  3 4             Michael Spe… Michael    Spence    2             VP Country Mana… 0         
##  4 5             Maya Gutier… Maya       Gutierrez 2             VP Country Mana… 0         
##  5 6             Roberta Dam… Roberta    Damstra   3             VP Information … 0         
##  6 7             Rebecca Kan… Rebecca    Kanagaki  4             VP Human Resour… 0         
##  7 8             Kim Brunner  Kim        Brunner   11            Store Manager    9         
##  8 9             Brenda Blum… Brenda     Blumberg  11            Store Manager    21        
##  9 10            Darren Stanz Darren     Stanz     5             VP Finance       0         
## 10 11            Jonathan Mu… Jonathan   Murraiin  11            Store Manager    1         
## # ... with more rows, and 9 more variables: department_id <S3: integer64>, birth_date <chr>,
## #   hire_date <chr>, salary <dbl>, supervisor_id <S3: integer64>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

count(employee, position_title, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
##    position_title            n              
##    <chr>                     <S3: integer64>
##  1 Store Temporary Checker   268            
##  2 Store Temporary Stocker   264            
##  3 Store Permanent Checker   226            
##  4 Store Permanent Stocker   222            
##  5 Store Shift Supervisor    52             
##  6 Store Permanent Butcher   32             
##  7 Store Manager             24             
##  8 Store Assistant Manager   24             
##  9 Store Information Systems 16             
## 10 HQ Finance and Accounting 8              
## # ... with more rows

Apart from having to do that sql(…) to make the table connection work, it was pretty painless and I had both Athena and Drill working with dplyr verbs in under ten minutes (total).

You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters, and I’ve updated my ongoing Using Apache Drill with R e-book to include this information. I will keep maintaining the existing sergeant package and will also be adding some methods to provide ODBC usage guidance, plus potentially other helpers if any “gotchas” arise.

FIN

The odbc package is super-slick and it’s refreshing to be able to use dplyr verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary, as they are far more optimized than what would likely get pieced together via the dplyr verbs. Still, those queries can also be put right into sql() with the Athena ODBC driver connection and used via the same dplyr-verb magic afterwards.
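
Something along these lines (the query itself is just a made-up illustration; the pattern is the same tbl(con, sql(...)) approach used for Drill above):


tbl(con, sql("
  SELECT elbname, requestverb, COUNT(*) AS n
  FROM elb_logs
  GROUP BY elbname, requestverb
")) %>%
  arrange(desc(n)) %>%  # still lazy; translated into an ORDER BY
  collect()             # run the hand-crafted query + verbs in Athena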

Today is, indeed, a good day to query!