
Category Archives: R

The in-dev htmlunit package (for javascript-“enabled” web-scraping without the need for Selenium, Splash or headless Chrome) relies on the HtmlUnit library, and said library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript-“enabled” sites. The Chrome emulation is now also on par with the Chrome 72 series (my Chrome beta is at 73.0.3683.56 so it’s super close to very current).

In reality, the update was to the htmlunitjars package where the main project JAR and dependent JARs all received a refresh.

The README and tests were all re-run on both packages and Travis is happy.

If you’ve got a working rJava installation (aye, it’s 2019 and that’s still “a thing”) then you can just do:

install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is/")

to get them installed and start playing with the DSL or work directly with the Java classes.

FIN

As usual, use your preferred social coding site to log feature requests or problems.

I’ve got a work-in-progress drat-ified CRAN-like repo for (eventually) all my packages over at CINC🔗 (“CINC is not CRAN” and it also sounds like “sync”). This is in parallel with a co-location/migration of all my packages to SourceHut (just waiting for the sr.ht alpha API to be baked) and a self-hosted public Gitea instance. Everything will still be on that legacy social coding site y’all use but the ultimate goal is to have all installs be possible via the CINC repository (i.e. install.packages()) or via a remotes::install_git() install from this standalone or any social coding site.

I’ll eventually publish the workflow but the idea is to customize a pkgdown YAML file in each package repo so the navbar has links back to CINC and other pages (this will take some time as I seem to have made a lot of little packages over the years) and then to add a package to the CINC repo:

The above processes helped shine a light on some bad README practices I’ve had and also on how to make it a bit easier (in the future) to install C[++]-backed packages. Speaking of READMEs, I also need to get all the READMEs updated to use either install.packages() from CINC or a remotes install from Gitea.

Another couple of goals are to possibly get binary package versions added (though that’s going to be an interesting orchestration exercise) and see if I can’t get some notary🔗 concepts implemented.

It’s actually been a fun mini-project since the drat part is as simple as drat::insertPackage('PKG', '/path/to/cinc') (#ty Dirk!), though I need to think through some logic around maintaining Archive versions and also deleting packages, which drat doesn’t do yet but which is also as simple as removing tarballs and running tools::write_PACKAGES().
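The package-deletion step just described might look something like the following in base R. This is only a sketch: `remove_from_repo()` is a hypothetical helper (not part of drat), and it assumes the usual CRAN-like layout where source tarballs live under src/contrib.

```r
# Hypothetical helper: drop every source tarball for `pkg` from a
# CRAN-like repo rooted at `repo`, then rebuild the PACKAGES index.
remove_from_repo <- function(repo, pkg) {
  contrib <- file.path(repo, "src", "contrib")

  # Source tarballs are named <pkg>_<version>.tar.gz
  tarballs <- list.files(contrib, pattern = "\\.tar\\.gz$")
  doomed <- tarballs[startsWith(tarballs, paste0(pkg, "_"))]
  doomed <- file.path(contrib, doomed)

  removed <- file.remove(doomed)

  # Regenerate the index files from whatever tarballs are left
  tools::write_PACKAGES(contrib, type = "source")

  invisible(doomed[removed])
}
```

Archiving rather than deleting would be the same idea with file.rename() into an Archive directory in place of file.remove().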

As an aside, I also drat-ified all our $WORK packages and made that repo work-internally-accessible via static S3 web hosting. At $0.023 USD per GB (per-month) for just hosting the objects and $0.0004 USD per 1,000 GET requests (plus minimal setup charges for SSL) it’s super cheap and also super-easy to maintain. Drop a note in the comments if you’re interested in more details of the S3 drat setup.
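To make those S3 numbers concrete, here is a back-of-the-envelope estimate using the two prices quoted above. The 2 GB repo size and 50,000 monthly GET requests are made-up inputs for illustration, and the estimate ignores data transfer and the SSL setup charges.

```r
# Rough monthly cost from the storage and GET prices quoted in the text
s3_monthly_cost <- function(gb_stored, get_requests,
                            per_gb = 0.023, per_1k_gets = 0.0004) {
  gb_stored * per_gb + (get_requests / 1000) * per_1k_gets
}

s3_monthly_cost(gb_stored = 2, get_requests = 50000)
## [1] 0.066
```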

FIN

After a few more weeks’ baking period, the self-hosted Gitea and CINC sites will have all non-error web-logging disabled and error logs won’t save IP addresses or referrers (I welcome anyone who wants to third-party audit the nginx configs) since another goal is also to help folks not be a product for tech startups or giant, soulless, global multi-national companies with a history of being horrendously evil.

Be on the lookout for a full writeup with code in the coming weeks.

P.S.

For Safari-users on 10.14+ I’ve made some tweaks to the “batman mode” version of the site. If you do use Safari (but…why?!) and have any issues with readability in “dark mode” just drop a note in the comments and I’ll see what I can do.

There are two fledgling rJava-based R packages that enable working with the AWS SDK for Athena:

They’re both needed to conform with the way CRAN likes rJava-based packages to be submitted when they also have large JAR dependencies. The goal is to eventually have wrappers for anything R folks need under the AWS Java SDK menu.

All package pairs will eventually cohabitate under the Cloudy R Project once each gets to 90% API coverage, passes CRAN checks and has passing Travis checks.

One thing I did get working right up front was the asynchronous dplyr chain query execution collect_async(), so if you need that and would rather not use reticulated wrappers, now’s your chance.

You would be correct in assuming this is an offshoot of the recent work on updating metis. My primary impetus for this is to remove the reticulate dependency from our Dockerized production setups but I also have discovered I like the Java libraries more than the boto3-based ones (not really a shocker there if you know my views on Python). As a result I should be able to quickly wrap most any library you may need (see below).

FIN

The next major wrapper coming is S3 (there are bits of it implemented in awsathena now but that’s temporary) and — for now — you can toss a comment here or file an issue in any of the social coding sites you like for priority wrapping of other AWS Java SDK libraries. Also, if you want some experience working with rJava packages in a judgement-free zone, drop a note into these or any new AWS rJava-based package repos and I’ll gladly walk you through your first PR.

Those outside the Colonies may not know that Payless—a national chain that made footwear affordable for millions of ‘Muricans who can’t spare $100.00 USD for a pair of shoes their 7 year old will outgrow in a year— is closing. CNBC also had a story that featured a choropleth with a tiny button at the bottom that indicated one could get the data:

I should have known this would turn out to be a chore since they used Tableau—the platform of choice when you want to take advantage of all the free software libraries they use to power their premier platform which, in turn, locks up all the data for you so others can’t adopt, adapt and improve. Go. Egregious. Predatory. Capitalism.

Anyway.

I wanted the data to do some real analysis vs produce a fairly unhelpful visualization (TLDR: layer in Census data for areas impacted, estimate job losses, compute nearest similar Payless stores to see impact on transportation-challenged homes, etc. Y’know, citizen data journalism-y things) so I pressed the button and watched for the URL in Chrome (aye, for those that remember I moved to Firefox et al in 2018, I switched back; more on that in March) and copied it to try to make this post actually reproducible (a novel concept for Tableau fanbois):

library(tibble)
library(readr)

# https://www.cnbc.com/2019/02/19/heres-a-map-of-where-payless-shoesource-is-closing-2500-stores.html

tfil <- "~/Data/Sheet_3_data.csv"

download.file(
  "https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",
  tfil
)
## trying URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
## Error in download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",  : 
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true'
## In addition: Warning message:
## In download.file("https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true",  :
##   cannot open URL 'https://public.tableau.com/vizql/w/PAYLESSSTORECLOSINGS/v/Dashboard2/vud/sessions/6A678928620645FF99C7EF6353426CE8-0:0/views/10625182665948828489_7202546092381496425?csv=true&showall=true': HTTP status was '410 Gone'

WAT

Truth be told I expected a time-boxed URL of some sort (prior experience FTW). Selenium or Splash were potential alternatives but I didn’t want to research the legality of more forceful scraping (I just wanted the data) so I manually downloaded the file (*the horror*) and proceeded to read it in. Well, try to read it in:

read_csv(tfil)
## Parsed with column specification:
## cols(
##   A = col_logical()
## )
## Warning: 2092 parsing failures.
## row col           expected actual                      file
##   1   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   2   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   3   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   4   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
##   5   A 1/0/T/F/TRUE/FALSE        '~/Data/Sheet_3_data.csv'
## ... ... .................. ...... .........................
## See problems(...) for more details.
## 
## # A tibble: 2,090 x 1
##    A    
##    <lgl>
##  1 NA   
##  2 NA   
##  3 NA   
##  4 NA   
##  5 NA   
##  6 NA   
##  7 NA   
##  8 NA   
##  9 NA   
## 10 NA   
## # … with 2,080 more rows

WAT

Getting a single column back from readr::read_[ct]sv() is (generally) a tell-tale sign that the file format is amiss. Before donning a deerstalker (I just wanted the data!) I tried to just use good ol’ read.csv():

read.csv(tfil, stringsAsFactors=FALSE)
## Error in make.names(col.names, unique = TRUE) : 
##   invalid multibyte string at '<ff><fe>A'
## In addition: Warning messages:
## 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 1 appears to contain embedded nulls
## 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 2 appears to contain embedded nulls
## 3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 3 appears to contain embedded nulls
## 4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 4 appears to contain embedded nulls
## 5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
##   line 5 appears to contain embedded nulls

WAT

Actually the “WAT” isn’t really warranted since read.csv() gave us some super-valuable info via invalid multibyte string at '<ff><fe>A'. FF FE is a big signal we’re working with a file in another encoding as that’s a common “magic” sequence at the start of such files.
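A quick way to check for that “magic” sequence before reaching for any parser is to peek at the first few bytes. This is a rough sketch in base R; `sniff_bom()` is a hypothetical helper name and it only knows the handful of byte-order marks listed in the code.

```r
# Hypothetical helper: guess an encoding from a leading byte-order mark
sniff_bom <- function(path) {
  b <- readBin(path, what = "raw", n = 4L)

  has_prefix <- function(sig) {
    sig <- as.raw(sig)
    length(b) >= length(sig) && identical(b[seq_along(sig)], sig)
  }

  # UTF-32LE must be tested before UTF-16LE since both start with ff fe
  if (has_prefix(c(0xff, 0xfe, 0x00, 0x00))) return("UTF-32LE")
  if (has_prefix(c(0x00, 0x00, 0xfe, 0xff))) return("UTF-32BE")
  if (has_prefix(c(0xef, 0xbb, 0xbf))) return("UTF-8")
  if (has_prefix(c(0xff, 0xfe))) return("UTF-16LE")
  if (has_prefix(c(0xfe, 0xff))) return("UTF-16BE")

  NA_character_ # no recognizable byte-order mark
}
```

A missing mark proves nothing (plenty of files omit it), so treat NA as “keep investigating” rather than “plain ASCII”.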

But, I didn’t want to delve into my Columbo persona… I. Just. Wanted. The. Data. So, I tried the mind-bendingly fast and flexible helper from data.table:

data.table::fread(tfil)
## Error in data.table::fread(tfil) : 
##   File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.

AHA. UTF-16 (maybe). Let’s poke at the raw file:

x <- readBin(tfil, "raw", file.size(tfil)) ## also: read_file_raw(tfil)

x[1:100]
##   [1] ff fe 41 00 64 00 64 00 72 00 65 00 73 00 73 00 09 00 43 00
##  [21] 69 00 74 00 79 00 09 00 43 00 6f 00 75 00 6e 00 74 00 72 00
##  [41] 79 00 09 00 49 00 6e 00 64 00 65 00 78 00 09 00 4c 00 61 00
##  [61] 62 00 65 00 6c 00 09 00 4c 00 61 00 74 00 69 00 74 00 75 00
##  [81] 64 00 65 00 09 00 4c 00 6f 00 6e 00 67 00 69 00 74 00 75 00

There’s our ff fe (which is the beginning of the possibility it’s UTF-16) but that 41 00 harkens back to UTF-16’s older sibling UCS-2. The 0x00s are the high-order bytes of each two-byte, little-endian code unit (they’re what read.csv() flagged as embedded nuls). And, there are a lot of 09s. Y’know what they are? They’re <tab>s. That’s right. Tableau named a file full of TSV records, in an unnecessarily elaborate encoding, as a CSV. Perhaps they broke the “T” on all their keyboards typing their product name so much.
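To see what those bytes actually say, the first 28 bytes of the dump can be decoded by hand with base R’s iconv(), which accepts a list of raw vectors. This is only a sketch and assumes your platform’s iconv knows the name "UTF-16LE" (most do).

```r
# First 28 bytes of the raw dump shown above
hdr <- as.raw(c(
  0xff, 0xfe,                                     # byte-order mark
  0x41, 0x00, 0x64, 0x00, 0x64, 0x00, 0x72, 0x00,
  0x65, 0x00, 0x73, 0x00, 0x73, 0x00, 0x09, 0x00,
  0x43, 0x00, 0x69, 0x00, 0x74, 0x00, 0x79, 0x00,
  0x09, 0x00
))

# Drop the two-byte mark, then decode the rest as little-endian UTF-16
decoded <- iconv(list(hdr[-(1:2)]), from = "UTF-16LE", to = "UTF-8")
decoded
## [1] "Address\tCity\t"
```

Two column names, each followed by a tab: TSV, as suspected.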

Living A Boy’s [Data] Adventure Tale

At this point we have:

  • no way to support an automated, reproducible workflow
  • an ill-named file for what it contains
  • an overly-encoded file for what it contains
  • many wasted minutes (which is likely by design to have us give up and just use Tableau. No. Way.)

At this point I’m in full-on Rockford Files (pun intended) mode and delved down to the command line to use an old, trusted sidekick enca🔗:

$ enca -L none Sheet_3_data.csv
## Universal character set 2 bytes; UCS-2; BMP
##   LF line terminators
##   Byte order reversed in pairs (1,2 -> 2,1)

Now, all we have to do is specify the encoding!

read_tsv(tfil, locale = locale(encoding = "UCS-2LE"))
## Error in guess_header_(datasource, tokenizer, locale) : 
##   Incomplete multibyte sequence

WAT

Unlike the other 99% of the time (mebbe 99.9%) you use it, the tidyverse doesn’t have your back in this situation (but it does have your backlog in that it’s on the TODO).

Y’know who does have your back? Base R!:

read.csv(tfil, sep="\t", fileEncoding = "UCS-2LE", stringsAsFactors=FALSE) %>% 
  as_tibble()
## # A tibble: 2,089 x 14
##    Address City  Country Index Label Latitude Longitude
##    <chr>   <chr> <chr>   <int> <chr>    <dbl>     <dbl>
##  1 1627 O… Aubu… United…     1 Payl…     32.6     -85.4
##  2 900 Co… Doth… United…     2 Payl…     31.3     -85.4
##  3 301 Co… Flor… United…     3 Payl…     34.8     -87.6
##  4 304 Ox… Home… United…     4 Payl…     33.5     -86.8
##  5 2000 R… Hoov… United…     5 Payl…     33.4     -86.8
##  6 6140 U… Hunt… United…     6 Payl…     34.7     -86.7
##  7 312 Sc… Mobi… United…     7 Payl…     30.7     -88.2
##  8 3402 B… Mobi… United…     8 Payl…     30.7     -88.1
##  9 5300 H… Mobi… United…     9 Payl…     30.6     -88.2
## 10 6641 A… Mont… United…    10 Payl…     32.4     -86.2
## # … with 2,079 more rows, and 7 more variables:
## #   Number.of.Records <int>, State <chr>, Store.Number <int>,
## #   Store.count <int>, Zip.code <chr>, State.Usps <chr>,
## #   statename <chr>

WAT WOOT!

Note that read.csv(tfil, sep="\t", fileEncoding = "UTF-16LE", stringsAsFactors=FALSE) would have worked equally well.

The Road Not [Originally] Taken

Since this activity decimated productivity, for giggles I turned to another trusted R sidekick, the stringi package, to see what it said:

library(stringi)

stri_enc_detect(x)
## [[1]]
##      Encoding Language Confidence
## 1    UTF-16LE                1.00
## 2  ISO-8859-1       pt       0.61
## 3  ISO-8859-2       cs       0.39
## 4    UTF-16BE                0.10
## 5   Shift_JIS       ja       0.10
## 6     GB18030       zh       0.10
## 7      EUC-JP       ja       0.10
## 8      EUC-KR       ko       0.10
## 9        Big5       zh       0.10
## 10 ISO-8859-9       tr       0.01

And, just so it’s primed in the Google caches for future searchers, another way to get this data (and other data that’s even gnarlier but similar in form) into R would have been:

stri_read_lines(tfil) %>% 
  paste0(collapse="\n") %>% 
  read.csv(text=., sep="\t", stringsAsFactors=FALSE) %>% 
  as_tibble()
## # A tibble: 2,089 x 14
##    Address City  Country Index Label Latitude Longitude
##    <chr>   <chr> <chr>   <dbl> <chr>    <dbl>     <dbl>
##  1 1627 O… Aubu… United…     1 Payl…     32.6     -85.4
##  2 900 Co… Doth… United…     2 Payl…     31.3     -85.4
##  3 301 Co… Flor… United…     3 Payl…     34.8     -87.6
##  4 304 Ox… Home… United…     4 Payl…     33.5     -86.8
##  5 2000 R… Hoov… United…     5 Payl…     33.4     -86.8
##  6 6140 U… Hunt… United…     6 Payl…     34.7     -86.7
##  7 312 Sc… Mobi… United…     7 Payl…     30.7     -88.2
##  8 3402 B… Mobi… United…     8 Payl…     30.7     -88.1
##  9 5300 H… Mobi… United…     9 Payl…     30.6     -88.2
## 10 6641 A… Mont… United…    10 Payl…     32.4     -86.2
## # … with 2,079 more rows, and 7 more variables: `Number of
## #   Records` <dbl>, State <chr>, `Store Number` <dbl>, `Store
## #   count` <dbl>, `Zip code` <chr>, `State Usps` <chr>,
## #   statename <chr>

(with similar dances to use read_csv() or fread()).
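One such dance, sketched in base R only: re-encode the file to UTF-8 once, then hand the result to whichever reader you like. `recode_to_utf8()` is a hypothetical helper; it assumes the input really is UTF-16LE and is small enough to hold in memory.

```r
# Hypothetical helper: rewrite a UTF-16LE text file as UTF-8
recode_to_utf8 <- function(src, dest, from = "UTF-16LE") {
  con <- file(src, open = "r", encoding = from)
  on.exit(close(con), add = TRUE)

  lines <- readLines(con, warn = FALSE)

  # useBytes = TRUE writes the UTF-8 bytes as-is, with no further re-encoding
  writeLines(enc2utf8(lines), dest, useBytes = TRUE)

  invisible(dest)
}
```

After that, something like readr::read_tsv(dest) or data.table::fread(dest) should have far less to complain about.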

FIN

The night’s quest to do some real work with the data was DoS’d by what I’ll brazenly call a deliberate attempt to dissuade doing exactly that in anything but a commercial program. But, understanding the impact of yet-another massive retail store closing is super-important and it looks like it may be up to us (since the media is too distracted by incompetent leaders and inexperienced junior NY representatives) to do the work so it is left for another eve.

Folks who’d like to do the same can grab the UTF-8 encoded actual CSV from this site which has also been run through janitor::clean_names() so there’s proper column types and names to work with.

Speaking of which, here’s the cols spec for that CSV:

cols(
  address = col_character(),
  city = col_character(),
  country = col_character(),
  index = col_double(),
  label = col_character(),
  latitude = col_double(),
  longitude = col_double(),
  number_of_records = col_double(),
  state = col_character(),
  store_number = col_double(),
  store_count = col_double(),
  zip_code = col_character(),
  state_usps = col_character(),
  statename = col_character()
)

If you do anything with the data, blog about it and post a link in the comments so I and others can learn from what you’ve discovered! It’s already kinda scary that one doesn’t even need a basemap to see just how much a part of ‘Murica Payless was:

WiGLE has been around a while and is a great site to explore the pervasiveness or sparsity of Wi-Fi (and cellular) networks around the globe. While interactive use is fun, WiGLE also has a free API (so long as you obey the EULA and aren’t abusive) that lets you explore a little deeper if you register for an account, get a key, and use the WIP wiglr🔗 | GL| GH package.

Install from your least offensive social coding site:

devtools::install_git("https://git.sr.ht/~hrbrmstr/wiglr")
# or
devtools::install_gitlab("hrbrmstr/wiglr")
# or (if you must)
devtools::install_github("hrbrmstr/wiglr")

read the code (to ensure R package developers aren’t pwning or tracking you), and…after putting your “Encoded for use” token in the WIGLE_API_KEY environment variable…start exploring!
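If you’d rather fail fast than hit a confusing authentication error mid-analysis, a tiny pre-flight check is easy to write. This is a sketch: `need_env()` is a hypothetical helper (wiglr itself has wigle_api_key() for getting or setting the value), and the usual home for the key is a WIGLE_API_KEY entry in ~/.Renviron.

```r
# Hypothetical helper: stop early if an environment variable is missing
need_env <- function(name) {
  val <- Sys.getenv(name, unset = "")
  if (!nzchar(val)) {
    stop(sprintf("Environment variable %s is not set", name), call. = FALSE)
  }
  invisible(val)
}

# Usage (errors unless the key is available to this R session):
# need_env("WIGLE_API_KEY")
```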

Let’s pull all the country stats and plot the top 20:

library(wiglr)
library(hrbrthemes)
library(tidyverse) # for show

cc <- wigle_country_stats()

top_n(cc, 20) %>%  # show top 20 
  mutate(country = factor(country, levels = rev(country))) %>% 
  ggplot(aes(count, country)) +
  geom_segment(aes(xend=0, yend=country), size = 6, color = ft_cols$blue) +
  scale_x_comma(position = "top") +
  labs(
    x = "# Networks", y = NULL,
    title = "WiGLE Top 20 Countries"
  ) +
  theme_ipsum_rc(grid = "X")

We can also supply a bounding box and find all the Wi-Fi access points in the vicinity and use leaflet to make a handy interactive map (you can add other columns, too, to find fully open ones):

library(leaflet)

wifi_box <- wigle_bbox_search(43.2468, 43.2806, -70.9282, -70.8025)

wifi_box$results %>% 
  mutate(labs = sprintf("SSID: <code>%s</code><br/>Encryption: %s", ssid, encryption)) %>% 
  leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(~trilong, ~trilat, radius = 1, popup = ~labs)

There’s good but not complete WiGLE API coverage:

  • wigle_about_me: Get WiGLE user object for the current logged-in user
  • wigle_api_key: Get or set WIGLE_API_KEY value
  • wigle_bbox_search: Find Wi-Fi networks within a bounding box
  • wigle_country_stats: Get WiGLE statistics organized by country
  • wigle_region_stats: Get WiGLE statistics for a specified country, organized by region
  • wigle_site_stats: Get WiGLE named map of general statistics

FIN

The package is a WIP but the API is really straightforward, so if you’re looking to contribute to a (dare I say “fun!”) open source project in 2019 now’s your chance! Just drop an issue on whatever social coding site you prefer (ping me in a comment if you use something besides SourceHut, GitLab or GitHub and I’ll get the package up there) and spec out what you’d like to do or just file a well-articulated PR. Potential areas to enhance are:

  • support for more search parameters
  • a function for automatic pagination
  • cover the remaining search/retrieval API endpoints
  • investigate how to use R to submit readings
  • create an internal package Shiny app for exploring WiGLE straight from R
  • develop standard base visualizations and add them as package functions
  • write a vignette
  • improve package docs
  • develop a statistical model for the likelihood on finding free/open Wi-Fi in an area or the impact of Wi-Fi/cellular deserts on communities by layering in Census data

Working with or developing for the package may also help shed some light on yet-another-way we’re exposed in the digital world.

As always you’ll get judgement-free help/suggestions, full credit in the DESCRIPTION and the fame and glory of CRAN if it ever makes its way through the process. The package does just enough for me right now so it will be a while before I get to the above TODO list if others don’t jump in first.

I must preface this post with the posit that if you’re doing anything interactive() with Amazon Athena you should seriously consider just using their free ODBC drivers as it’s the easiest way to wire them up to R DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided Linux ODBC drivers to work on your flavor of Linux.

However

There are times—say, when you’re trying to stand up an R service in your kubernetes cluster which bridges data in Athena to analyses & visualizations in R—when ODBC drivers can be more of a hindrance than help and JDBC is the path of least resistance.

Sure, there’s the in-CRAN AWR.Athena package but it’s a fairly constrained and low-feature RJDBC shim which gets the basic job done but not much more.

Enter:

a trio of packages which aims to make it super-straightforward to wire up R to Amazon Athena when ODBC is not available.

Why Three Packages?

For starters, there are CRAN hopes for the metis-trio and one key component of that is separating out the JARs into one package (metis.jars) and actual functionality into others (metis and metis.tidy). We’ll see how the CRAN attempt goes since the JAR package is hefty enough to warrant a NOTE. The packaging of the driver reduces the need for you to pre-load the JAR (locally or into, say, a Docker image) or perform a package-initiated download-dance like AWR.Athena does (I still don’t understand why the way it does that hasn’t gotten it kicked out of CRAN, but ¯\_(ツ)_/¯).

metis.jars also has three helper functions which do some (basic) fun things:

library(metis.jars)

simba_driver_version()
## [1] "02.00.06.1008"

athena_supported_types()
##  [1] "BOOLEAN"   "TINYINT"   "SMALLINT"  "INT"       "INTEGER"  
##  [6] "BIGINT"    "REAL"      "FLOAT"     "DOUBLE"    "DECIMAL"  
## [11] "DATE"      "TIMESTAMP" "BINARY"    "VARBINARY" "CHAR"     
## [16] "VARCHAR"   "STRING"    "ARRAY"     "MAP"       "ROW"      
## [21] "STRUCT"   

metis_jar_path()
## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/metis.jars/java/AthenaJDBC42_2.0.6.jar"

The first uses the rJava interface to directly query the version (since Amazon seems to update the Simba JAR twice a year). By separating out the JAR into a separate package, updates can be made to the other two sibling packages more frequently without crushing CRAN’s disk space. metis.jars is also versioned to the included JAR so configuration management will be easier for folks.

The reason for the second type-lister function is that there’s hope Amazon will add support for all Presto data types, especially IPADDRESS. It, again, performs JDBC driver introspection to collect the supported types.

Finally, the third function abstracts the JAR location from the metis package or even your own interface package should you choose to depend on it.

OK, But Why Not Just Two?

The metis package is a more functional RJDBC superclass of a DBI wrapper than AWR.Athena. One thing it does that its CRAN cousin cannot is handle BIGINTs properly:

library(metis)

dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

dbGetQuery(con, "
SELECT
  CAST('chr' AS CHAR(4)) achar,
  CAST('varchr' AS VARCHAR) avarchr,
  CAST(SUBSTR(timestamp, 1, 10) AS DATE) AS tsday,
  CAST(100.1 AS DOUBLE) AS justadbl,
  CAST(127 AS TINYINT) AS asmallint,
  CAST(100 AS INTEGER) AS justanint,
  CAST(100000000000000000 AS BIGINT) AS abigint,
  CAST(('GET' = 'GET') AS BOOLEAN) AS is_get,
  ARRAY[1, 2, 3] AS arr1,
  ARRAY['1', '2, 3', '4'] AS arr2,
  MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]) AS mp,
  CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS rw,
  CAST('{\"a\":1}' AS JSON) js
FROM elb_logs
LIMIT 1
") %>% 
  dplyr::glimpse()
## Observations: 1
## Variables: 13
## $ achar     <chr> "chr "
## $ avarchr   <chr> "varchr"
## $ tsday     <date> 2014-09-29
## $ justadbl  <dbl> 100.1
## $ asmallint <int> 127
## $ justanint <int> 100
## $ abigint   <S3: integer64> 100000000000000000
## $ is_get    <lgl> TRUE
## $ arr1      <chr> "1, 2, 3"
## $ arr2      <chr> "1, 2, 3, 4"
## $ mp        <chr> "{bar=2, foo=1}"
## $ rw        <chr> "{x=1, y=2.0}"
## $ js        <chr> "\"{\\\"a\\\":1}\""

Athena arrays and maps and rows and JSON come across as characters from the Athena driver and they’re formatted so badly that there’s little hope of full R support for list columns for them. But, you do get real, big integers with metis along with full support for all other current Athena types.
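Why the fuss over BIGINT? R’s native integers are 32-bit and its doubles only represent integers exactly up to 2^53, so a 64-bit value needs something like the integer64 class shown in the output above. A base-R-only illustration of both limits:

```r
# The largest value a native R integer can hold
.Machine$integer.max
## [1] 2147483647

# 1e17 (the abigint value in the query above) is far outside that range
1e17 > .Machine$integer.max
## [1] TRUE

# Doubles silently lose integer precision past 2^53
(2^53 + 1) == 2^53
## [1] TRUE
```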

R folk who may be users of the old, standalone metis package need to be aware of some things.

First, dbConnect() has breaking changes. The snake_case names that still exist in the higher-level athena_jdbc() function are gone. In exchange for this pain, you now have full naming-parity with all the Athena JDBC connection properties and can more easily use alternate credential providers (as illustrated in the example above and in the package README), which is something metis’ cousin totally cannot do for you.

The metis package also makes it easier to see documentation for all available Athena connection properties since it has a vignette with a descriptive table of all of them (rendered here).

There is also nascent support for the “streaming API” (TLDR: faster result set downloads) but that won’t be fully tested until some AWS policy tweaks happen this week.

Gotcha. But, Why Not Just Two?

As awesome as it is (including base Docker image support) the tidyverse is not without overhead in terms of compilation time and dependencies, both of which are especially painful on Linux systems and some Docker environments. You can absolutely get by with some well-crafted SQL and JDBC and the thinner the image the easier it is to deploy and scale.

But! The tidyverse is so helpful that ensuring smooth support for Athena is critical. On its own, metis wires up to dplyr/dbplyr fine, but by providing (in metis.tidy) some enhanced db_data_type() support (primarily for BIGINT) and some extra 💙 in sql_translate_env() (for those of us who continue to mindlessly use R-only verbs like grep() or as.POSIXct() in non-R contexts) we can level-up interactive() use and tidyverse-infused service use:

library(metis.tidy)
library(dbplyr)
library(dplyr)

metis::dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

elb_logs <- tbl(con, "elb_logs")

filter(elb_logs, grepl("20", elbresponsecode)) %>%
  mutate(
    tsday = as.Date(substring(timestamp, 1L, 10L)),
    host = url_extract_host(url),
    proto_version = regexp_extract(protocol, "([[:digit:]\\.]+)"),
  ) %>%
  select(tsday, host, receivedbytes, requestprocessingtime, proto_version) %>%
  head(1) %>%
  glimpse()
## Observations: ??
## Variables: 5
## Database: AthenaConnection
## $ tsday                 <date> 2014-09-29
## $ host                  <chr> "www.abcxyz.com"
## $ receivedbytes         <S3: integer64> 0
## $ requestprocessingtime <dbl> 9.5e-05
## $ proto_version         <chr> "1.1"

FIN

A fairly big impetus for this radical refactoring was the need to use the Athena JDBC interface in R at $DAYJOB in a serverless context. So, if I/we needed it, others may as well. All three packages have tests (that work with my personal Athena setup which is easily replicated since it’s just the default schema & table you get when you enable Athena), pass CRAN checks and will be live in a real production environment by the time you read this.

Note that I do have CRAN plans for these three amigos, but all three packages will need to go in at the same time and I need to get tests into, and prove they’re live in, Travis before submitting. Now’s the time for feature requests, problem reports or issues. Until SourceHut’s (sr.ht) API is finished, said contributions are best left to GitLab (preferably) or GitHub (if you must continue to fill the coffers of giant multinational companies that undermine your freedom).

POSTSCRIPT

One other reason for re-visiting metis was this R-crashing rJava issue that is really a Simba Athena implementation issue (OS signals in a JDBC driver, rly?)

This Rprofile entry:

options(
  "java.parameters" = c(getOption("java.parameters", default = NULL), "-Xrs")
)

has been a solid workaround until rJava is updated. Note that metis.jars warns about this on load if it detects your setup is at risk.
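The one catch with that option is ordering: java.parameters is only consulted when the JVM starts, so it has to be set before rJava (or anything that loads it) is attached. Below is a small sketch of an idempotent way to add the flag; `ensure_xrs()` is a hypothetical helper, not something metis.jars exports.

```r
# Hypothetical helper: append -Xrs to java.parameters only if it is missing
ensure_xrs <- function() {
  params <- getOption("java.parameters", default = character())
  if (!("-Xrs" %in% params)) {
    options(java.parameters = c(params, "-Xrs"))
  }
  invisible(getOption("java.parameters"))
}

# Call this before library(rJava) or any package that starts the JVM
ensure_xrs()
```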

I’ve got a post up over at $DAYJOB’s blog on using the ropendata🔗 package to access the ginormous and ever-increasing amount of internet telemetry (scan) data via the Rapid7 Open Data API.

It’s super-R-code-heavy but renders surprisingly well in Ghost (the blogging platform we use at work) and covers everything from where to sign up for an account, installing the package (it’s on CRAN!), enriching scan data with free geolocation databases and rgeolocate plus even shows how to cobble together some fairly intricate R + Rcpp code so you can decode gnarly, binary UDP responses from esoteric protocol exchanges.

Give it a look, grab the companion Rmd and drop any comments there or here if you have any questions, or reach out to research@rapid7.com.

Over at $DAYJOB’s blog I’ve queued up a post that shows how to use our new ropendata🔗 package to work with our Open Data portal’s API. I’m not super-sure when it’s going to be posted so keep an RSS reader fixed on https://blog.rapid7.com/ if you’re interested in seeing it (I may make a small note of it here so it can wind its way into R Weekly & R-bloggers).

The example data used in the post is the public version of what I talked about in a recent post here, namely the devices discovered exposing the Ubiquiti Discovery Protocol.

I’m quite blessed at work since we have virtually all of our icky payload data pre-processed and in parquet map columns in Athena so I don’t really have to do much data wrangling once we’ve fully baked a new study.

The format of the public data for the Ubiquiti discovery protocol scan results is a bit different than the base64 encoded data in the previous post in that the payload response is a hex-encoded character string; e.g.

0100009302000a002722bccf9db126fa9a02000a002722bdcf9dc0a80101010006002722bccf9d0a000400006ae40b000c626a732e6572656e696c646f0c00064147352d48500d00104d6f72726f5f446f757261646f5f30330e000102030022584d2e6172373234302e76352e362e332e32383539312e3135313133302e31373439100002e24514000d41697247726964204d35204850

So, every two characters is a byte (e.g. "01" is 0x01).

R has a nice strtoi() function for converting a hex-encoded string into an integer value, but it turns each string into a single number, so it’s only useful here one byte (two hex characters) at a time. We can split a string (like the one above) into a character vector of two-character hex strings in many ways, one of which is using helper functions from the stringi package:

library(stringi)
library(magrittr) # for %>%

x <- "0100009302000a002722bccf9db126fa9a02000a002722bdcf9dc0a80101010006002722bccf9d0a000400006ae40b000c626a732e6572656e696c646f0c00064147352d48500d00104d6f72726f5f446f757261646f5f30330e000102030022584d2e6172373234302e76352e362e332e32383539312e3135313133302e31373439100002e24514000d41697247726964204d35204850"

stri_sub(x, seq(1, stri_length(x), by = 2), length = 2)
##   [1] "01" "00" "00" "93" "02" "00" "0a" "00" "27" "22" "bc" "cf" "9d" "b1" "26" "fa" "9a"
##  [18] "02" "00" "0a" "00" "27" "22" "bd" "cf" "9d" "c0" "a8" "01" "01" "01" "00" "06" "00"
##  [35] "27" "22" "bc" "cf" "9d" "0a" "00" "04" "00" "00" "6a" "e4" "0b" "00" "0c" "62" "6a"
##  [52] "73" "2e" "65" "72" "65" "6e" "69" "6c" "64" "6f" "0c" "00" "06" "41" "47" "35" "2d"
##  [69] "48" "50" "0d" "00" "10" "4d" "6f" "72" "72" "6f" "5f" "44" "6f" "75" "72" "61" "64"
##  [86] "6f" "5f" "30" "33" "0e" "00" "01" "02" "03" "00" "22" "58" "4d" "2e" "61" "72" "37"
## [103] "32" "34" "30" "2e" "76" "35" "2e" "36" "2e" "33" "2e" "32" "38" "35" "39" "31" "2e"
## [120] "31" "35" "31" "31" "33" "30" "2e" "31" "37" "34" "39" "10" "00" "02" "e2" "45" "14"
## [137] "00" "0d" "41" "69" "72" "47" "72" "69" "64" "20" "4d" "35" "20" "48" "50"

We still need to run that through strtoi() and turn it into a raw vector (at least for this use-case):

stri_sub(x, seq(1, stri_length(x), by = 2), length = 2) %>%
  strtoi(base = 16) %>%
  as.raw()
##   [1] 01 00 00 93 02 00 0a 00 27 22 bc cf 9d b1 26 fa 9a 02 00 0a 00 27 22 bd cf 9d c0 a8 01
##  [30] 01 01 00 06 00 27 22 bc cf 9d 0a 00 04 00 00 6a e4 0b 00 0c 62 6a 73 2e 65 72 65 6e 69
##  [59] 6c 64 6f 0c 00 06 41 47 35 2d 48 50 0d 00 10 4d 6f 72 72 6f 5f 44 6f 75 72 61 64 6f 5f
##  [88] 30 33 0e 00 01 02 03 00 22 58 4d 2e 61 72 37 32 34 30 2e 76 35 2e 36 2e 33 2e 32 38 35
## [117] 39 31 2e 31 35 31 31 33 30 2e 31 37 34 39 10 00 02 e2 45 14 00 0d 41 69 72 47 72 69 64
## [146] 20 4d 35 20 48 50
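
As a quick sanity check that the bytes mean something, here is a small base-R sketch that decodes just the last 13 bytes of the sample string into text. It assumes that this trailing run is plain ASCII (the bytes suggest so; the record layout is not spelled out here), and `tail_hex`, `starts` and `tail_raw` are names made up for the illustration. It uses `substring()` rather than stringi only to stay dependency-free.

```r
# Last 13 bytes of the sample payload, copied from the end of `x`
tail_hex <- "41697247726964204d35204850"

# Split into two-character pairs and convert each pair to one byte
starts   <- seq(1, nchar(tail_hex), by = 2)
tail_raw <- as.raw(strtoi(substring(tail_hex, starts, starts + 1), base = 16L))

# Interpret the bytes as characters
rawToChar(tail_raw)
## [1] "AirGrid M5 HP"
```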

On one of my systems, an individual use of that full processing pipeline with the sample string takes about 170μs which is not bad. But, what if we have half a million of them (as was the case with the blog post for work)? I mean, sure, it’s only about a minute and a half of processing time (with some variance as each bit of input will be of different lengths), but that’s a painful interactive 1.5 minutes and we still need to wrap that bit of code in a function with some vectorization so it can be used easily.
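
For reference, a minimal sketch of what such a vectorized pure-R wrapper might look like. `dehexify_r()` is a hypothetical name (it is not part of any package), it uses base R's `substring()` instead of stringi to keep the block self-contained, and it hands back `NA` for missing, empty or odd-length input, which is one reasonable choice among several.

```r
# Hypothetical helper: turn a character vector of hex strings into a
# list of raw vectors, one list element per input string.
dehexify_r <- function(input) {
  lapply(input, function(h) {
    # NA, empty, or odd-length strings cannot be split into whole bytes
    if (is.na(h) || nchar(h) == 0 || nchar(h) %% 2 != 0) {
      return(NA_character_)
    }
    starts <- seq(1, nchar(h), by = 2)
    # NB: non-hex characters are not checked here; strtoi() returns NA
    # for them and as.raw() turns that into 00 with a warning
    as.raw(strtoi(substring(h, starts, starts + 1), base = 16L))
  })
}

# A list of three elements: raw bytes 01 00, then NA, then NA
dehexify_r(c("0100", NA, "abc"))
```

Even wrapped up like that, each string still pays the full cost of the R-level split-and-convert pipeline.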

This is a good example of where the complexity introduced by using a little C++/Rcpp may be warranted, especially since the BH package (which brings us a ton of capabilities from the Boost C++ library) has some handy string utilities, including a boost::algorithm::unhex() function.

Here’s one way to attack the problem in C++/Rcpp within a plain ol’ R session:

library(Rcpp)

cppFunction(depends = "BH", '
  List dehexify_cpp(StringVector input) {

    List out(input.size()); // make room for our return value

    for (R_xlen_t i = 0; i < input.size(); i++) { // iterate over the input (signed index matches size())

      if (StringVector::is_na(input[i]) || (input[i].size() == 0)) {
        out[i] = StringVector::create(NA_STRING); // bad input
      } else if (input[i].size() % 2 == 0) { // likely to be ok input

        RawVector tmp(input[i].size() / 2); // only need half the space
        std::string h = boost::algorithm::unhex(Rcpp::as<std::string>(input[i])); // do the work
        std::copy(h.begin(), h.end(), tmp.begin()); // copy it to our raw vector

        out[i] = tmp; // save it to the List

      } else {
        out[i] = StringVector::create(NA_STRING); // bad input
      }

    }

    return(out);

  }
', includes = c('#include <boost/algorithm/hex.hpp>')
)

Now, we have a dehexify_cpp() function in our environment, so we can use it on any valid R data. Let’s see if we get the same results as the stringi R version:

dehexify_cpp(x)
## [[1]]
##   [1] 01 00 00 93 02 00 0a 00 27 22 bc cf 9d b1 26 fa 9a 02 00 0a 00 27 22 bd cf 9d c0 a8 01
##  [30] 01 01 00 06 00 27 22 bc cf 9d 0a 00 04 00 00 6a e4 0b 00 0c 62 6a 73 2e 65 72 65 6e 69
##  [59] 6c 64 6f 0c 00 06 41 47 35 2d 48 50 0d 00 10 4d 6f 72 72 6f 5f 44 6f 75 72 61 64 6f 5f
##  [88] 30 33 0e 00 01 02 03 00 22 58 4d 2e 61 72 37 32 34 30 2e 76 35 2e 36 2e 33 2e 32 38 35
## [117] 39 31 2e 31 35 31 31 33 30 2e 31 37 34 39 10 00 02 e2 45 14 00 0d 41 69 72 47 72 69 64
## [146] 20 4d 35 20 48 50

Apart from it being a list (since we took care of vectorization at the same time) it is, indeed, the same data.

With that tiny bit of fairly straightforward Rcpp/C++ code we get a substantially faster execution time of around 4μs. Yep, that’s not a typo: four microseconds.
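
Timings like these vary a lot from machine to machine, so here is a rough, base-R-only sketch for measuring per-call cost yourself. `time_per_call_us()` and `dehex_one()` are hypothetical helpers written for this illustration; `dehex_one()` is a base-R stand-in for the stringi pipeline, and the loop-and-divide approach is a crude estimate rather than a proper benchmark.

```r
# Hypothetical helper: average elapsed microseconds per call of f(input)
time_per_call_us <- function(f, input, times = 1000L) {
  elapsed <- system.time(for (i in seq_len(times)) f(input))[["elapsed"]]
  elapsed / times * 1e6
}

# Base-R stand-in for the R pipeline (one hex string -> raw vector)
dehex_one <- function(h) {
  starts <- seq(1, nchar(h), by = 2)
  as.raw(strtoi(substring(h, starts, starts + 1), base = 16L))
}

# Result depends entirely on your hardware
time_per_call_us(dehex_one, "0100009302000a002722bccf9d")

# With dehexify_cpp() compiled in your session, the same helper can be
# pointed at it: time_per_call_us(dehexify_cpp, x)
```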

We’ll give it a real world test with the payload data from work:

# This assumes you have a "~/Data" directory. Put it somewhere
# else if you don't have a "~/Data" directory.

if (!file.exists("~/Data/dehexify-sample-data.txt.gz")) {
  download.file(
    url = "https://rud.is/dl/dehexify-sample-data.txt.gz", 
    destfile = "~/Data/dehexify-sample-data.txt.gz"
  )
}

char_hex_lines <- readr::read_lines("~/Data/dehexify-sample-data.txt.gz")

length(char_hex_lines)
## [1] 501926

res <- dehexify_cpp(char_hex_lines)

That took just over a second to run on my main development system. But, did it really work? I chose index 998 at random so let’s poke at it with the tool from the other blog post:

udpprobe::parse_ubnt_discovery_response(res[[998]])
## [Model: N5N; Firmware: XW.ar934x.v5.5.9.21734.140403.1801; Uptime: 13.1 (hrs)

Aye, it did, indeed, work.

FIN

It’s still early in 2019 and if you haven’t settled on any resolutions yet or want to swap out one that isn’t working so well (who wants to drive to the gym anyway?) for another, perhaps add “experiment with Rcpp” to the list since a tiny dose of it can go a very long way toward speeding up some tasks.