
It was probably not difficult to discern from my previous Drill-themed post that I’m fairly excited about the Apache Drill 1.15.0 release. I’ve rounded out most of the existing corners for it in preparation for a long-overdue CRAN update and have been concentrating on two helper features: configuring & launching Drill embedded Docker containers and auto-generation of Drill CTAS queries.

Drill Docker Goodness

Starting with version 1.14.0, Apache provides Drill Docker images for use in experimenting/testing/building-off-of. They run Drill in single-node standalone mode, so you’re not going to be running this in “production” (unless you have light or just personal workloads). Docker is a great way to get to know Drill if you haven’t already played with it since you don’t have to do much except run the Docker image.

I’ve simplified this even more thanks to @rgfitzjohn’s most excellent stevedore package which adds a robust R wrapper to the Docker client without relying on any heavy external dependencies such as reticulate. The new drill_up() function will auto-fetch the latest Drill image and launch a container so you can have a running Drill instance with virtually no effort on your part.

Just running the vanilla image isn’t enough since your goal is likely to do more than work with the built-in cp data source. The default container launch scenario also doesn’t hook up any local filesystem paths to the container, so you really can’t do much other than cp-oriented queries. Rather than make you do all the work of figuring out the Docker command line arguments and manually configuring a workspace that points to a local filesystem area in the Drill web admin GUI, the drill_up() function provides a data_dir argument (which defaults to the getwd() of your R session) that auto-wires that path into the container and creates a dfs.d workspace pointing to it for you. Here’s a sample execution:

library(sergeant)
library(tidyverse)

dr <- drill_up(data_dir = "~/Data")
## Drill container started. Waiting for the service to become active (this may take up to 30s).
## Drill container ID: f02a11b50e1647e44c4e233799180da3e907c8aa27900f192b5fd72acfa67ec0

You can use dr$stop() to stop the container or use the printed container id to do it from the command line.
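The handle returned by drill_up() exposes a $stop() method (via the stevedore underpinnings), so shutting the container down from R is a one-liner; the shell equivalent is shown as a comment:

# stop the container from R …
dr$stop()

# … or from a shell, using the printed container id:
# docker stop f02a11b50e1647e44c4e233799180da3e907c8aa27900f192b5fd72acfa67ec0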

We’ll use this containerized Drill instance with the next feature but I need to thank @cboettig for the suggestion to make an auto-downloader-runner-thingy before doing that. (Thank you @cboettig!)

Taking the Tedium out of CTAS

@dseverski, an intrepid R, Drill & sergeant user noticed some new package behavior with Drill 1.15.0 that ended up spawning a new feature: automatic generation of Drill CTAS statements.

Prior to 1.14.0, sergeant had no way to accurately tell the data types of the columns coming back since the REST API didn’t provide them (as noted in the previous Drill post). It did rely on the JSON types to create the initial data frames, but it also did something kinda horribad: it ran readr::type_convert() on the result sets. Said operation had the singular benefit of auto-converting CSV/CSVH/TSV/PSV/etc. data to something sane without having to worry about writing lengthy CTAS queries (at the expense of potentially confusing everyone, though that didn’t seem to happen).
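For context, that crutch amounted to something like this (a toy illustration with made-up character data, not sergeant internals):

# readr::type_convert() re-guesses column types for character columns,
# which is what the old result-set handling effectively relied on
readr::type_convert(
  tibble::tibble(dep_delay = c("2", "4", "-1"), carrier = c("UA", "UA", "AA"))
)
# dep_delay gets guessed as <dbl>; carrier stays <chr>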

With 1.15.0, the readr::type_convert() crutch is gone, which results in less-than-helpful things like this when you have delimiter-separated values data:

# using the Drill container we just started above

write_csv(nycflights13::flights, "~/Data/flights.csvh")

con <- src_drill("localhost")

tbl(con, "dfs.d.`flights.csvh`") %>% 
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           <chr> "2013", "2013", "2013", "2013", "2013", "2013", "2013", "2013…
## $ month          <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ day            <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "…
## $ dep_time       <chr> "517", "533", "542", "544", "554", "554", "555", "557", "557"…
## $ sched_dep_time <chr> "515", "529", "540", "545", "600", "558", "600", "600", "600"…
## $ dep_delay      <chr> "2", "4", "2", "-1", "-6", "-4", "-5", "-3", "-3", "-2", "-2"…
## $ arr_time       <chr> "830", "850", "923", "1004", "812", "740", "913", "709", "838…
## $ sched_arr_time <chr> "819", "830", "850", "1022", "837", "728", "854", "723", "846…
## $ arr_delay      <chr> "11", "20", "33", "-18", "-25", "12", "19", "-14", "-8", "8",…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         <chr> "1545", "1714", "1141", "725", "461", "1696", "507", "5708", …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       <chr> "227", "227", "160", "183", "116", "150", "158", "53", "140",…
## $ distance       <chr> "1400", "1416", "1089", "1576", "762", "719", "1065", "229", …
## $ hour           <chr> "5", "5", "5", "5", "6", "5", "6", "6", "6", "6", "6", "6", "…
## $ minute         <chr> "15", "29", "40", "45", "0", "58", "0", "0", "0", "0", "0", "…
## $ time_hour      <chr> "2013-01-01T10:00:00Z", "2013-01-01T10:00:00Z", "2013-01-01T1…

So the package does what it finally should have been doing all along. But, as noted, that’s not great if you just wanted to quickly work with a directory of CSV files. In theory, you’re supposed to use Drill’s CREATE TABLE AS then do a bunch of CASTS and TO_s to get proper data types. But who has time for that?

David had a stellar idea: might sergeant be able to automagically create CTAS statements from a query? Yes. Yes it just might be able to do that with the new ctas_profile() function.

Let’s pipe the previous tbl() into ctas_profile() and see what we get:

tbl(con, "dfs.d.`flights.csvh`") %>% 
  ctas_profile() %>% 
  cat()
-- ** Created by ctas_profile() in the R sergeant package, version 0.8.0 **

CREATE TABLE CHANGE____ME AS
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS VARCHAR) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  TO_TIMESTAMP(`time_hour`, 'FORMATSTRING') AS `time_hour` -- *NOTE* You need to specify the format string. Sample character data is: [2013-01-01T10:00:00Z]. 
FROM (SELECT * FROM dfs.d.`flights.csvh`)


-- TIMESTAMP and/or DATE columns were detected. Drill's date/time format string
-- reference can be found at:
--
-- <http://joda-time.sourceforge.net/apidocs/org/joda/time/format/DateTimeFormat.html>

There’s a parameter for the new table name, which makes the CHANGE____ME go away, and when the function finds TIMESTAMP or DATE fields it knows to switch to their TO_ cousins and gives sample data with a reminder that you need to supply a format string (I’ll eventually auto-generate them unless someone PRs it first). And, since nobody but Java programmers remembers Joda format strings (they’re different from what you’re used to), it provides a handy link to them whenever it detects those column types.
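For example (a hedged sketch; the new_table_name parameter name is taken from the current sergeant documentation and should be treated as an assumption):

# supply the target table name up front so CHANGE____ME never appears
tbl(con, "dfs.d.`flights.csvh`") %>% 
  ctas_profile(new_table_name = "dfs.tmp.`/flights_typed`") %>% 
  cat()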

Now, we don’t need to actually create a new table (though converting a bunch of CSVs to Parquet is likely a good idea for performance reasons) to use that output. We can pass most of that new query right to tbl():

tbl(con, sql("
SELECT
  CAST(`year` AS DOUBLE) AS `year`,
  CAST(`month` AS DOUBLE) AS `month`,
  CAST(`day` AS DOUBLE) AS `day`,
  CAST(`dep_time` AS DOUBLE) AS `dep_time`,
  CAST(`sched_dep_time` AS DOUBLE) AS `sched_dep_time`,
  CAST(`dep_delay` AS DOUBLE) AS `dep_delay`,
  CAST(`arr_time` AS DOUBLE) AS `arr_time`,
  CAST(`sched_arr_time` AS DOUBLE) AS `sched_arr_time`,
  CAST(`arr_delay` AS DOUBLE) AS `arr_delay`,
  CAST(`carrier` AS VARCHAR) AS `carrier`,
  CAST(`flight` AS DOUBLE) AS `flight`,
  CAST(`tailnum` AS VARCHAR) AS `tailnum`,
  CAST(`origin` AS VARCHAR) AS `origin`,
  CAST(`dest` AS VARCHAR) AS `dest`,
  CAST(`air_time` AS DOUBLE) AS `air_time`,
  CAST(`distance` AS DOUBLE) AS `distance`,
  CAST(`hour` AS DOUBLE) AS `hour`,
  CAST(`minute` AS DOUBLE) AS `minute`,
  TO_TIMESTAMP(`time_hour`, 'yyyy-MM-dd''T''HH:mm:ssZ') AS `time_hour` -- [2013-01-01T10:00:00Z].
FROM (SELECT * FROM dfs.d.`flights.csvh`)
")) %>% 
  glimpse()
## Observations: ??
## Variables: 19
## Database: DrillConnection
## $ year           <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <dbl> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 5…
## $ sched_dep_time <dbl> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 6…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -…
## $ arr_time       <dbl> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, …
## $ sched_arr_time <dbl> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, …
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31,…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "…
## $ flight         <dbl> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71,…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK"…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO"…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 34…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0…
## $ time_hour      <dttm> 2013-01-01 10:00:00, 2013-01-01 10:00:00, 2013-01-01 10:00:0…

Ahhhh… Useful data types. (And, see what I mean about that daft format string? Also, WP is mangling the format string so add a comment if you need the actual string.)

FIN

As you can see, questions, suggestions (and PRs!) are welcome and heeded on your social-coding platform of choice (though y’all still seem to be stuck on GH).

NOTE: I’ll be subbing out most install_github() links in READMEs and future blog posts for install_git() counterparts pointing to my sr.ht repos (as I co-locate/migrate them there).

You can play with the new 0.8.0 features via devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0").

Well, 2018 has flown by and today seems like an appropriate time to take a look at the landscape of R bloggerdom as seen through the eyes of readers of R-bloggers and R Weekly. We’ll do this via a new package designed to make it easier to treat Feedly as a data source: seymour [GL | GH] (which is a pun-ified name based on a well-known phrase from Little Shop of Horrors).

The seymour package builds upon an introductory Feedly API blog post from back in April 2018 and covers most of the “getters” in the API (i.e. you won’t be adding anything to or modifying anything in Feedly through this package unless you PR into it with said functions). An impetus for finally creating the package came about when I realized that you don’t need a Feedly account to use the search or stream endpoints. You do get more data back if you have a developer token and can also access your own custom Feedly components if you have one. If you are a “knowledge worker” and do not have a Feedly account (and, really, a Feedly Pro account) you are missing out. But, this isn’t a rah-rah post about Feedly, it’s a rah-rah post about R! Onwards!

Feeling Out The Feeds

There are a bunch of different ways to get Feedly metadata about an RSS feed. One easy way is to just use the RSS feed URL itself:

library(seymour) # git[la|hu]b/hrbrmstr/seymour
library(hrbrthemes) # git[la|hu]b/hrbrmstr/hrbrthemes
library(lubridate)
library(tidyverse)
r_bloggers <- feedly_feed_meta("http://feeds.feedburner.com/RBloggers")
r_weekly <- feedly_feed_meta("https://rweekly.org/atom.xml")
r_weekly_live <- feedly_feed_meta("https://feeds.feedburner.com/rweeklylive")

glimpse(r_bloggers)
## Observations: 1
## Variables: 14
## $ feedId      <chr> "feed/http://feeds.feedburner.com/RBloggers"
## $ id          <chr> "feed/http://feeds.feedburner.com/RBloggers"
## $ title       <chr> "R-bloggers"
## $ subscribers <int> 24518
## $ updated     <dbl> 1.546227e+12
## $ velocity    <dbl> 44.3
## $ website     <chr> "https://www.r-bloggers.com"
## $ topics      <I(list)> data sci....
## $ partial     <lgl> FALSE
## $ iconUrl     <chr> "https://storage.googleapis.com/test-site-assets/X...
## $ visualUrl   <chr> "https://storage.googleapis.com/test-site-assets/X...
## $ language    <chr> "en"
## $ contentType <chr> "longform"
## $ description <chr> "Daily news and tutorials about R, contributed by ...

glimpse(r_weekly)
## Observations: 1
## Variables: 13
## $ feedId      <chr> "feed/https://rweekly.org/atom.xml"
## $ id          <chr> "feed/https://rweekly.org/atom.xml"
## $ title       <chr> "RWeekly.org - Blogs to Learn R from the Community"
## $ subscribers <int> 876
## $ updated     <dbl> 1.546235e+12
## $ velocity    <dbl> 1.1
## $ website     <chr> "https://rweekly.org/"
## $ topics      <I(list)> data sci....
## $ partial     <lgl> FALSE
## $ iconUrl     <chr> "https://storage.googleapis.com/test-site-assets/2...
## $ visualUrl   <chr> "https://storage.googleapis.com/test-site-assets/2...
## $ contentType <chr> "longform"
## $ language    <chr> "en"

glimpse(r_weekly_live)
## Observations: 1
## Variables: 9
## $ id          <chr> "feed/https://feeds.feedburner.com/rweeklylive"
## $ feedId      <chr> "feed/https://feeds.feedburner.com/rweeklylive"
## $ title       <chr> "R Weekly Live: R Focus"
## $ subscribers <int> 1
## $ updated     <dbl> 1.5461e+12
## $ velocity    <dbl> 14.7
## $ website     <chr> "https://rweekly.org/live"
## $ language    <chr> "en"
## $ description <chr> "Live Updates from R Weekly"

Feedly uses some special terms, one of which (above) is velocity. “Velocity” is simply the average number of articles published weekly (Feedly’s platform updates that every few weeks for each feed). R-bloggers has over 24,000 Feedly subscribers so any post-rankings we do here should be fairly representative. I included both the “live” and the week-based R Weekly feeds as I wanted to compare post coverage between R-bloggers and R Weekly in terms of raw content.

On the other hand, R Weekly’s “weekly” RSS feed has less than 1,000 subscribers. WAT?! While I have mostly nothing against R-bloggers-proper I heartily encourage ardent readers to also subscribe to R Weekly and perhaps even consider switching to it (or at least adding the individual blog feeds they monitor to your own Feedly). It wasn’t until the Feedly API that I had any idea of how many folks were really viewing my R blog posts since we must provide a full post RSS feed to R-bloggers and get very little in return (at least in terms of data). R Weekly uses a link counter but redirects all clicks to the blog author’s site where we can use logs or analytics platforms to measure engagement. R Weekly is also run by a group of volunteers (more eyes == more posts they catch!) and has a Patreon where the current combined weekly net is likely not enough to buy each volunteer a latte. No ads, a great team and direct engagement stats for the community of R bloggers seems like a great deal for $1.00 USD. If you weren’t persuaded by the above rant, then perhaps at least consider installing this (from source that you control).

Lastly, I believe I’m that “1” subscriber to R Weekly Live O_o. But, I digress.

We’ve got the feedIds (which can be used as “stream” ids) so let’s get cracking!

Binding Up The Posts

We need to use the feedId in calls to feedly_stream() to get the individual posts. The API claims there’s a temporal parameter that allows one to get posts only after a certain date but I couldn’t get it to work (PRs are welcome on any community source code portal you’re most comfortable in if you’re craftier than I am). As a result, we need to guess how many calls to make for two of the three feeds. Basic maths of 44 * 52 / 1000 suggests ~3 calls should suffice for R-bloggers (and fewer still for R Weekly (Live)), so that’s what we’ll do below. We should be able to get R Weekly (weekly) in one go.
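Here’s that back-of-the-napkin estimate as actual arithmetic (assuming the stream endpoint returns at most 1,000 items per call, which is what the maths above implies):

# velocity (posts/week) * 52 weeks / 1,000 items per call
ceiling(44.3 * 52 / 1000) # R-bloggers: ~2,300 posts/year -> 3 calls
ceiling(14.7 * 52 / 1000) # R Weekly (Live): 1 call would do; extra calls don't hurt
ceiling(1.1  * 52 / 1000) # R Weekly (weekly): easily one go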

r_weekly_wk <- feedly_stream(r_weekly$feedId)

range(r_weekly_wk$items$published) # my preview of this said it got back to 2016!
## [1] "2016-05-20 20:00:00 EDT" "2018-12-30 19:00:00 EST"

# NOTE: If this were more than 3 I'd use a loop/iterator
# In reality, I should make a helper function to do this for you (PRs welcome)

r_blog_1 <- feedly_stream(r_bloggers$feedId)
r_blog_2 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_1$continuation)
r_blog_3 <- feedly_stream(r_bloggers$feedId, continuation = r_blog_2$continuation)

r_weekly_live_1 <- feedly_stream(r_weekly_live$feedId)
r_weekly_live_2 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_1$continuation)
r_weekly_live_3 <- feedly_stream(r_weekly_live$feedId, continuation = r_weekly_live_2$continuation)

bind_rows(r_blog_1$items, r_blog_2$items, r_blog_3$items) %>% 
  filter(published >= as.Date("2018-01-01")) -> r_blog_stream

bind_rows(r_weekly_live_1$items, r_weekly_live_2$items, r_weekly_live_3$items) %>% 
  filter(published >= as.Date("2018-01-01")) -> r_weekly_live_stream

r_weekly_wk_stream <- filter(r_weekly_wk$items, published >= as.Date("2018-01-01"))

Let’s take a look:

glimpse(r_weekly_wk_stream)
## Observations: 54
## Variables: 27
## $ id                  <chr> "2nIALmjjlFcpPJKakm2k8hjka0FzpApixM7HHu8B0...
## $ originid            <chr> "https://rweekly.org/2018-53", "https://rw...
## $ fingerprint         <chr> "114357f1", "199f78d0", "9adc236e", "63f99...
## $ title               <chr> "R Weekly 2018-53 vroom, Classification", ...
## $ updated             <dttm> 2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ crawled             <dttm> 2018-12-31 00:51:39, 2018-12-23 23:46:49,...
## $ published           <dttm> 2018-12-30 19:00:00, 2018-12-23 19:00:00,...
## $ alternate           <list> [<https://rweekly.org/2018-53.html, text/...
## $ canonicalurl        <chr> "https://rweekly.org/2018-53.html", "https...
## $ unread              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ engagement          <int> 1, 5, 5, 3, 2, 3, 1, 2, 3, 2, 4, 3, 2, 2, ...
## $ engagementrate      <dbl> 0.33, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ recrawled           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ tags                <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ content_content     <chr> "<p>Hello and welcome to this new issue!</...
## $ content_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/https://rweekly.org/atom.xml", "feed...
## $ origin_title        <chr> "RWeekly.org - Blogs to Learn R from the C...
## $ origin_htmlurl      <chr> "https://rweekly.org/", "https://rweekly.o...
## $ visual_processor    <chr> "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url          <chr> "https://github.com/rweekly/image/raw/mast...
## $ visual_width        <int> 372, 672, 1000, 1000, 1000, 1001, 1000, 10...
## $ visual_height       <int> 479, 480, 480, 556, 714, 624, 237, 381, 36...
## $ visual_contenttype  <chr> "image/png", "image/png", "image/gif", "im...
## $ webfeeds_icon       <chr> "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_weekly_live_stream)
## Observations: 1,333
## Variables: 27
## $ id                  <chr> "rhkRVQ8KjjGRDQxeehIj6RRIBGntdni0ZHwPTR8B3...
## $ originid            <chr> "https://link.rweekly.org/ckb", "https://l...
## $ fingerprint         <chr> "c11a0782", "c1897fc3", "c0b36206", "7049e...
## $ title               <chr> "Top Tweets of 2018", "My #Best9of2018 twe...
## $ crawled             <dttm> 2018-12-29 11:11:52, 2018-12-28 11:24:22,...
## $ published           <dttm> 2018-12-28 19:00:00, 2018-12-27 19:00:00,...
## $ canonical           <list> [<https://link.rweekly.org/ckb, text/html...
## $ alternate           <list> [<http://feedproxy.google.com/~r/RWeeklyL...
## $ unread              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ tags                <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ canonicalurl        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ ampurl              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ cdnampurl           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ engagement          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ summary_content     <chr> "<p>maraaverick.rbind.io</p><img width=\"1...
## $ summary_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/https://feeds.feedburner.com/rweekly...
## $ origin_title        <chr> "R Weekly Live: R Focus", "R Weekly Live: ...
## $ origin_htmlurl      <chr> "https://rweekly.org/live", "https://rweek...
## $ visual_url          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_processor    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_width        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_height       <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ visual_contenttype  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

glimpse(r_blog_stream)
## Observations: 2,332
## Variables: 34
## $ id                  <chr> "XGq6cYRY3hH9/vdZr0WOJiPdAe0u6dQ2ddUFEsTqP...
## $ keywords            <list> ["R bloggers", "R bloggers", "R bloggers"...
## $ originid            <chr> "https://datascienceplus.com/?p=19513", "h...
## $ fingerprint         <chr> "2f32071a", "332f9548", "2e6f8adb", "3d7ed...
## $ title               <chr> "Leaf Plant Classification: Statistical Le...
## $ crawled             <dttm> 2018-12-30 22:35:22, 2018-12-30 19:01:25,...
## $ published           <dttm> 2018-12-30 19:26:20, 2018-12-30 13:18:00,...
## $ canonical           <list> [<https://www.r-bloggers.com/leaf-plant-c...
## $ author              <chr> "Giorgio Garziano", "Sascha W.", "Economet...
## $ alternate           <list> [<http://feedproxy.google.com/~r/RBlogger...
## $ unread              <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
## $ categories          <list> [<user/c45e5b02-5a96-464c-bf77-4eea75409c...
## $ entities            <list> [<c("nlp/f/entity/en/-/leaf plant classif...
## $ engagement          <int> 50, 39, 482, 135, 33, 12, 13, 41, 50, 31, ...
## $ engagementrate      <dbl> 1.43, 0.98, 8.76, 2.45, 0.59, 0.21, 0.22, ...
## $ enclosure           <list> [NULL, NULL, NULL, NULL, <c("https://0.gr...
## $ tags                <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL...
## $ recrawled           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ updatecount         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ content_content     <chr> "<p><div><div><div><div data-show-faces=\"...
## $ content_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ summary_content     <chr> "CategoriesAdvanced Modeling\nTags\nLinear...
## $ summary_direction   <chr> "ltr", "ltr", "ltr", "ltr", "ltr", "ltr", ...
## $ origin_streamid     <chr> "feed/http://feeds.feedburner.com/RBlogger...
## $ origin_title        <chr> "R-bloggers", "R-bloggers", "R-bloggers", ...
## $ origin_htmlurl      <chr> "https://www.r-bloggers.com", "https://www...
## $ visual_processor    <chr> "feedly-nikon-v3.1", "feedly-nikon-v3.1", ...
## $ visual_url          <chr> "https://i0.wp.com/datascienceplus.com/wp-...
## $ visual_width        <int> 383, 400, NA, 286, 456, 250, 450, 456, 397...
## $ visual_height       <int> 309, 300, NA, 490, 253, 247, 450, 253, 333...
## $ visual_contenttype  <chr> "image/png", "image/png", NA, "image/png",...
## $ webfeeds_icon       <chr> "https://storage.googleapis.com/test-site-...
## $ decorations_dropbox <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ decorations_pocket  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

Let’s also check how far into December each feed got as of this post (I’ll check again after the 31st and update if needed):

range(r_weekly_wk_stream$published)
## [1] "2018-01-07 19:00:00 EST" "2018-12-30 19:00:00 EST"

range(r_blog_stream$published)
## [1] "2018-01-01 11:00:27 EST" "2018-12-30 19:26:20 EST"

range(r_weekly_live_stream$published)
## [1] "2018-01-01 19:00:00 EST" "2018-12-28 19:00:00 EST"

Digging Into The Weeds Feeds

In the above glimpses there’s another special term, engagement. Feedly defines this as an “indicator of how popular this entry is. The higher the number, the more readers have read, saved or shared this particular entry”. We’ll use this to look at the most “engaged” content in a bit. What’s noticeable from the start is that R Weekly Live has 1,333 entries and R-bloggers has 2,332 entries (so, nearly double the number of entries). Those counts are a bit of “fake news” when it comes to overall unique posts, as can be seen by:

bind_rows(
  mutate(r_weekly_live_stream, src = "R Weekly (Live)"),
  mutate(r_blog_stream, src = "R-bloggers")
) %>% 
  mutate(wk = lubridate::week(published)) -> y2018

filter(y2018, title == "RcppArmadillo 0.9.100.5.0") %>% 
  select(src, title, originid, published) %>% 
  gt::gt()

src title originid published
R Weekly (Live) RcppArmadillo 0.9.100.5.0 https://link.rweekly.org/bg6 2018-08-17 07:55:00
R Weekly (Live) RcppArmadillo 0.9.100.5.0 https://link.rweekly.org/bfr 2018-08-16 21:20:00
R-bloggers RcppArmadillo 0.9.100.5.0 https://www.r-bloggers.com/?guid=f8865e8a004f772bdb64e3c4763a0fe5 2018-08-17 08:00:00
R-bloggers RcppArmadillo 0.9.100.5.0 https://www.r-bloggers.com/?guid=3046299f73344a927f787322c867233b 2018-08-16 21:20:00

Feedly has many processes going on behind the scenes to identify new entries and update entries as original sources are modified. This “duplication” (thankfully) doesn’t happen a lot:

count(y2018, src, wk, title, sort=TRUE) %>% 
  filter(n > 1) %>% 
  arrange(wk) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("wk", "n"), decimals = 0)

src wk title n
R-bloggers 3 conapomx data package 2
R Weekly (Live) 5 R in Latin America 2
R Weekly (Live) 12 Truncated Poisson distributions in R and Stan by @ellis2013nz 2
R Weekly (Live) 17 Regression Modeling Strategies 2
R Weekly (Live) 18 How much work is onboarding? 2
R Weekly (Live) 18 Survey books, courses and tools by @ellis2013nz 2
R-bloggers 20 Beautiful and Powerful Correlation Tables in R 2
R Weekly (Live) 24 R Consortium is soliciting your feedback on R package best practices 2
R Weekly (Live) 33 RcppArmadillo 0.9.100.5.0 2
R-bloggers 33 RcppArmadillo 0.9.100.5.0 2
R-bloggers 39 Individual level data 2
R Weekly (Live) 41 How R gets built on Windows 2
R Weekly (Live) 41 R Consortium grant applications due October 31 2
R Weekly (Live) 41 The Economist’s Big Mac Index is calculated with R 2
R Weekly (Live) 42 A small logical change with big impact 2
R Weekly (Live) 42 Maryland’s Bridge Safety, reported using R 2
R-bloggers 47 OneR – fascinating insights through simple rules 2

In fact, it happens infrequently enough that I’m going to let the “noise” stay in the data since Feedly technically is tracking some content change.
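If you did want to drop those few re-crawled duplicates, a simple dedupe keyed on source, week and title would do it (a sketch using the y2018 frame built above):

# keep the first occurrence of each src/week/title combination
y2018_dedup <- distinct(y2018, src, wk, title, .keep_all = TRUE)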

Let’s look at the week-over-week curation counts (neither source publishes original content, so using the term “published” seems ill fitting) for each:

count(y2018, src, wk) %>% 
  ggplot(aes(wk, n)) +
  geom_segment(aes(xend=wk, yend=0, color = src), show.legend = FALSE) +
  facet_wrap(~src, ncol=1, scales="free_x") + 
  labs(
    x = "Week #", y = "# Posts", 
    title = "Weekly Post Curation Stats for R-bloggers & R Weekly (Live)"
  ) +
  theme_ft_rc(grid="Y")

week over week

Despite R-bloggers having curated more overall content, there’s plenty to read each week for consumers of either/both aggregators.

Speaking of consuming, let’s look at the distribution of engagement scores for both aggregators:

group_by(y2018, src) %>% 
  summarise(v = list(broom::tidy(summary(engagement)))) %>% 
  unnest()
## # A tibble: 2 x 8
##   src             minimum    q1 median  mean    q3 maximum    na
##   <chr>             <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1 R Weekly (Live)       0     0    0     0       0       0  1060
## 2 R-bloggers            1    16   32.5  58.7    70    2023    NA

Well, it seems that it’s more difficult for Feedly to track engagement for the link-only R Weekly (Live) feed, so we’ll have to focus on R-bloggers for engagement views. Summary values are fine, but we can get a picture of the engagement distribution (we’ll do it monthly to get a bit more granularity, too):

filter(y2018, src == "R-bloggers") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>% 
  ggplot(aes(month, engagement)) +
  geom_violin() +
  ggbeeswarm::geom_quasirandom(
    groupOnX = TRUE, size = 2, color = "#2b2b2b", fill = ft_cols$green,
    shape = 21, stroke = 0.25
  ) +
  scale_y_comma(trans = "log10") +
  labs(
    x = NULL, y = "Engagement Score",
    title = "Monthly Post Engagement Distributions for R-bloggers Curated Posts",
    caption = "NOTE: Y-axis log10 Scale"
  ) +
  theme_ft_rc(grid="Y")

post engagement distribution

I wasn’t expecting each month’s distribution to be so similar. There are definitely outliers in terms of positive engagement, so we should be able to see what types of R-focused content pique the interest of the ~25,000 Feedly subscribers of R-bloggers.

filter(y2018, src == "R-bloggers") %>% 
  group_by(author) %>% 
  summarise(n_posts = n(), total_eng = sum(engagement), avg_eng = mean(engagement), med_eng = median(engagement)) %>% 
  arrange(desc(n_posts)) %>% 
  slice(1:20) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("n_posts", "total_eng", "avg_eng", "med_eng"), decimals = 0)

author n_posts total_eng avg_eng med_eng
David Smith 116 9,791 84 47
John Mount 94 4,614 49 33
rOpenSci – open tools for open science 89 2,967 33 19
Thinking inside the box 85 1,510 18 14
R Views 60 4,142 69 47
hrbrmstr 55 1,179 21 16
Dr. Shirin Glander 54 2,747 51 25
xi’an 49 990 20 12
Mango Solutions 42 1,221 29 17
Econometrics and Free Software 33 2,858 87 60
business-science.io – Articles 31 4,484 145 70
NA 31 1,724 56 40
statcompute 29 1,329 46 33
Ryan Sheehy 25 1,271 51 45
Keith Goldfeld 24 1,305 54 43
free range statistics – R 23 440 19 12
Jakob Gepp 21 348 17 13
Tal Galili 21 1,587 76 22
Jozef’s Rblog 18 1,617 90 65
arthur charpentier 16 1,320 82 68

It is absolutely no surprise David comes in at number one in both post count and almost every engagement summary statistic since he’s a veritable blogging machine and creates + curates some super interesting content (whereas yours truly doesn’t even make the median engagement cut).

What were the most engaging posts?

filter(y2018, src == "R-bloggers") %>% 
  arrange(desc(engagement)) %>% 
  mutate(published = as.Date(published)) %>% 
  select(engagement, title, published, author) %>% 
  slice(1:50) %>% 
  gt::gt() %>% 
  gt::fmt_number(c("engagement"), decimals = 0)

engagement title published author
2,023 Happy Birthday R 2018-08-27 eoda GmbH
1,132 15 Types of Regression you should know 2018-03-25 ListenData
697 R and Python: How to Integrate the Best of Both into Your Data Science Workflow 2018-10-08 business-science.io – Articles
690 Ultimate Python Cheatsheet: Data Science Workflow with Python 2018-11-18 business-science.io – Articles
639 Data Analysis with Python Course: How to read, wrangle, and analyze data 2018-10-31 Andrew Treadway
617 Machine Learning Results in R: one plot to rule them all! 2018-07-18 Bernardo Lares
614 R tip: Use Radix Sort 2018-08-21 John Mount
610 Data science courses in R (/python/etc.) for $10 at Udemy (Sitewide Sale until Aug 26th) 2018-08-24 Tal Galili
575 Why R for data science – and not Python? 2018-12-02 Learning Machines
560 Case Study: How To Build A High Performance Data Science Team 2018-09-18 business-science.io – Articles
516 R 3.5.0 is released! (major release with many new features) 2018-04-24 Tal Galili
482 R or Python? Why not both? Using Anaconda Python within R with {reticulate} 2018-12-30 Econometrics and Free Software
479 Sankey Diagram for the 2018 FIFA World Cup Forecast 2018-06-10 Achim Zeileis
477 5 amazing free tools that can help with publishing R results and blogging 2018-12-22 Jozef’s Rblog
462 What’s the difference between data science, machine learning, and artificial intelligence? 2018-01-09 David Robinson
456 XKCD “Curve Fitting”, in R 2018-09-28 David Smith
450 The prequel to the drake R package 2018-02-06 rOpenSci – open tools for open science
449 Who wrote that anonymous NYT op-ed? Text similarity analyses with R 2018-09-07 David Smith
437 Elegant regression results tables and plots in R: the finalfit package 2018-05-16 Ewen Harrison
428 How to implement neural networks in R 2018-01-12 David Smith
426 Data transformation in style: package sjmisc updated 2018-02-06 Daniel
413 Neural Networks Are Essentially Polynomial Regression 2018-06-20 matloff
403 Custom R charts coming to Excel 2018-05-11 David Smith
379 A perfect RStudio layout 2018-05-22 Ilya Kashnitsky
370 Drawing beautiful maps programmatically with R, sf and ggplot2 — Part 1: Basics 2018-10-25 Mel Moreno and Mathieu Basille
368 The Financial Times and BBC use R for publication graphics 2018-06-27 David Smith
367 Dealing with The Problem of Multicollinearity in R 2018-08-16 Perceptive Analytics
367 Excel is obsolete. Here are the Top 2 alternatives from R and Python. 2018-03-13 Appsilon Data Science Blog
365 New R Cheatsheet: Data Science Workflow with R 2018-11-04 business-science.io – Articles
361 Tips for analyzing Excel data in R 2018-08-30 David Smith
360 Importing 30GB of data in R with sparklyr 2018-02-16 Econometrics and Free Software
358 Scraping a website with 5 lines of R code 2018-01-24 David Smith
356 Clustering the Bible 2018-12-27 Learning Machines
356 Finally, You Can Plot H2O Decision Trees in R 2018-12-26 Gregory Kanevsky
356 Geocomputation with R – the afterword 2018-12-12 Rstats on Jakub Nowosad’s website
347 Time Series Deep Learning: Forecasting Sunspots With Keras Stateful LSTM In R 2018-04-18 business-science.io – Articles
343 Run Python from R 2018-03-27 Deepanshu Bhalla
336 Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models) 2018-07-24 Bernardo Lares
332 R Generation: 25 Years of R 2018-08-01 David Smith
329 How to extract data from a PDF file with R 2018-01-05 Packt Publishing
325 R or Python? Python or R? The ongoing debate. 2018-01-28 tomaztsql
322 How to perform Logistic Regression, LDA, & QDA in R 2018-01-05 Prashant Shekhar
321 Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity 2018-09-06 David Robinson
311 Intuition for principal component analysis (PCA) 2018-12-06 Learning Machines
310 Packages for Getting Started with Time Series Analysis in R 2018-02-18 atmathew
309 Announcing the R Markdown Book 2018-07-13 Yihui Xie
307 Automated Email Reports with R 2018-11-01 JOURNEYOFANALYTICS
304 future.apply – Parallelize Any Base R Apply Function 2018-06-23 JottR on R
298 How to build your own Neural Network from scratch in R 2018-10-09 Posts on Tychobra
293 RStudio 1.2 Preview: SQL Integration 2018-10-02 Jonathan McPherson

Weekly & monthly curated post descriptive statistic patterns haven’t changed much since the April post:

filter(y2018, src == "R-bloggers") %>% 
  mutate(wkday = lubridate::wday(published, label = TRUE, abbr = TRUE)) %>%
  count(wkday) %>% 
  ggplot(aes(wkday, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Day-of-week Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

day of week view

filter(y2018, src == "R-bloggers") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month) %>% 
  ggplot(aes(month, n)) +
  geom_col(width = 0.5, fill = ft_cols$slate, color = NA) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Curated Posts",
    title = "Monthly Curated Post Count for the R-bloggers Feed"
  ) +
  theme_ft_rc(grid="Y")

month view

Surprisingly, monthly post count consistency (or even posting something each month) is not a common trait amongst the top 20 (by total engagement) authors:

w20 <- scales::wrap_format(20)

filter(y2018, src == "R-bloggers") %>% 
  filter(!is.na(author)) %>% # some posts don't have author attribution
  mutate(author_t = map_chr(w20(author), paste0, collapse="\n")) %>% # we need to wrap for facet titles (below)
  count(author, author_t, wt=engagement, sort=TRUE) %>% # get total author engagement
  slice(1:20) %>% # top 20
  { .auth_ordr <<- . ; . } %>% # we use the order later
  left_join(filter(y2018, src == "R-bloggers"), "author") %>% 
  mutate(month = lubridate::month(published, label = TRUE, abbr = TRUE)) %>%
  count(month, author_t, sort = TRUE) %>% 
  mutate(author_t = factor(author_t, levels = .auth_ordr$author_t)) %>% 
  ggplot(aes(month, nn, author_t)) +
  geom_col(width = 0.5) +
  scale_x_discrete(labels=substring(month.abb, 1, 1)) +
  scale_y_comma() +
  facet_wrap(~author_t) +
  labs(
    x = NULL, y = "Curated Post Count",
    title = "Monthly Curated Post Counts-per-Author (Top 20 by Engagement)",
    subtitle = "Arranged by Total Author Engagement"
  ) +
  theme_ft_rc(grid="yY")

Overall, most authors favor shorter titles for their posts:

filter(y2018, src == "R-bloggers") %>% 
  mutate(
    `Character Count Distribution` = nchar(title), 
    `Word Count Distribution` = stringi::stri_count_boundaries(title, type = "word")
  ) %>% 
  select(id, `Character Count Distribution`, `Word Count Distribution`) %>% 
  gather(measure, value, -id) %>% 
  ggplot(aes(value)) +
  ggalt::geom_bkde(alpha=1/3, color = ft_cols$slate, fill = ft_cols$slate) +
  scale_y_continuous(expand=c(0,0)) +
  facet_wrap(~measure, scales = "free") +
  labs(
    x = NULL, y = "Density",
    title = "Title Character/Word Count Distributions",
    subtitle = "~38 characters/11 words seems to be the sweet spot for most authors",
    caption = "Note Free X/Y Scales"
  ) +
  theme_ft_rc(grid="XY")

This post is already kinda tome-length so I’ll leave it to y’all to grab the data and dig in a bit more.

A Word About Using The content_content Field For R-bloggers Posts

Since R-bloggers requires a full feed from contributors, they, in turn, post a “kinda” full feed back out. I say “kinda” as they still haven’t fixed a reported bug in their processing engine which causes issues in (at least) Feedly’s RSS processing engine. If you use Feedly, take a look at the R-bloggers RSS feed entry for the recent “R or Python? Why not both? Using Anaconda Python within R with {reticulate}” post. It cuts off near “Let’s check its type:”. This is due to the way the < character is processed by the R-bloggers ingestion engine: it mangles the ## <class 'pandas.core.frame.DataFrame'> line from the original post (which doesn’t even display right on the R-bloggers page) by turning the descriptive output into an actual <class> tag: <class &#39;pandas.core.frame.dataframe&#39;=""></class>. It’s really an issue on both sides, but R-bloggers is doing the mangling and should seriously consider addressing it in 2019.

Since it is still not fixed, it forces you to go to R-bloggers (clicks FTW? That may partly explain why that example post has a 400+ engagement score) unless you scroll back up to the top of the Feedly view and go to the author’s blog page. Given that tibble output invariably has a < right up top, your best bet for getting more direct views of your own content is to get a code block with printed ## < output in it as close to the beginning as possible (perhaps start each post with a print(tbl_df(mtcars))?).

Putting post-view-hacking levity aside, this content mangling means you can’t trust the content_content column in the stream data frame to have all the content; that is, if you were planning on taking the provided data and doing some topic clustering or content-based feature extraction for other stats/ML ops you’re out of luck and need to crawl the original site URLs on your own to get the main content for such analyses.
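If you do need full text, one (hedged) approach is to crawl the post URLs yourself with rvest; the CSS selectors below are guesses and will differ from blog to blog, so treat this purely as a sketch:

library(rvest)

r_blog_stream %>% 
  mutate(alt_url = map_chr(alternate, ~.x[[1]])) %>% # same trick used later with the security feed
  head(3) %>% 
  pull(alt_url) %>% 
  map_chr(possibly(~{
    read_html(.x) %>% 
      html_node("article, .entry-content, .post") %>% # theme-dependent guess
      html_text(trim = TRUE)
  }, NA_character_)) -> sample_full_text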

A Bit More About seymour

The seymour package has the following API functions:

  • feedly_access_token: Retrieve the Feedly Developer Token
  • feedly_collections: Retrieve Feedly Collections
  • feedly_feed_meta: Retrieve Metadata for a Feed
  • feedly_opml: Retrieve Your Feedly OPML File
  • feedly_profile: Retrieve Your Feedly Profile
  • feedly_search_contents: Search content of a stream
  • feedly_search_title: Find feeds based on title, url or ‘#topic’
  • feedly_stream: Retrieve contents of a Feedly “stream”
  • feedly_tags: Retrieve List of Tags

along with following helper function (which we’ll introduce in a minute):

  • render_stream: Render a Feedly Stream Data Frame to RMarkdown

and, the following helper reference (as Feedly has some “universal” streams):

  • global_resource_ids: Global Resource Ids Helper Reference

The render_stream() function is semi-useful on its own but was designed as more of a “you may want to replicate this on your own” (i.e. have a look at the source code and riff off of it). “Streams” are individual feeds, collections or even “boards” you make and with this new API package and the power of R Markdown, you can make your own “newsletter” like this:

fp <- feedly_profile() # get profile to get my id

# use the id to get my "security" category feed in my feedly
fs <- feedly_stream(sprintf("user/%s/category/security", fp$id))

# get the top 10 items with engagement >= third quartile of all posts
# and don't include duplicates in the report
mutate(fs$items, published = as.Date(published)) %>% 
  filter(published >= as.Date("2018-12-01")) %>%
  filter(engagement > fivenum(engagement)[4]) %>% 
  filter(!is.na(summary_content)) %>% 
  mutate(alt_url = map_chr(alternate, ~.x[[1]])) %>% 
  distinct(alt_url, .keep_all = TRUE) %>% 
  slice(1:10) -> for_report

# render the report
render_stream(
  feedly_stream = for_report, 
  title = "Cybersecurity News", 
  include_visual = TRUE,
  browse = TRUE
)

Which makes the following Rmd and HTML. (So, no need to “upgrade” to “Teams” to make newsletters!).

FIN

As noted, the 2018 data for R Weekly (Live) & R-bloggers is available and you can find the seymour package on [GL | GH].

If you’re not a Feedly user I strongly encourage you to give it a go! And, if you don’t subscribe to R Weekly, you should make that your first New Year’s Resolution.

Here’s looking to another year of great R content across the R blogosphere!

The sergeant package has a minor update that adds REST API coverage for two “new” storage endpoints that make it possible to add, update and remove storage configurations on-the-fly without using the GUI or manually updating a config file.

This is an especially handy feature when paired with Drill’s new, official Docker container since that means we can:

  • fire up a clean Drill instance
  • modify the storage configuration (to, say, point to a local file system directory)
  • execute SQL ops
  • destroy the Drill instance

all from within R.

This is even more handy for those of us who prefer handling JSON data in Drill rather than in R directly or with sparklyr.

Quick Example

In a few weeks most of the following verbose-code-snippets will have a more diminutive and user-friendly interface within sergeant, but for now we’ll perform the above bulleted steps with some data that was used in a recent new package which was also generated by another recent new package. The zdnsr::zdns_exec() function ultimately generates a deeply nested JSON file that I really prefer working with in Drill before shunting it into R. Said file is stored, say, in the ~/drilldat directory.

Now, I have Drill running all the time on almost every system I use, but we’ll pretend I don’t for this example. I’ve run zdns_exec() and generated the JSON file and it’s in the aforementioned directory. Let’s fire up an instance and connect to it:

library(sergeant) # git[hu|la]b:hrbrmstr/sergeant
library(dplyr)

docker <- Sys.which("docker") # you do need docker. it's a big dependency, but worth it IMO

(system2(
  command = docker,  
  args = c(
    "run", "-i", 
    "--name", "drill-1.14.0", 
    "-p", "8047:8047", 
    "-v", paste0(c(path.expand("~/drilldat"), "/drilldat"), collapse=":"),
    "--detach", 
    "-t", "drill/apache-drill:1.14.0",
    "/bin/bash"
  ),
  stdout = TRUE
) -> drill_container)
## [1] "d6bc79548fa073d3bfbd32528a12669d753e7a19a6258e1be310e1db378f0e0d"

The above snippet fires up a Drill Docker container (downloads it, too, if not already local) and wires up a virtual directory to it.

We should wait a couple seconds and make sure we can connect to it:

drill_connection() %>% 
  drill_active()
## [1] TRUE

Now, we need to add a storage configuration so we can access our virtual directory. Rather than modify dfs we’ll add a drilldat plugin that will work with the local filesystem just like dfs does:

drill_connection() %>%
  drill_mod_storage(
    name = "drilldat",
    config = '
{
  "config" : {
    "connection" : "file:///", 
    "enabled" : true,
    "formats" : null,
    "type" : "file",
    "workspaces" : {
      "root" : {
        "location" : "/drilldat",
        "writable" : true,
        "defaultInputFormat": null
      }
    }
  },
  "name" : "drilldat"
}
')
## $result
## [1] "success"

Now, we can perform all the Drill ops sergeant has to offer, including ones like this:

(db <- src_drill("localhost"))
## src:  DrillConnection
## tbls: cp.default, dfs.default, dfs.root, dfs.tmp, drilldat.default, drilldat.root,
##   INFORMATION_SCHEMA, sys

tbl(db, "drilldat.root.`/*.json`")
## # Source:   table [?? x 10]
## # Database: DrillConnection
##    data                                              name   error class status timestamp          
##                                                                    
##  1 "{\"authorities\":[{\"ttl\":180,\"type\":\"SOA\"… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  2 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  3 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  4 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  5 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  6 "{\"authorities\":[{\"ttl\":1799,\"type\":\"SOA\… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
##  7 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  8 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
##  9 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NXDOM… 2018-09-09 13:18:07
## 10 "{\"authorities\":[],\"protocol\":\"udp\",\"flag… _dmar… NA    IN    NOERR… 2018-09-09 13:18:07
## # ... with more rows

(tbl(db, "(
SELECT
 b.answers.name AS question,
 b.answers.answer AS answer
FROM (
 SELECT 
   FLATTEN(a.data.answers) AS answers
 FROM 
   drilldat.root.`/*.json` a
 WHERE 
   (a.status = 'NOERROR')
) b
)") %>% 
 collect() -> dmarc_recs)
## # A tibble: 1,250 x 2
##    question             answer                                                                    
##  *                                                                                      
##  1 _dmarc.washjeff.edu  v=DMARC1; p=none                                                          
##  2 _dmarc.barry.edu     v=DMARC1; p=none; rua=mailto:dmpost@barry.edu,mailto:7cc566d7@mxtoolbox.d…
##  3 _dmarc.yhc.edu       v=DMARC1; pct=100; p=none                                                 
##  4 _dmarc.aacc.edu      v=DMARC1;p=none; rua=mailto:DKIM_DMARC@aacc.edu;ruf=mailto:DKIM_DMARC@aac…
##  5 _dmarc.sagu.edu      v=DMARC1; p=none; rua=mailto:Office365contact@sagu.edu; ruf=mailto:Office…
##  6 _dmarc.colostate.edu v=DMARC1; p=none; pct=100; rua=mailto:re+anahykughvo@dmarc.postmarkapp.co…
##  7 _dmarc.wne.edu       v=DMARC1;p=quarantine;sp=none;fo=1;ri=86400;pct=50;rua=mailto:dmarcreply@…
##  8 _dmarc.csuglobal.edu v=DMARC1; p=none;                                                         
##  9 _dmarc.devry.edu     v=DMARC1; p=none; pct=100; rua=mailto:devry@rua.agari.com; ruf=mailto:dev…
## 10 _dmarc.sullivan.edu  v=DMARC1; p=none; rua=mailto:mcambron@sullivan.edu; ruf=mailto:mcambron@s…
## # ... with 1,240 more rows

Finally (when done), we can terminate the Drill container:

system2(
  command = "docker",
  args = c("rm", "-f", drill_container)
)

FIN

Those system2() calls are hard on the eyes and a pain to type/remember, so they’ll be wrapped in some sergeant utility functions (I’m hesitant to add a reticulate dependency to sergeant, which would be necessary to use the docker package, hence the system-call wrapper approach).
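A purely illustrative sketch of what such a wrapper might look like (drill_down() is a hypothetical name, not part of sergeant):

# hypothetical helper: remove a (possibly running) Drill container by id
drill_down <- function(container_id, docker = Sys.which("docker")) {
  system2(command = docker, args = c("rm", "-f", container_id), stdout = TRUE)
}

drill_down(drill_container)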

Check your favorite repository for more sergeant updates and file issues if you have suggestions for how you’d like this Docker API for Drill to be controlled.

A previous post explored how to deal with Amazon Athena queries asynchronously. The function presented is a beast, though it is on purpose (to provide options for folks).

In reality, nobody really wants to use rJava wrappers much anymore and dealing with icky Python library calls directly just feels wrong, plus Python functions often return truly daft/ugly data structures. R users deserve better than that.

R is not a first-class citizen in Amazon-land and while the cloudyr project does a fine job building native-R packages for various Amazon services, the fact remains that the official Amazon SDKs are in other languages. The reticulate package provides an elegant interface to Python so it seemed to make sense to go ahead and wrap the boto3 Athena client into something more R-like and toss in the collect_async() function for good measure.

Dependencies

I forced a dependency on Python 3.5 because friends don’t let friends rely on dated, fragmented ecosystems. Python versions can gracefully (mostly) coexist so there should be no pain/angst associated with keeping an updated Python 3 environment around. As noted in the package, I highly recommend adding RETICULATE_PYTHON=/usr/local/bin/python3 to your R environment (~/.Renviron is a good place to store it) since it will help reticulate find the proper target.

If boto3 is not installed, you will need to do pip3 install boto3 to ensure you have the necessary Python module available and associated with your Python 3 installation.
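A quick way to confirm reticulate found the intended Python 3 and can see boto3 (both functions are standard reticulate exports):

library(reticulate)

py_config()                   # shows which Python reticulate bound to
py_module_available("boto3")  # should be TRUE after pip3 install boto3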

It may seem obvious, but an Amazon AWS account is also required and you should be familiar with the Athena service and AWS services in general. Most of the roto.athena functions have a set of optional parameters:

  • aws_access_key_id
  • aws_secret_access_key
  • aws_session_token
  • region_name
  • profile_name

Ideally, these should be set up in the proper configuration files and you should let boto3 handle the details of retrieving them. One parameter you will see used in many of my examples is profile_name = "personal". I have numerous AWS accounts and manage them via the profile ids. By ensuring the AWS configuration files are thoroughly populated, I avoid the need to load and pass around the various keys and/or tokens most AWS SDK API calls require. You can read more about profile management in the official docs: 1, 2.

Usage

The project README and package manual pages are populated and have a smattering of usage examples. It is likely you will really just want to execute a manually prepared SQL query and retrieve the results or do the dplyr dance and collect the results asynchronously. We’ll cover both of those use-cases now, starting with a manual SQL query.

If you have not deleted it, your Athena instance comes with a sampledb that contains an elb_logs table. We’ll use that for our example queries. First, let’s get the packages we’ll be using out of the way:

library(DBI) # for dplyr access later
library(odbc) # for dplyr access later
library(roto.athena) # hrbrmstr/roto.athena on gh or gl
library(tidyverse) # b/c it rocks

Now, we’ll prepare and execute the query. This is a super-simple one:

query <- "SELECT COUNT(requestip) AS ct FROM elb_logs"

start_query_execution(
  query = query,
  database = "sampledb",
  output_location = "s3://aws-athena-query-results-redacted",
  profile = "personal"
) -> qex_id

The qex_id contains the query execution id. We can pass that along to get information on the status of the query:

get_query_execution(qex_id, profile = "personal") %>%
  glimpse()
## Observations: 1
## Variables: 10
## $ query_execution_id   "7f8d8bd6-9fe6-4a26-a021-ee10470c1048"
## $ query                "SELECT COUNT(requestip) AS ct FROM elb_logs"
## $ output_location      "s3://aws-athena-query-results-redacted/7f...
## $ database             "sampledb"
## $ state                "RUNNING"
## $ state_change_reason  NA
## $ submitted            "2018-07-20 11:06:06.468000-04:00"
## $ completed            NA
## $ execution_time_ms    NA
## $ bytes_scanned        NA

If the state is not SUCCEEDED then you’ll need to be patient before trying to retrieve the results.
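One low-tech way to wait is to poll get_query_execution() until the state settles (a sketch; QUEUED/RUNNING/SUCCEEDED/FAILED/CANCELLED are the standard Athena execution states):

res <- get_query_execution(qex_id, profile = "personal")
while (res$state %in% c("QUEUED", "RUNNING")) {
  Sys.sleep(2) # be kind to the API
  res <- get_query_execution(qex_id, profile = "personal")
}
res$state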

get_query_results(qex_id, profile = "personal")
## # A tibble: 1 x 1
##   ct             
##   
## 1 4229

Now, we’ll use dplyr via the Athena ODBC driver:

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AwsProfile = "personal",
  AuthenticationType = "IAM Profile",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

elb_logs <- tbl(con, "elb_logs")

I've got the ODBC DBI fragment in a parameterized RStudio snippet, and others may find that a time-saver if you're not doing that already.

Now to build and submit the query:

mutate(elb_logs, tsday = substr(timestamp, 1, 10)) %>%
  filter(tsday == "2014-09-29") %>%
  select(requestip, requestprocessingtime) %>%
  collect_async(
    database = "sampledb",
    output_location = "s3://aws-athena-query-results-redacted",
    profile_name = "personal"
  ) -> qex_id

As noted in the previous blog post, collect_async() turns the dplyr chain into a SQL query then fires off the whole thing to start_query_execution() for you and returns the query execution id:

get_query_execution(qex_id, profile = "personal") %>%
  glimpse()
## Observations: 1
## Variables: 10
## $ query_execution_id   "95bd158b-7790-42ba-aa83-e7436c3470fe"
## $ query                "SELECT \"requestip\", \"requestprocessing...
## $ output_location      "s3://aws-athena-query-results-redacted/95...
## $ database             "sampledb"
## $ state                "RUNNING"
## $ state_change_reason  NA
## $ submitted            "2018-07-20 11:06:12.817000-04:00"
## $ completed            NA
## $ execution_time_ms    NA
## $ bytes_scanned        NA
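As an aside, that translation step can be approximated by hand: render the lazy dplyr chain to SQL with dbplyr (which backs the tbl() above) and submit it yourself. This is a sketch of the idea, not roto.athena's internals:

mutate(elb_logs, tsday = substr(timestamp, 1, 10)) %>%
  filter(tsday == "2014-09-29") %>%
  select(requestip, requestprocessingtime) %>%
  dbplyr::sql_render() %>%   # the SQL that collect_async() would build
  as.character() %>%
  start_query_execution(
    database = "sampledb",
    output_location = "s3://aws-athena-query-results-redacted",
    profile = "personal"
  )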

Again, you'll need to be patient and wait for the state to be SUCCEEDED to retrieve the results.

get_query_results(qex_id, profile = "personal")
## # A tibble: 774 x 2
##    requestip       requestprocessingtime
##                               
##  1 255.48.150.122              0.0000900
##  2 249.213.227.93              0.0000970
##  3 245.108.120.229             0.0000870
##  4 241.112.203.216             0.0000940
##  5 241.43.107.223              0.0000760
##  6 249.117.98.137              0.0000830
##  7 250.134.112.194             0.0000630
##  8 250.200.171.222             0.0000540
##  9 248.193.76.218              0.0000820
## 10 250.57.61.131               0.0000870
## # ... with 764 more rows

You can also use the query execution id to sync the resultant CSV from S3. Which approach is more performant is something you'll need to test, since it varies with AWS region, result set size, your network connection and other environmental factors. One benefit of using get_query_results() is that it uses the column types to set the data frame column types appropriately (I still need to set up a full test of all possible types, so not all are handled yet).
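
If you do want to go the S3 route, Athena drops the result file into the output location as <query-execution-id>.csv, so a rough sketch like the one below (the aws.s3 package and the local file name are my choices here, not part of roto.athena) will fetch and read it:

# the result object key is just the query execution id with a .csv suffix
aws.s3::save_object(
  object = sprintf("%s.csv", qex_id),
  bucket = "aws-athena-query-results-redacted",
  file = "elb_subset.csv"
)
readr::read_csv("elb_subset.csv")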

Kick the tyres

The package is up on both GitLab and GitHub and any and all feedback (i.e. Issues) or tweaks (i.e. PRs) are most welcome.

Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through the Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a unique character value, and it's fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg

You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn’t think so. Thus begat the xattrs package.

Exploring Past Downloads

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let’s see what is hidden with the RStudio disk image:


library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse) # we'll need this later

list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"    
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"   

There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:


get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89

Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:


get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"

So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).

There will eventually be a native Rust-backed property list reading package for R, but for now we can work with that binary plist data in two ways: via the read_bplist() function that comes with the xattrs package and wraps Linux/BSD or macOS system utilities (super expensive, since it also means writing the data out to a file each time), or by turning to Python, which already has this capability. We're going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python — don't judge — installs and use the RETICULATE_PYTHON env var to pick the one I use with R). You'll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.


biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/" 

That's much better.

Let's work with metadata for the whole directory:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
##    file            name                                  size contents   
##                                                     
##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16  
##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110 
##  3 1109.1968.pdf   com.apple.quarantine                    74  
##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16  
##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116 
##  6 1109.1968v1.pdf com.apple.quarantine                    74  
##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86  
##  8 1804.09970.pdf  com.apple.quarantine                    82  
##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16  
## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104 
## # ... with 14 more rows

count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
##   name                                     n
##                                   
## 1 com.apple.metadata:kMDItemWhereFroms     9
## 2 com.apple.quarantine                     9
## 3 com.apple.lastuseddate#PS                4
## 4 com.apple.diskimages.fsck                1
## 5 com.apple.diskimages.recentcksum         1

Now we can focus on the task at hand: recovering the URLs:


list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id="file") %>% 
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% 
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>% 
  select(file, where_from) %>% 
  unnest() %>% 
  filter(!where_from == "")
## # A tibble: 15 x 2
##    file                where_from                                                       
##                                                                               
##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf                              
##  2 1109.1968.pdf       https://www.google.com/                                          
##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf                        
##  4 1109.1968v1.pdf     https://www.google.com/                                          
##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf                             
##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf           
##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf                             
##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf        
##  9 athena-ug.pdf       https://www.google.com/                                          
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz 
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/                    
## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 13 osquery-3.2.4.pkg   https://osquery.io/downloads/official/3.2.4                      
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/             

(There are multiple URL entries due to the fact that some browsers preserve the path you traversed to get to the final download.)

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.
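
For the curious, the Python-free fallback is nearly the same pipeline. I'm assuming here that read_bplist() will accept the raw contents column directly (it writes a temp file behind the scenes, hence the slowness), so double-check the package manual before relying on it:

list.files("~/Downloads", full.names = TRUE) %>% 
  keep(has_xattrs) %>% 
  set_names(basename(.)) %>% 
  map_df(read_xattrs, .id = "file") %>% 
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% 
  mutate(where_from = map(contents, read_bplist)) # list-columns, not tidy text vectors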

FIN

Have some fun exploring what other secrets your OS may be hiding from you and if you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.
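
As a parting sketch (I'm going off the package's naming scheme here, so verify set_xattr()/rm_xattr() and their argument order against the manual), tagging a file with your own metadata might look like:

set_xattr("~/Data/analysis.csv", "is.rud.xattrs.source", "https://example.com/source-of-record") # hypothetical key & URL
get_xattr("~/Data/analysis.csv", "is.rud.xattrs.source")
rm_xattr("~/Data/analysis.csv", "is.rud.xattrs.source") # ...and clean it back up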

Much of what I need to do for work-work involves using tools that are (for the moment) not in R. Today, I needed to test the validity of (and do other processing on) DMARC records, and I'm loath to either reinvent the wheel or reticulate bits from a fragmented programming language ecosystem unless absolutely necessary. Thankfully, there's libopendmarc, which works well on sane operating systems, but it is a C library that needs an interface to use in R.

However, I also really didn’t want to start a new package for this just yet (there will eventually be one, though, and I prefer working in a package context for Rcpp work). I just needed to run opendmarc_policy_store_dmarc() against a decent-sized chunk of domain names and already-retrieved DMARC TXT records. So, I decided to write a small “inline” cppFunction() to get’er done.

Why am I blogging about this?

Despite growing popularity and a nice examples site, many newcomers to Rcpp (literally the way you want to go when it comes to bridging C[++] and R) still voice discontent about there not being enough "easy" examples. Granted, they are quite likely looking for full-bore tutorials covering different, explicit use cases. The aforelinked Gallery has some of those and there are codified examples in — literally — rcppexamples. But, there definitely need to be more blog posts, books and such linking to them and expanding upon them.

Having mentioned that I'm using cppFunction(), one could further ask "cppFunction() has a help page with an example, so why blather about using it?". Fair point! And, there is a reason, which was hinted at in the opening paragraph.

I need to use libopendmarc and that requires making a "plugin" if I'm going to do this "inline". For some other DMARC processing I also need libresolv, since the library makes DNS requests and uses resolv. You don't need a plugin for a package version, as you just need to boilerplate some "find these libraries and get their paths right for Makevars.in" and add the linking code in there as well. Here, we need to register two plugins that provide metadata for the magic that happens under the covers when Rcpp takes your inline code, compiles it and makes the function's shared object available in R.

Plugins can be complex and do transformations, but the two I needed to write are just helping ensure the right #include lines are there along with the right linker libraries. Here they are:

library(Rcpp)

registerPlugin(
  name = "libresolv",
  plugin = function(x) {
    list(
      includes = "",
      env = list(PKG_LIBS="-lresolv")
    )
  }
)

registerPlugin(
  name = "libopendmarc",
  plugin = function(x) {
    list(
      includes = "#include <opendmarc/dmarc.h>",
      env = list(PKG_LIBS="-lopendmarc")
    )
  }
)

All they do is return lists of metadata (the includes and the PKG_LIBS linker flags) that Rcpp folds in when it compiles the inline code. We can use inline::getPlugin() to see them:

inline::getPlugin("libresolv")
## $includes
## [1] ""
##
## $env
## $env$PKG_LIBS
## [1] "-lresolv"


inline::getPlugin("libopendmarc")
## $includes
## [1] "#include <opendmarc/dmarc.h>"
## 
## $env
## $env$PKG_LIBS
## [1] "-lopendmarc"

Finally, the tiny bit of C/C++ code that takes in the necessary parameters and returns the result. In this case, we're passing in character vectors of domain names and DMARC records and getting back a logical vector with the test results. Apart from the necessary initialization and cleanup code for libopendmarc, this is an idiom you'll recognize if you look over packages that use Rcpp.

cppFunction('
std::vector< bool > is_dmarc_valid(std::vector< std::string> domains,
                                   std::vector< std::string> dmarc_records) {

  std::vector< bool > out(dmarc_records.size());

  DMARC_POLICY_T *pctx;
  OPENDMARC_STATUS_T status;

  pctx = opendmarc_policy_connect_init((u_char *)"1.2.3.4", 0);

  for (unsigned int i=0; i<dmarc_records.size(); i++) {

    status = opendmarc_policy_store_dmarc(
      pctx,
      (u_char *)dmarc_records[i].c_str(),
      (u_char *)domains[i].c_str(),
      NULL
    );

    out[i] = (status == DMARC_PARSE_OKAY);

    pctx = opendmarc_policy_connect_rset(pctx);

  }

  pctx = opendmarc_policy_connect_shutdown(pctx);

  return(out);

}
', plugins = c("libresolv", "libopendmarc"))


Right at the end, the plugins parameter tells cppFunction() which plugins to use.

Executing that line shunts a modified version of the function to disk, compiles it and lets us use the function in R (use the cacheDir, showOutput and verbose parameters to control how many gory details lie underneath this pristine shell).
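
If the build misbehaves, the same call can be re-run with the diagnostics turned up; dmarc_code below is just a hypothetical stand-in for the big C++ string passed above:

cppFunction(
  dmarc_code,                  # stand-in for the C++ source string shown above
  plugins = c("libresolv", "libopendmarc"),
  showOutput = TRUE,           # echo the compiler invocation & messages
  verbose = TRUE,              # show the generated wrapper scaffolding
  cacheDir = "~/.rcpp-cache"   # keep the built shared object between runs
)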

After running the function, is_dmarc_valid() is available in the environment and ready to use.

domains <- c("bit.ly", "bizible.com", "blackmountainsystems.com", "blackspoke.com")
dmarc <-  c("v=DMARC1; p=none; pct=100; rua=mailto:dmarc@bit.ly; ruf=mailto:ruf@dmarc.bitly.net; fo=1;", 
            "v=DMARC1; p=reject; fo=1; rua=mailto:postmaster@bizible.com; ruf=mailto:forensics@bizible.com;", 
            "v=DMARC1; p=quarantine; pct=100; rua=mailto:demarcrecords@blkmtn.com, mailto:ttran@blkmtn.com", 
            "user.cechire.com.")

is_dmarc_valid(domains, dmarc)
## [1]  TRUE  TRUE  TRUE FALSE

Processing those four took just about 10 microseconds, which meant I could process the ~1,000,000 domains+DMARCs in no time at all. And, I have something I can use in a DMARC utility package (coming "soon").
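
If you want to verify timings like that on your own machine, a quick pass with the microbenchmark package (my addition here, not something the run above used) will do it:

library(microbenchmark)
microbenchmark(is_dmarc_valid(domains, dmarc), times = 1000L)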

Hopefully this was a useful reference for both hooking up external libraries to “inline” Rcpp functions and for how to go about doing this type of thing in general.

I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for that, and it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-"main text content" from an HTML document, and it usually does a good job, but there are some pages it fails miserably on since it's more of a brute-force method than one that applies any real "intelligence" to the text node targeting.

Most modern browsers have inherent or plugin-able "readability" capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that uses a similar methodology, but I'm not aware of any R ports.

What do I mean by "clean text"? Well, I can't show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here's what the raw HTML looks like after retrieving it:

library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
## [1] "\n \n<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <meta charset=\"utf-8\">\n <meta name=\"apple-mobile-web-app-capable\" content=\"yes\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" />\n <meta name=\"apple-mobile-web-app-status-bar-style\" content=\"black\" />\n <link rel=\"shortcut icon\" href=\"/assets/flat-ui/images/favicon.ico\">\n\n <link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://ropensci.org/feed.xml\" />\n\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/bootstrap/css/bootstrap.css\">\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/css/flat-ui.css\">\n\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/icon-font.css\">\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/animations.css\">\n <link rel=\"stylesheet\" href=\"/static/css/style.css\">\n <link href=\"/assets/css/ss-social/webfonts/ss-social.css\" rel=\"stylesheet\" />\n <link href=\"/assets/css/ss-standard/webfonts/ss-standard.css\" rel=\"stylesheet\"/>\n <link rel=\"stylesheet\" href=\"/static/css/github.css\">\n <script type=\"text/javascript\" src=\"//use.typekit.net/djn7rbd.js\"></script>\n <script type=\"text/javascript\">try{Typekit.load();}catch(e){}</script>\n <script src=\"/static/highlight.pack.js\"></script>\n <script>hljs.initHighlightingOnLoad();</script>\n\n <title>Onboarding visdat, a tool for preliminary visualisation of whole dataframes</title>\n <meta name=\"keywords\" content=\"R, software, package, review, community, visdat, data-visualisation\" />\n <meta name=\"description\" content=\"\" />\n <meta name=\"resource_type\" content=\"website\"/>\n <!– RDFa Metadata (in DublinCore) –>\n <meta property=\"dc:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"dc:creator\" content=\"\" />\n <meta property=\"dc:date\" content=\"\" />\n <meta property=\"dc:format\" content=\"text/html\" />\n <meta property=\"dc:language\" content=\"en\" />\n <meta property=\"dc:identifier\" content=\"/blog/blog/2017/08/22/visdat\" />\n <meta property=\"dc:rights\" content=\"CC0\" />\n <meta property=\"dc:source\" content=\"\" />\n <meta property=\"dc:subject\" content=\"Ecology\" />\n <meta property=\"dc:type\" content=\"website\" />\n <!– RDFa Metadata (in OpenGraph) –>\n <meta property=\"og:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"og:author\" content=\"/index.html#me\" /> <!– Should be Liquid? URI? –>\n <meta property=\"http://ogp.me/ns/profile#first_name\" content=\"\"/>\n <meta property=\"http://ogp.me/ns/profile#last_name\" content=\"\"/>\n

(it goes on for a bit, best to run the code locally)

We can use the reticulate package to load the Python readability module to just get the clean, article text:

readability <- import("readability") # pip install readability-lxml

doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))

doc$summary() %>%
  read_xml() %>%
  xml_text()
# [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.

(again, it goes on for a bit, best to run the code locally)

That text is now in good enough shape to tidy.

Here’s the same version with clean_text():

# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n  \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.

(lastly, it goes on for a bit, best to run the code locally)

As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.

The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).

The reticulate package provides a very clean & concise interface bridge between R and Python, which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to create parquet files directly from R, using reticulate as a bridge to the pyarrow module, which can natively write parquet files.

Now, you can create parquet files through R with Apache Drill — and I'll provide another example of that here — but you may need to generate such files without having the ability to run Drill.

The Python parquet process is pretty simple since you can convert a pandas DataFrame directly to a pyarrow Table which can be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:

library(reticulate)

pd <- import("pandas", "pd")
pa <- import("pyarrow", "pa")
pq <- import("pyarrow.parquet", "pq")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)
mtcars_tab <- pa$Table$from_pandas(mtcars_df)

pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))

I wouldn’t want to do that for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms for creation of larger parquet structures). Here’s how we’d do that with Drill via the sergeant package:

readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")
dc <- drill_connection("localhost")
drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")

Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files and also includes an index column, since those are needed in Python DataFrames (ugh), but small-ish data frames will end up in a single file whereas the Drill-created ones will be in a directory with an additional CRC file (and, much smaller by default). NOTE: You can use preserve_index=False on the call to Table.from_pandas to get rid of that icky index.
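
In reticulate terms that's just an extra named argument; preserve_index is a documented Table.from_pandas parameter, so only the output file name below is mine:

mtcars_noidx <- pa$Table$from_pandas(mtcars_df, preserve_index = FALSE)
pq$write_table(mtcars_noidx, path.expand("~/Data/mtcars_python_noidx.parquet"))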

It’s fairly efficient even for something like nycflights13::flights which has ~330K rows and 19 columns:

system.time(
  r_to_py(nycflights13::flights) %>% 
  pd$DataFrame$from_dict() %>% 
  pa$Table$from_pandas() %>% 
  pq$write_table(where = "/tmp/flights.parquet")
)
##    user  system elapsed 
##   1.285   0.108   1.398 

If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.
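
A quick way to sanity-check the output (assuming the import() calls above left reticulate's automatic conversion on, which is the default, so the pandas frame comes back as an R data frame) is to round-trip it and compare dimensions:

flights_chk <- pq$read_table("/tmp/flights.parquet")$to_pandas()
dim(flights_chk)            # compare against...
dim(nycflights13::flights)  # ...the original (modulo the pandas index column noted above)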

UPDATE (2018-01-25)

APIs change and, while the above still works, there's a slightly simpler way now:

library(reticulate)

pd <- import("pandas", "pd")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)

mtcars_df$to_parquet(path.expand("~/Data/mtcars_python.parquet"), "pyarrow")