Category Archives: R

I was chatting with a fellow Amazon Athena user when the topic of using Presto functions such as approx_distinct() via {d[b]plyr} came up, and it seems it might not be common knowledge that any function without an existing translation is passed to the destination intact. That means you can just “use” approx_distinct() and it will work just fine. Here’s an example using the ODBC {DBI} interface:

library(dbplyr)
library(tidyverse)

# My personal Athena workgroup has been upgraded to "engine 2"
# so Presto 0.217 functions are available. Only noting that for
# folks who may not keep up with AWS announcements.
#
# https://prestodb.io/docs/0.217/index.html

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AuthenticationType = "IAM Profile",
  AWSProfile = "personal",
  MaxCatalogNameLen = 0L,
  MaxSchemaNameLen = 0L,
  MaxColumnNameLen = 0L,
  MaxTableNameLen = 0L,
  UseResultsetStreaming = 1L,
  StringColumnLength = 32 * 1024L,
  S3OutputLocation = "s3://accessible-bucket/"
) -> con

# this comes with Athena
elb_logs <- tbl(con, "elb_logs")

elb_logs
## # Source:   table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    timestamp elbname requestip requestport backendip backendport
##    <chr>     <chr>   <chr>           <int> <chr>           <int>
##  1 2014-09-… lb-demo 251.51.8…       17141 251.111.…        8000
##  2 2014-09-… lb-demo 244.201.…       17141 244.140.…        8888
##  3 2014-09-… lb-demo 242.204.…       17141 255.196.…        8888
##  4 2014-09-… lb-demo 251.51.8…       17141 255.129.…        8888
##  5 2014-09-… lb-demo 242.241.…       17141 255.129.…        8899
##  6 2014-09-… lb-demo 243.198.…       17141 255.129.…        8888
##  7 2014-09-… lb-demo 244.119.…       17141 242.89.1…          80
##  8 2014-09-… lb-demo 254.173.…       17141 251.51.8…        8000
##  9 2014-09-… lb-demo 243.198.…       17141 254.149.…        8888
## 10 2014-09-… lb-demo 249.185.…       17141 241.36.2…        8888
## # … with more rows, and 10 more variables: requestprocessingtime <dbl>,
## #   backendprocessingtime <dbl>, clientresponsetime <dbl>,
## #   elbresponsecode <chr>, backendresponsecode <chr>,
## #   receivedbytes <int64>, sentbytes <int64>, requestverb <chr>,
## #   url <chr>, protocol <chr>

elb_logs %>% 
  summarise(d = n_distinct(backendip)) # 0.62 seconds
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##         d
##   <int64>
## 1    2311

# https://prestodb.io/docs/0.217/functions/aggregate.html#approx_distinct

elb_logs %>% 
  summarise(d = approx_distinct(backendip)) # 0.49 seconds
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##         d
##   <int64>
## 1    2386

In this toy example there’s no real reason to use this alternate function, but on my datasets the approximating version dramatically reduces query time, reduces query cost, and produces results that by default have a standard error of 2.3% (which is fine for the use-cases I apply this to). There’s an alternate signature which lets you supply the standard error, as well.
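
Since it, too, is just passed through untranslated, supplying that bound looks like the following (a sketch; per the Presto 0.217 docs the second argument is the maximum standard error to allow):

elb_logs %>% 
  summarise(d = approx_distinct(backendip, 0.0115)) # 0.0115 = max standard error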

If you’re curious as to what functions are translated by default, just use sql_translate_env() on the connection object:

sql_translate_env(con)
## <sql_variant>
## scalar:    -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar:    %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar:    as_date, as_datetime, as.character, as.Date,
## scalar:    as.double, as.integer, as.integer64, as.logical,
## scalar:    as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar:    bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar:    bitwXor, c, case_when, ceil, ceiling, coalesce, cos,
## scalar:    cosh, cot, coth, day, desc, exp, floor, hour, if,
## scalar:    if_else, ifelse, is.na, is.null, log, log10, mday,
## scalar:    minute, month, na_if, nchar, now, paste, paste0, pmax,
## scalar:    pmin, qday, round, second, sign, sin, sinh, sql, sqrt,
## scalar:    str_c, str_conv, str_count, str_detect, str_dup,
## scalar:    str_extract, str_extract_all, str_flatten, str_glue,
## scalar:    str_glue_data, str_interp, str_length, str_locate,
## scalar:    str_locate_all, str_match, str_match_all, str_order,
## scalar:    str_pad, str_remove, str_remove_all, str_replace,
## scalar:    str_replace_all, str_replace_na, str_sort, str_split,
## scalar:    str_split_fixed, str_squish, str_sub, str_subset,
## scalar:    str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar:    str_trunc, str_view, str_view_all, str_which,
## scalar:    str_wrap, substr, substring, switch, tan, tanh, today,
## scalar:    tolower, toupper, trimws, wday, xor, yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window:    cume_dist, cummax, cummean, cummin, cumsum,
## window:    dense_rank, first, lag, last, lead, max, mean, median,
## window:    min, min_rank, n, n_distinct, nth, ntile, order_by,
## window:    percent_rank, quantile, rank, row_number, sd, sum, var

The release of the latest versions of {d[b]plyr} destroyed a lazy, bad hack I was using to cast columns to JSON (you’ll note the lack of a cast() function above; one is necessary for Athena, but CAST(col AS type) syntax is not that of a regular function call, so the default passthrough can’t handle it). I’m _very_ glad they did, since it’s bad to rely on undocumented functionality and, honestly, it’s pretty straightforward to make an “official” translation for them.

First, we need the class of this Athena ODBC connection:

class(con)
## [1] "Amazon Athena"
## attr(,"package")
## [1] ".GlobalEnv"

We’ll need to write a sql_translation.Amazon Athena() function for this connection class and we’ll start with writing one that doesn’t handle our casting just to show the basic setup:

`sql_translation.Amazon Athena` <- function(x) {
  sql_variant(
    dbplyr::base_odbc_scalar,
    dbplyr::base_odbc_agg,
    dbplyr::base_odbc_win
  )
}

All that function is doing (now) is setting up the default translators you’ve seen in the above output listings.

To make it do something else, we need to add casting translator helpers, which fall under the “scalar” category. This, too, is pretty straightforward since {dbplyr} makes it possible to just extend a parent set of category translators:

sql_translator(
  .parent = dbplyr::base_odbc_scalar,
  cast_as = function(x, y) dbplyr::build_sql("CAST(", x, " AS ", y, ")"),
  try_cast_as = function(x, y) dbplyr::build_sql("TRY_CAST(", x, " AS ", y, ")")
) -> athena_scalar

`sql_translation.Amazon Athena` <- function(x) {
  sql_variant(
    athena_scalar,
    dbplyr::base_odbc_agg,
    dbplyr::base_odbc_win
  )
}

Now, let’s see if it really knows about our new casting functions:

sql_translate_env(con)
## <sql_variant>
## scalar:    -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar:    %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar:    as_date, as_datetime, as.character, as.Date,
## scalar:    as.double, as.integer, as.integer64, as.logical,
## scalar:    as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar:    bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar:    bitwXor, c, case_when, cast_as, ceil, ceiling,
## scalar:    coalesce, cos, cosh, cot, coth, day, desc, exp, floor,
## scalar:    hour, if, if_else, ifelse, is.na, is.null, log, log10,
## scalar:    mday, minute, month, na_if, nchar, now, paste, paste0,
## scalar:    pmax, pmin, qday, round, second, sign, sin, sinh, sql,
## scalar:    sqrt, str_c, str_conv, str_count, str_detect, str_dup,
## scalar:    str_extract, str_extract_all, str_flatten, str_glue,
## scalar:    str_glue_data, str_interp, str_length, str_locate,
## scalar:    str_locate_all, str_match, str_match_all, str_order,
## scalar:    str_pad, str_remove, str_remove_all, str_replace,
## scalar:    str_replace_all, str_replace_na, str_sort, str_split,
## scalar:    str_split_fixed, str_squish, str_sub, str_subset,
## scalar:    str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar:    str_trunc, str_view, str_view_all, str_which,
## scalar:    str_wrap, substr, substring, switch, tan, tanh, today,
## scalar:    tolower, toupper, trimws, try_cast_as, wday, xor,
## scalar:    yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window:    cume_dist, cummax, cummean, cummin, cumsum,
## window:    dense_rank, first, lag, last, lead, max, mean, median,
## window:    min, min_rank, n, n_distinct, nth, ntile, order_by,
## window:    percent_rank, quantile, rank, row_number, sd, sum, var

Aye! Let’s test it out.

Unfortunately, this boring, default database has no MAP columns to really show this off, but we can convert a simple character column into JSON just to get the idea:

elb_logs %>% 
  select(backendip)
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    backendip      
##    <chr>          
##  1 249.6.80.219   
##  2 248.178.189.65 
##  3 254.70.228.23  
##  4 248.178.189.65 
##  5 252.0.81.65    
##  6 248.178.189.65 
##  7 245.241.133.121
##  8 244.202.183.67 
##  9 255.226.190.127
## 10 246.22.152.210 
## # … with more rows

elb_logs %>% 
  select(backendip) %>% 
  mutate(
    backendip = cast_as(backendip, JSON)
  )
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    backendip            
##    <chr>                
##  1 "\"244.238.214.120\""
##  2 "\"248.99.214.228\"" 
##  3 "\"243.3.190.175\""  
##  4 "\"246.235.181.255\""
##  5 "\"241.112.203.216\""
##  6 "\"240.147.242.82\"" 
##  7 "\"248.99.214.228\"" 
##  8 "\"248.99.214.228\"" 
##  9 "\"253.161.243.121\""
## 10 "\"248.99.214.228\"" 
## # … with more rows
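
If you want to double-check the SQL our new helper generates before Athena sees it, show_query() will render the translation (a CAST(… AS JSON) expression should show up in the SELECT):

elb_logs %>% 
  select(backendip) %>% 
  mutate(backendip = cast_as(backendip, JSON)) %>% 
  show_query()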

FIN

Despite the {tidyverse} documentation being written with care and clarity, this part of the R ecosystem is so extensive and evolving that watching out for all the doors and corners can be tricky. It’s easy for the short paragraph on the “untranslated function” capability to be overlooked and it may be hard to fully grok the translation concept without an IRL example.

Hopefully this helped (even if only a little) demystify these two areas of {d[b]plyr}.

The past two posts have (lightly) introduced how to use compiled Swift code in R, but they’ve involved a bunch of “scary” command line machinations and incantations.

One feature of {Rcpp} I’ve always 💙 is the cppFunction() (“r-lib” zealots have a similar cpp11::cpp_function()) which lets one experiment with C[++] code in R with as little friction as possible. To make it easier to start experimenting with Swift, I’ve built an extremely fragile swift_function() in {swiftr} that intends to replicate this functionality. Explaining it will be easier with an example.

Reading Property Lists in R With Swift

macOS relies heavily on property lists for many, many things. These can be plain text (XML) or binary files, and there are command-line tools and Python libraries (usable via {reticulate}) that can read them, along with the good ol’ XML::readKeyValueDB(). We’re going to create a Swift function to read property lists and return JSON which we can use back in R via {jsonlite}.
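
For comparison, the pure-R route mentioned above looks something like this sketch (readKeyValueDB() handles the XML flavor of plists, so binary ones would need converting first):

plist <- XML::readKeyValueDB("/Applications/RStudio.app/Contents/Info.plist")
str(plist, 1)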

This time around there’s no need to create extra files; just install {swiftr}, fire up your favorite R IDE, and enter the following (exposition follows the code):

library(swiftr)

swift_function(
  code = '

func ignored() {
  print("""
this will be ignored by swift_function() but you could use private
functions as helpers for the main public Swift function which will be 
made available to R.
""")
}  

@_cdecl ("read_plist")
public func read_plist(path: SEXP) -> SEXP {

  var out: SEXP = R_NilValue

  do {
    // read in the raw plist
    let plistRaw = try Data(contentsOf: URL(fileURLWithPath: String(cString: R_CHAR(STRING_ELT(path, 0)))))

    // convert it to a PropertyList  
    let plist = try PropertyListSerialization.propertyList(from: plistRaw, options: [], format: nil) as! [String:Any]

    // serialize it to JSON
    let jsonData = try JSONSerialization.data(withJSONObject: plist , options: .prettyPrinted)

    // setup the JSON string return
    String(data: jsonData, encoding: .utf8)?.withCString { 
      cstr in out = Rf_mkString(cstr) 
    }

  } catch {
    debugPrint("\\(error)")
  }

  return(out)

}
')

This new swift_function() function — for the moment (the API is absolutely going to change) — is defined as:

swift_function(
  code,
  env = globalenv(),
  imports = c("Foundation"),
  cache_dir = tempdir()
)

where:

  • code is a length 1 character vector of Swift code
  • env is the environment to expose the function in (defaults to the global environment)
  • imports is a character vector of any extra Swift frameworks that need to be imported
  • cache_dir is where all the temporary files will be created and the compiled dylib will be stored. It defaults to a temporary directory, so specify your own directory (that exists) if you want to keep the files around after you close the R session (see the sketch after this list)
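
Here’s a sketch of that last option (assuming the Swift source from above is stored in a swift_code character vector, and that the directory exists):

swift_function(
  code = swift_code,
  cache_dir = "~/swiftr-cache" # hypothetical path; generated files survive the session
)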

Folks familiar with cppFunction() will notice some (on-purpose) similarities.

The function expects you to expose only one public Swift function which also (for the moment) needs to have the @_cdecl decorator before it. You can have as many other valid Swift helper functions as you like, but are restricted to one function that will be turned into an R function automagically.

In this example, swift_function() will see public func read_plist(path: SEXP) -> SEXP { and be able to identify

  • the function name (read_plist)
  • the number of parameters (they all need to be SEXP, for now)
  • the names of the parameters

A complete source file with all the imports will be created, a pre-built bridging header (which comes along for the ride with {swiftr}) will be included in the compilation step, and a dylib will be built and loaded into the R session. Finally, an R function that wraps a .Call() will be created, with the function name of the Swift function as well as all the parameter names (if any).

In the case of our example, above, the built R function is:

function(path) {
  .Call("read_plist", path)
}

There’s a good chance you’re using RStudio, so we can test this with its property list, or you can substitute any other application’s property list (or any .plist you have):

read_plist("/Applications/RStudio.app/Contents/Info.plist") %>% 
  jsonlite::fromJSON() %>% 
  str(1)
## List of 32
##  $ NSPrincipalClass                     : chr "NSApplication"
##  $ NSCameraUsageDescription             : chr "R wants to access the camera."
##  $ CFBundleIdentifier                   : chr "org.rstudio.RStudio"
##  $ CFBundleShortVersionString           : chr "1.4.1093-1"
##  $ NSBluetoothPeripheralUsageDescription: chr "R wants to access bluetooth."
##  $ NSRemindersUsageDescription          : chr "R wants to access the reminders."
##  $ NSAppleEventsUsageDescription        : chr "R wants to run AppleScript."
##  $ NSHighResolutionCapable              : logi TRUE
##  $ LSRequiresCarbon                     : logi TRUE
##  $ NSPhotoLibraryUsageDescription       : chr "R wants to access the photo library."
##  $ CFBundleGetInfoString                : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ NSLocationWhenInUseUsageDescription  : chr "R wants to access location information."
##  $ CFBundleInfoDictionaryVersion        : chr "6.0"
##  $ NSSupportsAutomaticGraphicsSwitching : logi TRUE
##  $ CSResourcesFileMapped                : logi TRUE
##  $ CFBundleVersion                      : chr "1.4.1093-1"
##  $ OSAScriptingDefinition               : chr "RStudio.sdef"
##  $ CFBundleLongVersionString            : chr "1.4.1093-1"
##  $ CFBundlePackageType                  : chr "APPL"
##  $ NSContactsUsageDescription           : chr "R wants to access contacts."
##  $ NSCalendarsUsageDescription          : chr "R wants to access calendars."
##  $ NSMicrophoneUsageDescription         : chr "R wants to access the microphone."
##  $ CFBundleDocumentTypes                :'data.frame':  16 obs. of  8 variables:
##  $ NSPhotoLibraryAddUsageDescription    : chr "R wants write access to the photo library."
##  $ NSAppleScriptEnabled                 : logi TRUE
##  $ CFBundleExecutable                   : chr "RStudio"
##  $ CFBundleSignature                    : chr "Rstd"
##  $ NSHumanReadableCopyright             : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ CFBundleName                         : chr "RStudio"
##  $ LSApplicationCategoryType            : chr "public.app-category.developer-tools"
##  $ CFBundleIconFile                     : chr "RStudio.icns"
##  $ CFBundleDevelopmentRegion            : chr "English"

FIN

A source_swift() function is on the horizon as is adding a ton of checks/validations to swift_function(). I’ll likely be adding some of the SEXP and R Swift utility functions I’ve demonstrated in the [unfinished] book to make it fairly painless to interface Swift and R code in this new and forthcoming function.

As usual, kick the tyres, submit feature requests and bugs in any forum that’s comfortable, and stay strong, wear a 😷, and stay socially distanced when out and about.

The previous post introduced the topic of how to compile Swift code for use in R using a useless, toy example. This one goes a bit further and makes a case for why one might want to do this by showing how to use one of Apple’s machine learning libraries, specifically the Natural Language one, focusing on extracting parts of speech from text.

I made a parts-of-speech directory to keep the code self-contained. In it are two files. The first is partsofspeech.swift (swiftc seems to dislike dashes in names of library code and I dislike underscores):

import NaturalLanguage
import CoreML

extension Array where Element == String {
  var SEXP: SEXP? {
    let charVec = Rf_protect(Rf_allocVector(SEXPTYPE(STRSXP), count))
    defer { Rf_unprotect(1) }
    for (idx, elem) in enumerated() { SET_STRING_ELT(charVec, idx, Rf_mkChar(elem)) }
    return(charVec)
  }
}

@_cdecl ("part_of_speech")
public func part_of_speech(_ x: SEXP) -> SEXP {

  let text = String(cString: R_CHAR(STRING_ELT(x, 0)))
  let tagger = NLTagger(tagSchemes: [.lexicalClass])

  tagger.string = text

  let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

  var txts = [String]()
  var tags = [String]()

  tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
    if let tag = tag {
      txts.append("\(text[tokenRange])")
      tags.append("\(tag.rawValue)")
    }
    return true
  }

  let out = Rf_protect(Rf_allocVector(SEXPTYPE(VECSXP), 2))
  SET_VECTOR_ELT(out, 0, txts.SEXP)
  SET_VECTOR_ELT(out, 1, tags.SEXP)
  Rf_unprotect(1)

  return(out!)
}

The other is bridge code that seems to be the same for every one of these (or could be) so I’ve just named it swift-r-glue.h (it’s the same as the bridge code in the previous post):

#define USE_RINTERNALS

#include <R.h>
#include <Rinternals.h>

const char* R_CHAR(SEXP x);

Let’s walk through the Swift code.

We need two imports:

import NaturalLanguage
import CoreML

to make use of the NLP functionality provided by Apple.

The following extension on String Arrays:

extension Array where Element == String {
  var SEXP: SEXP? {
    let charVec = Rf_protect(Rf_allocVector(SEXPTYPE(STRSXP), count))
    defer { Rf_unprotect(1) }
    for (idx, elem) in enumerated() { SET_STRING_ELT(charVec, idx, Rf_mkChar(elem)) }
    return(charVec)
  }
}

will reduce the amount of code we need to type later on to turn Swift String Arrays to R character vectors.

The start of the function:

@_cdecl ("part_of_speech")
public func part_of_speech(_ x: SEXP) -> SEXP {

tells swiftc to make this a C-compatible call and notes that the function takes one parameter (in this case, it’s expecting a length 1 character vector) and returns an R-compatible value (which will be a list that we’ll turn into a data.frame in R just for brevity).

The following sets up our inputs and outputs:

  let text = String(cString: R_CHAR(STRING_ELT(x, 0)))
  let tagger = NLTagger(tagSchemes: [.lexicalClass])

  tagger.string = text

  let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

  var txts = [String]()
  var tags = [String]()

We convert the passed-in parameter to a Swift String, initialize the NLP tagger, and setup two arrays to hold the results (sentence component in txts and the part of speech that component is in tags).

The following code is mostly straight from Apple and (inefficiently) populates the previous two arrays:


tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
  if let tag = tag {
    txts.append("\(text[tokenRange])")
    tags.append("\(tag.rawValue)")
  }
  return true
}

Finally, we use the Swift-R bridge to make a list much like one would in C:


let out = Rf_protect(Rf_allocVector(SEXPTYPE(VECSXP), 2))
SET_VECTOR_ELT(out, 0, txts.SEXP)
SET_VECTOR_ELT(out, 1, tags.SEXP)
Rf_unprotect(1)

return(out!)

To get a shared library we can use from R, we just need to compile this like last time:

swiftc \
  -I /Library/Frameworks/R.framework/Headers \
  -F/Library/Frameworks \
  -framework R \
  -import-objc-header swift-r-glue.h \
  -emit-library \
  partsofspeech.swift

Let’s run that on some text! First, we’ll load the new shared library into R:

dyn.load("libpartsofspeech.dylib")

Next, we’ll make a wrapper function to avoid messy .Call(…)s and to make a data.frame:

parts_of_speech <- function(x) {
  res <- .Call("part_of_speech", x)  
  as.data.frame(stats::setNames(res, c("name", "tag")))
}

Finally, let’s try this on some text!

tibble::as_tibble(
  parts_of_speech(paste0(c(
"The comm wasn't working. Feeling increasingly ridiculous, he pushed",
"the button for the 1MC channel several more times. Nothing. He opened",
"his eyes and saw that all the lights on the panel were out. Then he",
"turned around and saw that the lights on the refrigerator and the",
"ovens were out. It wasn’t just the coffeemaker; the entire galley was",
"in open revolt. Holden looked at the ship name, Rocinante, newly",
"stenciled onto the galley wall, and said, Baby, why do you hurt me",
"when I love you so much?"
  ), collapse = " "))
)
## # A tibble: 92 x 2
##    name         tag
##    <chr>        <chr>
##  1 The          Determiner
##  2 comm         Noun
##  3 was          Verb
##  4 n't          Adverb
##  5 working      Verb
##  6 Feeling      Verb
##  7 increasingly Adverb
##  8 ridiculous   Adjective
##  9 he           Pronoun
## 10 pushed       Verb
## # … with 82 more rows

FIN

If you’re playing along at home, try adding a function to this Swift file that uses Apple’s entity tagger.

The next installment of this topic will be how to wrap all this into a package (then all these examples get tweaked and go into the tome).

I’ve been on a Swift + R bender for a while now, but have been envious of the pure macOS/iOS (et al) folks who get to use Apple’s seriously ++good machine learning libraries, which are even more robust on the new M1 hardware (it’s cool having hardware components dedicated to improving the performance of built models).

Sure, it’s pretty straightforward to make a command-line utility that can take data input, run it through models, then haul the data back into R, but I figured it was about time that Swift got the “Rust” and “Go” treatment in terms of letting R call compiled Swift code directly. Thankfully, none of this involves using Xcode since it’s one of the world’s worst IDEs.
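
For the record, the “shell out” approach being dispensed with looks something like this sketch, where nlptool is a stand-in for a hypothetical compiled Swift CLI that reads text on stdin and emits JSON:

# hypothetical round-trip: R -> compiled Swift binary -> R
res <- system2("nlptool", args = "--pos", input = "some text to tag", stdout = TRUE)
jsonlite::fromJSON(paste0(res, collapse = ""))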

To play along at home you’ll need macOS and at least the command line tools installed (I don’t think this requires a full Xcode install, but y’all can let me know if it does in the comments). If you can enter swiftc at a terminal prompt and get back <unknown>:0: error: no input files then you’re good-to-go.

Hello, Swift!

To keep this post short (since I’ll be adding this entire concept to the SwiftR tome), we’ll be super-focused and just build a shared library we can dynamically load into R. That library will have one function which will be to let us say hello to the planet with a customized greeting.

Make a new directory for this effort (I called mine greetings) and create a greetings.swift file with the following contents:

All this code is also in this gist.

@_cdecl ("greetings_from")
public func greetings_from(_ who: SEXP) -> SEXP {
  print("Greetings, 🌎, it's \(String(cString: R_CHAR(STRING_ELT(who, 0))))!")
  return(R_NilValue)
}

Before I explain what’s going on there, also create a greetings.h file with the following contents:

#define USE_RINTERNALS

#include <R.h>
#include <Rinternals.h>

const char* R_CHAR(SEXP x);

In the Swift file, there’s a single function that takes an R SEXP and converts it into a Swift String, which is then routed to stdout (not a “great” R idiom, but benign enough for an intro example). Swift functions aren’t C functions and on their own do not adhere to C calling conventions. Unfortunately, R’s ability to work with dynamic library code requires such a contract to be in place. Thankfully, the Swift Language Overlords provided us with the ability to instruct the compiler to create library code that forces the calling conventions to be C-like (that’s what the @_cdecl is for).

We’re using SEXP, some R C-interface functions, and even the C version of NULL in the Swift code, but we haven’t done anything in the Swift file to tell Swift about the existence of these elements. That’s what the C header file is for (I added the R_CHAR declaration since complex C macros don’t work in Swift).

Now, all we need to do is make sure the compiler knows about the header file (which is a “bridge” between C and Swift), where the R framework is, and that we want to generate a library vs a binary executable file as we compile the code. Make sure you’re in the same directory as both the .swift and .h file and execute the following at a terminal prompt:

# -I                  : where the R headers are
# -F                  : where the R.framework lives
# -framework R        : we want to link against the R framework
# -import-objc-header : our bridging header, which makes R C things available to Swift
# -emit-library       : we want a library, not an exe

swiftc \
  -I /Library/Frameworks/R.framework/Headers \
  -F/Library/Frameworks \
  -framework R \
  -import-objc-header greetings.h \
  -emit-library \
  greetings.swift

If all goes well, you should have a libgreetings.dylib shared library in that directory.

Now, fire up an R console session in that directory and do:

greetings_lib <- dyn.load("libgreetings.dylib")

If there are no errors, the shared library has been loaded into your R session and we can use the function we just made! Let’s wrap it in an R function so we’re not constantly typing .Call(…):

greetings_from <- function(who = "me") {
  invisible(.Call("greetings_from", as.character(who[1])))
}

I also took the opportunity to make sure we are sending a length-1 character vector to the C/Swift function.

Now, say hello!

greetings_from("hrbrmstr")

And you should see:

Greetings, 🌎, it's hrbrmstr!

FIN

We’ll stop there for now, but hopefully this small introduction has shown how straightforward it can be to bridge Swift & R in the other direction.

I’ll have another post that shows how to extend this toy example to use one of Apple’s natural language processing libraries, and may even do one more on how to put all this into a package before I shunt all the individual posts into a few book chapters.

Last week I introduced a new bookdown series on how to embed R into a macOS Swift application.

The initial chapters focused on core concepts and showed how to build a macOS compiled, binary command line application that uses embedded R for some functionality.

This week, a new chapter is up that walks you through how to build a basic SwiftUI application that takes input from the user, performs a computation in R (via embedded R), and displays the result of the computation back to the user.

The app — apart from some of the boilerplate interface code from previous chapters — is ~60 lines of Swift code that ends up consuming ~65 MB of active RAM when run, with almost no energy impact (an equivalent Electron-packaged Shiny app would take 130-200 MB of initial RAM and have a significant, constant energy impact).

There’s sufficient boilerplate in this project to extend it into a basic GUI wrapper for various R operations you have hanging around.

Forthcoming chapters will show how to get graphics out of R and into a SwiftUI window as well as how to make a more diminutive Shiny app wrapper that we’ll eventually be able to ship with an embedded copy of the R framework.

I went completely daft this week and broke my months-long Twitter break due to the domestic terror event in my nation’s capitol. I’ll likely be resuming the break starting today.

Whilst keeping up with the final descent of the U.S. into a fully failed state, I also noticed that a debate from months ago on CRAN URL checks was still going strong.

I briefly chimed in those months ago and again this week on the dangers of short URLs (which was not exactly the core topic of the debate; that centered on HTTP URL redirects, a protocol feature URL shorteners happen to take advantage of).

Short URLs make it easier to type a URL out or remember a URL (if you can still get a decent, short keyword to use after the /) but they’re dangerous. In case you’re one of the R folks who challenge my security chops, perhaps you’ll believe Bruce.
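
If you must follow a short URL, one mitigation is expanding it first to see where it really points. A quick sketch with {httr}, using one of the short URLs the code below turns up:

res <- httr::HEAD("http://bit.ly/SnLi6h") # {httr} follows redirects by default
res$url                                   # the final, expanded destination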

NOTE: Regular ol’ URLs can be, and are, dangerous too, especially if they’re used in an http:// context vs an https:// context or are run by daft folks who think they’re capable of making a system fully impervious to attackers.

The pandemic has made “cyber” fairly hectic, so my plan to wrap up a safety checker and local package URL re-writer into a small, usable tool/package has no ETA on completion. However, that doesn’t mean you can’t gain visibility into the number, types, and safety of URLs in your locally installed packages.

The code below has exposition in the comments – and you can find it here as well — so I’ll close with it vs my usual “FIN”.

Stay safe out there, folks; and — to my not-so-‘United’-after-all States readers — stay strong! The nightmare of the last four years is almost over (though the cleanup — now both physical and metaphorical — is going to take a long time).

library(urltools)
library(stringi)
library(tidyverse)
# we're also using {clipr} and {tools} but via ::: and ::

# fairly comprehensive list of URL shorteners
shorteners <- read_lines("https://github.com/sambokai/ShortURL-Services-List/raw/master/shorturl-services-list.txt")

# opaque function baked into {tools}
# NOTE: this can take a while
db <- tools:::url_db_from_installed_packages(rownames(installed.packages()), verbose = TRUE)

as_tibble(db) %>% 
  distinct() %>%  # yep, even w/in a pkg there may be dups from ^^
  mutate(
    scheme = scheme(URL), # https or not
    dom = domain(URL)     # need this later to be able to compute apex domain
  )  %>% 
  filter(
    dom != "..", # prbly legit since it will be a relative "go up one directory" 
    !is.na(dom)  # the {tools} url_db_from_installed_packages() is not perfect
  ) %>% 
  bind_cols(
    suffix_extract(.$dom) # break them all down into component atoms
  ) %>% 
  select(-dom) %>% # this is now 'host' from ^^
  mutate(
    apex = sprintf("%s.%s", domain, suffix) # apex domain
  ) %>% 
  mutate(
    is_short = (host %in% shorteners) | (apex %in% shorteners) # does it use a shortener?
  ) -> db

db
## # A tibble: 12,623 x 9
##    URL        Parent    scheme host  subdomain domain suffix apex  is_short
##    <chr>      <chr>     <chr>  <chr> <chr>     <chr>  <chr>  <chr> <lgl>   
##  1 https://g… albersus… https  gith… NA        github com    gith… FALSE   
##  2 https://g… albersus… https  gith… NA        github com    gith… FALSE   
##  3 https://w… AnomalyD… https  www.… www       usenix org    usen… FALSE   
##  4 https://w… AnomalyD… https  www.… www       jstor  org    jsto… FALSE   
##  5 https://w… AnomalyD… https  www.… www       usenix org    usen… FALSE   
##  6 https://w… AnomalyD… https  www.… www       jstor  org    jsto… FALSE   
##  7 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
##  8 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
##  9 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
## 10 https://g… AnomalyD… https  gith… NA        github com    gith… FALSE   
## # … with 12,613 more rows

# what packages do i have installed that use short URLS?
# a nice thing to do would be to file a PR to these authors

filter(db, is_short) %>% 
  select(
    URL,
    Parent,
    scheme
  )
## # A tibble: 5 x 3
##   URL                         Parent                   scheme
##   <chr>                       <chr>                    <chr> 
## 1 https://goo.gl/5KBjL5       fpp2/man/goog.Rd         https 
## 2 http://bit.ly/2016votecount geofacet/man/election.Rd http  
## 3 http://bit.ly/SnLi6h        knitr/man/knit.Rd        http  
## 4 https://bit.ly/magickintro  magick/man/magick.Rd     https 
## 5 http://bit.ly/2UaiYbo       ssh/doc/intro.html       http  

# what protocols are in use? (you'll note that some are borked and
# others got mangled by the {tools} function)

count(db, scheme, sort=TRUE)
## # A tibble: 5 x 2
##   scheme     n
##   <chr>  <int>
## 1 https  10007
## 2 http    2498
## 3 NA       113
## 4 ftp        4
## 5 `https     1

# what are the most used top-level sites?

count(db, host, sort=TRUE) %>% 
  mutate(pct = n/sum(n))
## # A tibble: 1,108 x 3
##    host                      n     pct
##    <chr>                 <int>   <dbl>
##  1 docs.aws.amazon.com    3859 0.306  
##  2 github.com             2954 0.234  
##  3 cran.r-project.org      450 0.0356 
##  4 en.wikipedia.org        220 0.0174 
##  5 aws.amazon.com          204 0.0162 
##  6 doi.org                 181 0.0143 
##  7 wikipedia.org           132 0.0105 
##  8 developers.google.com   114 0.00903
##  9 stackoverflow.com       101 0.00800
## 10 gitlab.com               86 0.00681
## # … with 1,098 more rows

# same as ^^ but apex

count(db, apex, sort=TRUE) %>% 
  mutate(pct = n/sum(n)) 
## # A tibble: 743 x 3
##    apex                  n     pct
##    <chr>             <int>   <dbl>
##  1 amazon.com         4180 0.331  
##  2 github.com         2997 0.237  
##  3 r-project.org       563 0.0446 
##  4 wikipedia.org       352 0.0279 
##  5 doi.org             221 0.0175 
##  6 google.com          179 0.0142 
##  7 tidyverse.org       151 0.0120 
##  8 r-lib.org           137 0.0109 
##  9 rstudio.com         117 0.00927
## 10 stackoverflow.com   102 0.00808
## # … with 733 more rows

# See all the eavesdroppable, interceptable, 
# content-mutable-by-evil-MITM-network-operator URLs
# A nice thing to do would be to fix these and issue PRs

filter(db, scheme == "http") %>% 
  select(URL, Parent)
## # A tibble: 2,498 x 2
##    URL                                                 Parent              
##    <chr>                                               <chr>               
##  1 http://www.winfield.demon.nl                        antiword/DESCRIPTION
##  2 http://github.com/ropensci/antiword/issues          antiword/DESCRIPTION
##  3 http://dirk.eddelbuettel.com/code/anytime.html      anytime/DESCRIPTION 
##  4 http://arrayhelpers.r-forge.r-project.org/          arrayhelpers/DESCRI…
##  5 http://arrow.apache.org/blog/2019/01/25/r-spark-im… arrow/doc/arrow.html
##  6 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/accelera…
##  7 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/accelera…
##  8 http://docs.aws.amazon.com/AmazonS3/latest/dev/acl… aws.s3/man/acl.Rd   
##  9 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/bucket_e…
## 10 http://docs.aws.amazon.com/AmazonS3/latest/API/RES… aws.s3/man/bucketli…
## # … with 2,488 more rows

# find the abusers of "http" URLs

filter(db, scheme == "http") %>% 
  select(URL, Parent) %>% 
  mutate(
    pkg = stri_match_first_regex(Parent, "(^[^/]+)")[,2]
  ) %>% 
  count(pkg, sort=TRUE)
## # A tibble: 265 x 2
##    pkg                        n
##    <chr>                  <int>
##  1 paws.security.identity   258
##  2 paws.management          152
##  3 XML                      129
##  4 paws.analytics            78
##  5 stringi                   70
##  6 paws                      57
##  7 RCurl                     51
##  8 igraph                    49
##  9 base                      47
## 10 aws.s3                    44
## # … with 255 more rows

# send all the apex domains to the clipboard

clipr::write_clip(unique(db$apex))

# go here to paste them into the domain search box
# most domain/URL checker APIs aren't free for more 
# than a cpl dozen URLs/domains

browseURL("https://www.bulkblacklist.com")

# paste what you clipped into the box and wait a while

Over Christmas break I teased some screencaps of some almost-natural “R”-looking code (this is a snippet):

Button("Run") {
  do { // calls to R can fail so there are lots of "try"s; poking at less ugly alternatives

    // handling dots in named calls is a WIP
    _  = try R.evalParse("options(tidyverse.quiet = TRUE )")

    // in practice this wld be called once in a model
    try R.library("ggplot2")
    try R.library("hrbrthemes")
    try R.library("magick")

    // can mix initialization of an R list with Swift and R objects
    let mvals: RObject = [
      "month": [ "Jan", "Feb", "Mar", "Apr", "May", "Jun" ],
      "value": try R.sample(100, 6)
    ]

    // ggplot2! `mvals` is above, `col.hexValue` comes from the color picker
    // can't do R.as.data.frame b/c "dots" so this is a deliberately exposed alternate call
    let gg = try R.ggplot(R.as_data_frame(mvals)) +
      R.geom_col(R.aes_string("month", "value"), fill: col.hexValue) + // supports both [un]named
      R.scale_y_comma() +
      R.labs(
        x: rNULL, y: "# things",
        title: "Monthly Bars"
      ) +
      R.theme_ipsum_gs(grid: "Y")

    // an alternative to {magick} could be getting raw SVG from {svglite} device
    // we get Image view width/height and pass that to {magick}
    // either beats disk/ssd round-trip
    let fig = try R.image_graph(
      width: Double(imageRect.width), 
      height: Double(imageRect.height), 
      res: 144
    )

    try R.print(gg)
    _ = R.dev_off() // can't do R.dev.off b/c "dots" so this is a deliberately exposed alternate call

    let res = try R.image_write(fig, path: rNULL, format: "png")

    imgData = Data(res) // "imgData" is a reactive SwiftUI bound object; when it changes Image does too

  } catch {
  }

}

that works in Swift as part of a SwiftUI app that displays a ggplot2 plot inside of a macOS application.

It doesn’t shell out to R, but uses Swift 5’s native abilities to interface with R’s C interface.

I’m not ready to reveal that SwiftR code/library just yet (break’s over and the core bits still need some tweaking) but I can provide some interim resources with an online book about working with R’s C interface from Swift on macOS. It is uninspiringly called SwiftR — Using R from Swift.

There are, at present, six chapters that introduce the Swift+R concepts via command line apps. These aren’t terribly useful (shebanged R scripts work just fine, #tyvm) in and of themselves, but command line machinations are a much lower barrier to entry than starting right in with SwiftUI (that starts in chapter seven).

FIN

If you’ve wanted a reason to burn ~20GB of drive space with an Xcode installation and start to learn Swift (or learn more about Swift) then this is a resource for you.

The topics in the chapters are also a fairly decent (albeit incomplete) overview of R’s C interface and also how to work with C code from Swift in general.

So, take advantage of the remaining pandemic time and give it a 👀.

Feedback is welcome in the comments or the book code repo (book source repo is in progress).

Hope everyone has a safe and strong new year!

While the future of the Apache Drill ecosystem is somewhat in-play (MapR — a major sponsoring org for the project — is kinda dead), I still use it almost daily (on my local home office cluster) to avoid handing over any more money to Amazon than I/we already do. The latest (yet-to-be-released) v1.18.0 has some great improvements, including JSON resultset streaming for the REST API. Alas, tweaking {sergeant} (my REST API R package) to handle that is not on the TODO for the foreseeable future, so I’ve been using {sergeant.caffeinated} (an RJDBC wrapper for the Drill JDBC interface — https://github.com/hrbrmstr/sergeant-caffeinated) for quite a while, since it handles large resultsets quite nicely.

I broke out the RJDBC functionality from {sergeant} into this separate package since, despite the fact that it’s 2019/2020, many folks still have/had problems getting {rJava} to work (FWIW it’s a seamless install for me on Windows, Ubuntu, or macOS, even Apple Silicon macOS). The surgery to separate it was fairly hack-ish (one reason it’s not on CRAN) and it finally broke with the recent {dbplyr} 2.x release. I assumed fixing the caffeinated version was easier/quicker than the REST API version, so I dug in and am cautiously tossing it out for wider poking.
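
(If you’re in the can’t-get-{rJava}-working group, a quick sanity check that the JVM bridge is functional looks like this sketch:)

rJava::.jinit() # 0 means the JVM came up fine
rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")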

An All New Way To Use 💂☕️

Gone are the days of src_drill_jdbc(); in its place comes more standardized {DBI} and {d[b]plyr} access to Apache Drill. To install this version you can do:

remotes::install_github("hrbrmstr/sergeant-caffeinated")

(more install options using safer and saner social coding sites coming soon).

Let’s load up the package(s) and perform some operations.

library(sergeant.caffeinated)

test_host <- Sys.getenv("DRILL_TEST_HOST", "localhost")

be_quiet()

(con <- dbConnect(drv = DrillJDBC(), sprintf("jdbc:drill:zk=%s", test_host)))
## <DrillJDBCConnection>

The DRILL_TEST_HOST environment variable contains the hostname or IP address of my/your Drill server, defaulting to localhost if none is found.

The be_quiet() function stops the Java engine from yelling at you with “illegal reflective access” warnings. If you see this in other rJava-powered packages it means code in some classes in some Java archive files is doing some sketchy old-school things that newer JVMs aren’t happy about. At some point, these warnings become full-on errors which will break many things. Unfortunately, Drill is still fairly tied to Java 8.x and has tons of introspecting code. The errors are ugly, so if you want to get rid of them, just call this function before doing anything with Drill. (You’ll also notice log4j errors are finally gone!)

Now that we have a Drill JDBC connection, we can do something with it. All the DBI-ish operations work, but it’s 2020 and {d[b]plyr} is the bee’s knees, so we’ll just dive right in with that:

(db <- tbl(con, "cp.`employee.json`"))

## # Source:   table<cp.`employee.json`> [?? x 16]
## # Database: DrillJDBCConnection
##    employee_id full_name first_name last_name position_id position_title store_id
##          <dbl> <chr>     <chr>      <chr>           <dbl> <chr>             <dbl>
##  1           1 Sheri No… Sheri      Nowmer              1 President             0
##  2           2 Derrick … Derrick    Whelply             2 VP Country Ma…        0
##  3           4 Michael … Michael    Spence              2 VP Country Ma…        0
##  4           5 Maya Gut… Maya       Gutierrez           2 VP Country Ma…        0
##  5           6 Roberta … Roberta    Damstra             3 VP Informatio…        0
##  6           7 Rebecca … Rebecca    Kanagaki            4 VP Human Reso…        0
##  7           8 Kim Brun… Kim        Brunner            11 Store Manager         9
##  8           9 Brenda B… Brenda     Blumberg           11 Store Manager        21
##  9          10 Darren S… Darren     Stanz               5 VP Finance            0
## 10          11 Jonathan… Jonathan   Murraiin           11 Store Manager         1
## # … with more rows, and 9 more variables: department_id <dbl>, birth_date <chr>,
## #   hire_date <chr>, salary <dbl>, supervisor_id <dbl>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

Basically, that’s it: it “just works”.
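
As one last sketch to exercise the translation layer (using the built-in employee table from above):

db %>% 
  count(position_title, sort = TRUE) # {d[b]plyr} turns this into SQL that Drill executes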

FIN

If you’ve been a user of {sergeant.caffeinated} and really need src_drill_jdbc() back, drop an issue on GH or a note in the comments, and be sure to file issues if I’ve missed anything as you kick the tyres.