Skip navigation

Presented without much commentary since I stopped once {ggrepel} and {graphlayouts} failed (RStudio doesn’t support it yet, either, which I knew).

The following steps will get you a fully working and STUPID FAST fully native ARM64 M1/Apple Silicon R setup with {tidyverse} and {rJava}.

Just remember, that if you need RStudio (or anything that links against the x86_64 R dylib) you’re going to be reverting this to get stuff done.

# Setting up ARM64 R on Apple Silicon/M1

# I'd run each section by hand, but feel free to live dangerously

# save off what you have installed in homebrew
brew list > ~/Documents/currently-installed-homebrew-formulas.txt

# uninstall x86_64 homebrew
# NOTE that in theory x86_64 and arm64 homebrew can live happily together but
# I went cheap on the SSD in the M1 Mini and wld like the space back
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/uninstall.sh)"

# make sure you unalias "brew" if you aliased it with "arch"
# need to do this in whatever shell startup script(s) you use, too
unalias brew

# install arm64 homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

# do what it says re: paths

# install wget (to make sure stuff is working)
brew install wget 

# place for r-libs; ty R Core & Prof Ripley!
mkdir ~/Downloads/libs-arm64/

# go there
cd ~/Downloads/libs-arm64

# grab'em from https://mac.r-project.org/libs-arm64/
for dl in $(curl -sS "https://mac.r-project.org/libs-arm64/" \
  | xmllint --html --xpath '//td/a[contains(@href, 'tar.gz')]/@href' 2>/dev/null - \
  | sed -e 's/ href="//g' -e 's/"/\n/g'); do wget "https://mac.r-project.org/libs-arm64/${dl}" ; done

# prime sudo (not rly necessary but I dislike having to enter sudo passwords in a for loop)
sudo ls ~/Downloads/libs-arm64

# extract'em
for gz in $(ls ~/Downloads/libs-arm64/*gz); do  
  sudo tar fvxz ${gz} -C /
done

# grab r-devel
cd ~/Downloads
wget https://mac.r-project.org/big-sur/R-devel/arm64/R-devel.tar.gz

# extract it
tar fvxz R*.tar.gz -C /

# install libxml2 and more (to prime libraries)
brew install libxml2 ccache libgit2 unixodbc poppler coreutils

# review ~/Documents/currently-installed-homebrew-formulas.txt and add what you need from there

# put this in your shell startup (macOS folks shld get used to zsh, so ~/.zshrc is a good place to stick it at the end)
export PATH=/opt/R/arm64/bin:$PATH

# and also run it at the command prompt
export PATH=/opt/R/arm64/bin:$PATH

# go for broke!
Rscript -e "install.packages('tidyverse')"

# throw caution to the wind!
Rscript -e "install.packages('devtools')"

# shoot for the moon!
Rscript -e "install.packages(c('DBI', 'odbc'))"

# it's crazytown
Rscript -e "install.packages('ggraph')"

# ARGH! {ggrepel} and {graphlayouts} fail (RStudio won't work anyway so this whole thing was just an exercise)

# setup java; open and run the pkg installer; they have a tar.gz as well 
wget https://cdn.azul.com/zulu/bin/zulu11.45.27-ca-jdk11.0.10-macosx_aarch64.dmg

# tell R about it
R CMD javareconf

# setup JAVA_HOME like you would in your shell

# this is necessary as the Java framework is gone from macOS in Big Sur
Rscript -e 'install.packages("rJava",,"http://rforge.net")'

##################################
# TO ***UNDO*** ALL OF THE ABOVE #
##################################

# uninstall amd64 homebrew
# NOTE that in theory x86_64 and arm64 homebrew can live happily together but
# I went cheap on the SSD in the M1 Mini and wld like the space back
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/uninstall.sh)"

# make sure you re-alias "brew" 
# need to do this in whatever shell startup script(s) you use, too
alias brew='arch --x86_64 brew'

# install x86_64 homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

# install libxml2 and more (to prime libraries)
brew install libxml2 ccache libgit2 unixodbc poppler coreutils

# review ~/Documents/currently-installed-homebrew-formulas.txt and add what you need from there

# remove the following from your shell startup script(s)
export PATH=/opt/R/arm64/bin:$PATH

# re-setup java for x86_64; open and run the pkg installer; they have a tar.gz as well 
wget "https://cdn.azul.com/zulu/bin/zulu11.45.27-ca-jdk11.0.10-macosx_x64.dmg" 

# re-do the parts of the rest of the setup that you need to

Until CRAN has {sf} binaries, use this recipe to build it from source.

If you’ve been following me around the internets for a while you’ve likely heard me pontificate about the need to be aware of and reduce — when possible — your personal “cyber” attack surface. One of the ways you can do that is to install as few applications as possible onto your devices and make sure you have a decent handle on those you’ve kept around are doing or capable of doing.

On macOS, one application attribute you can look at is the set of “entitlements” apps have asked for and that you have actioned on (i.e. either granted or denied the entitlement request). If you have Developer Tools or Xcode installed you can use the codesign utility (it may be usable w/o the developer tools, but I never run without them so drop a note in the comments if you can confirm this) to see them:

$ codesign -d --entitlements :- /Applications/RStudio.app
Executable=/Applications/RStudio.app/Contents/MacOS/RStudio
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>

  <!-- Required by R packages which want to access the camera. -->
  <key>com.apple.security.device.camera</key>
  <true/>

  <!-- Required by R packages which want to access the microphone. -->
  <key>com.apple.security.device.audio-input</key>
  <true/>

  <!-- Required by Qt / Chromium. -->
  <key>com.apple.security.cs.disable-library-validation</key>
  <true/>
  <key>com.apple.security.cs.disable-executable-page-protection</key>
  <true/>

  <!-- We use DYLD_INSERT_LIBRARIES to load the libR.dylib dynamically. -->
  <key>com.apple.security.cs.allow-dyld-environment-variables</key>
  <true/>

  <!-- Required to knit to Word -->
  <key>com.apple.security.automation.apple-events</key>
  <true/>

</dict>
</plist>

The output is (ugh) XML, and don’t think that all app developers are as awesome as RStudio ones since those comments are pseudo-optional (i.e. you can put junk in them). I’ll continue to use RStudio throughout this example just for consistency.

Since you likely have better things to do than execute a command line tool multiple times and do significant damage to your eyes with all those pointy tags we can use R to turn the apps on our filesystem into data and examine the entitlements in a much more dignified manner.

First, we’ll write a function to wrap the codesign tool execution (and, I’ve leaked how were going to eventually look at them by putting all the library calls up front):

library(XML)
library(tidyverse)
library(igraph)
library(tidygraph)
library(ggraph)

# rewriting this to also grab the text from the comments is an exercise left to the reader

read_entitlements <- function(app) { 

  system2(
    command = "codesign",
    args = c(
      "-d",
      "--entitlements",
      ":-",
      gsub(" ", "\\\\ ", app)
    ),
    stdout = TRUE
  ) -> x

  x <- paste0(x, collapse = "\n")

  if (nchar(x) == 0) return(tibble())

  x <- XML::xmlParse(x, asText=TRUE)
  x <- try(XML::readKeyValueDB(x), silent = TRUE)

  if (inherits(x, "try-error")) return(tibble())

  x <- sapply(x, function(.x) paste0(.x, collapse=";"))

  if (length(x) == 0) return(tibble())

  data.frame(
    app = basename(app),
    entitlement = make.unique(names(x)),
    value = I(x)
  ) -> x

  x <- tibble::as_tibble(x)

  x

} 

Now, we can slurp up all the entitlements with just a few lines of code:

my_apps <- list.files("/Applications", pattern = "\\.app$", full.names = TRUE)

my_apps_entitlements <- map_df(my_apps, read_entitlements)

my_apps_entitlements %>% 
  filter(grepl("RStudio", app))
## # A tibble: 6 x 3
##   app         entitlement                                              value   
##   <chr>       <chr>                                                    <I<chr>>
## 1 RStudio.app com.apple.security.device.camera                         TRUE    
## 2 RStudio.app com.apple.security.device.audio-input                    TRUE    
## 3 RStudio.app com.apple.security.cs.disable-library-validation         TRUE    
## 4 RStudio.app com.apple.security.cs.disable-executable-page-protection TRUE    
## 5 RStudio.app com.apple.security.cs.allow-dyld-environment-variables   TRUE    
## 6 RStudio.app com.apple.security.automation.apple-events               TRUE 

Having these entitlement strings is great, but what do they mean? Unfortunately, Apple, frankly, sucks at developer documentation, and this suckage shines especially bright when it comes to documenting all the possible entitlements. We can retrieve some of them from the online documentation, so let’s do that and re-look at RStudio:

# a handful of fairly ok json URLs that back the online dev docs; they have ok, but scant entitlement definitions
c(
  "https://developer.apple.com/tutorials/data/documentation/bundleresources/entitlements.json",
  "https://developer.apple.com/tutorials/data/documentation/security/app_sandbox.json",
  "https://developer.apple.com/tutorials/data/documentation/security/hardened_runtime.json",
  "https://developer.apple.com/tutorials/data/documentation/bundleresources/entitlements/system_extensions.json"
) -> entitlements_info_urls

extract_entitlements_info <- function(x) {

  apple_ents_pg <- jsonlite::fromJSON(x)

  apple_ents_pg$references %>% 
    map_df(~{

      if (!hasName(.x, "role")) return(tibble())
      if (.x$role != "symbol") return(tibble())

      tibble(
        title = .x$title,
        entitlement = .x$name,
        description = .x$abstract$text %||% NA_character_
      )

    })

}

entitlements_info_urls %>% 
  map(extract_ents_info) %>% 
  bind_rows() %>% 
  distinct() -> apple_entitlements_definitions

# look at rstudio again ---------------------------------------------------

my_apps_entitlements %>% 
  left_join(apple_entitlements_definitions) %>% 
  filter(grepl("RStudio", app)) %>% 
  select(title, description)
## Joining, by = "entitlement"
## # A tibble: 6 x 2
##   title                            description                                                       
##   <chr>                            <chr>                                                             
## 1 Camera Entitlement               A Boolean value that indicates whether the app may capture movies…
## 2 Audio Input Entitlement          A Boolean value that indicates whether the app may record audio u…
## 3 Disable Library Validation Enti… A Boolean value that indicates whether the app may load arbitrary…
## 4 Disable Executable Memory Prote… A Boolean value that indicates whether to disable all code signin…
## 5 Allow DYLD Environment Variable… A Boolean value that indicates whether the app may be affected by…
## 6 Apple Events Entitlement         A Boolean value that indicates whether the app may prompt the use…

It might be interesting to see what the most requested entitlements are:


my_apps_entitlements %>% filter( grepl("security", entitlement) ) %>% count(entitlement, sort = TRUE) ## # A tibble: 60 x 2 ## entitlement n ## <chr> <int> ## 1 com.apple.security.app-sandbox 51 ## 2 com.apple.security.network.client 44 ## 3 com.apple.security.files.user-selected.read-write 35 ## 4 com.apple.security.application-groups 29 ## 5 com.apple.security.automation.apple-events 26 ## 6 com.apple.security.device.audio-input 19 ## 7 com.apple.security.device.camera 17 ## 8 com.apple.security.files.bookmarks.app-scope 16 ## 9 com.apple.security.network.server 16 ## 10 com.apple.security.cs.disable-library-validation 15 ## # … with 50 more rows

Playing in an app sandbox, talking to the internet, and handling files are unsurprising in the top three slots since that’s how most apps get stuff done for you.

There are a few entitlements which increase your attack surface, one of which is apps that use untrusted third-party libraries:

my_apps_entitlements %>% 
  filter(
    entitlement == "com.apple.security.cs.disable-library-validation"
  ) %>% 
  select(app)
## # A tibble: 15 x 1
##    app                      
##    <chr>                    
##  1 Epic Games Launcher.app  
##  2 GarageBand.app           
##  3 HandBrake.app            
##  4 IINA.app                 
##  5 iStat Menus.app          
##  6 krisp.app                
##  7 Microsoft Excel.app      
##  8 Microsoft PowerPoint.app 
##  9 Microsoft Word.app       
## 10 Mirror for Chromecast.app
## 11 Overflow.app             
## 12 R.app                    
## 13 RStudio.app              
## 14 RSwitch.app              
## 15 Wireshark.app 

(‘Tis ironic that one of Apple’s own apps is in that list.)

What about apps that listen on the network (i.e. are also servers)?

## # A tibble: 16 x 1
##    app                          
##    <chr>                        
##  1 1Blocker.app                 
##  2 1Password 7.app              
##  3 Adblock Plus.app             
##  4 Divinity - Original Sin 2.app
##  5 Fantastical.app              
##  6 feedly.app                   
##  7 GarageBand.app               
##  8 iMovie.app                   
##  9 Keynote.app                  
## 10 Kindle.app                   
## 11 Microsoft Remote Desktop.app 
## 12 Mirror for Chromecast.app    
## 13 Slack.app                    
## 14 Tailscale.app                
## 15 Telegram.app                 
## 16 xScope.app 

You should read through the retrieved definitions to see what else you may want to observe to be an informed macOS app user.

The Big Picture

Looking at individual apps is great, but why not look at them all? We can build a large, but searchable network graph hierarchy if we output it as PDf, so let’s do that:

# this is just some brutish force code to build a hierarchical edge list

my_apps_entitlements %>% 
  distinct(entitlement) %>% 
  pull(entitlement) %>% 
  stri_count_fixed(".") %>% 
  max() -> max_dots

my_apps_entitlements %>% 
  distinct(entitlement, app) %>% 
  separate(
    entitlement,
    into = sprintf("level_%02d", 1:(max_dots+1)),
    fill = "right",
    sep = "\\."
  ) %>% 
  select(
    starts_with("level"), app
  ) -> wide_hierarchy

bind_rows(

  distinct(wide_hierarchy, level_01) %>%
    rename(to = level_01) %>%
    mutate(from = ".") %>%
    select(from, to) %>% 
    mutate(to = sprintf("%s_1", to)),

  map_df(1:nrow(wide_hierarchy), ~{

    wide_hierarchy[.x,] %>% 
      unlist(use.names = FALSE) %>% 
      na.exclude() -> tmp

    tibble(
      from = tail(lag(tmp), -1),
      to = head(lead(tmp), -1),
      lvl = 1:length(from)
    ) %>% 
      mutate(
        from = sprintf("%s_%d", from, lvl),
        to = sprintf("%s_%d", to, lvl+1)
      )

  }) %>% 
    distinct()

) -> long_hierarchy

# all that so we can make a pretty graph! ---------------------------------

g <- graph_from_data_frame(long_hierarchy, directed = TRUE)

ggraph(g, 'partition', circular = FALSE) + 
  geom_edge_diagonal2(
    width = 0.125, color = "gray70"
  ) + 
  geom_node_text(
    aes(
      label = stri_replace_last_regex(name, "_[[:digit:]]+$", "")
    ),
    hjust = 0, size = 3, family = font_gs
  ) +
  coord_flip() -> gg

# saving as PDF b/c it is ginormous, but very searchable

quartz(
  file = "~/output-tmp/.pdf", # put it where you want
  width = 21,
  height = 88,
  type = "pdf",
  family = font_gs
)
print(gg)
dev.off()

The above generates a large (dimension-wise; it’s ~<5MB on disk for me) PDF graph that is barely viewable in thumnail mode:

Here are some screen captures of portions of it. First are all network servers and clients:

Last are seekrit entitlements only for Apple:

FIN

I’ll likely put a few of these functions into {mactheknife} for easier usage.

After going through this exercise I deleted 11 apps, some for their entitlements and others that I just never use anymore. Hopefully this will help you do some early Spring cleaning as well.

I was chatting with a fellow Amazon Athena user and the topic of using Presto functions such as approx_distinct() via {d[b]plyr} came up and it seems it might not be fully common knowledge that any non-already translated function is passed to the destination intact. That means you can just “use” approx_distinct() and it will work just fine. Here’s an example using the ODBC {DBI} interface:

library(dbplyr)
library(tidyverse)

# My personal Athena workgroup has been upgraded to "engine 2"
# so Presto 0.217 functions are available. Only noting that for
# folks who may not keep up with AWS announcements.
#
# https://prestodb.io/docs/0.217/index.html

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AuthenticationType = "IAM Profile",
  AWSProfile = "personal",
  MaxCatalogNameLen = 0L,
  MaxSchemaNameLen = 0L,
  MaxColumnNameLen = 0L,
  MaxTableNameLen = 0L,
  UseResultsetStreaming = 1L,
  StringColumnLength = 32 * 1024L,
  S3OutputLocation = "s3://accessible-bucket/"
) -> con

# this comes with Athena
elb_logs <- tbl(con, "elb_logs")

elb_logs
## # Source:   table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    timestamp elbname requestip requestport backendip backendport
##    <chr>     <chr>   <chr>           <int> <chr>           <int>
##  1 2014-09-… lb-demo 251.51.8…       17141 251.111.…        8000
##  2 2014-09-… lb-demo 244.201.…       17141 244.140.…        8888
##  3 2014-09-… lb-demo 242.204.…       17141 255.196.…        8888
##  4 2014-09-… lb-demo 251.51.8…       17141 255.129.…        8888
##  5 2014-09-… lb-demo 242.241.…       17141 255.129.…        8899
##  6 2014-09-… lb-demo 243.198.…       17141 255.129.…        8888
##  7 2014-09-… lb-demo 244.119.…       17141 242.89.1…          80
##  8 2014-09-… lb-demo 254.173.…       17141 251.51.8…        8000
##  9 2014-09-… lb-demo 243.198.…       17141 254.149.…        8888
## 10 2014-09-… lb-demo 249.185.…       17141 241.36.2…        8888
## # … with more rows, and 10 more variables: requestprocessingtime <dbl>,
## #   backendprocessingtime <dbl>, clientresponsetime <dbl>,
## #   elbresponsecode <chr>, backendresponsecode <chr>,
## #   receivedbytes <int64>, sentbytes <int64>, requestverb <chr>,
## #   url <chr>, protocol <chr>

elb_logs %>% 
  summarise(d = n_distinct(backendip)) # 0.62 seconds
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##         d
##   <int64>
## 1    2311

# https://prestodb.io/docs/0.217/functions/aggregate.html#approx_distinct

elb_logs %>% 
  summarise(d = approx_distinct(backendip)) # 0.49 seconds
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##         d
##   <int64>
## 1    2386

In this toy example there’s no real reason to use this alternate function, but on my datasets using the approximator version dramatically reduces query time, reduces query cost, and produces results that by default have a standard error of 2.3% (which is fine for the use-cases I apply this to). There’s an alternate signature which lets you supply the standard error, as well.

If you’re curious as to what functions are translated by default, just use sql_translate_env() on the connection object:

sql_translate_env(con)
## <sql_variant>
## scalar:    -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar:    %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar:    as_date, as_datetime, as.character, as.Date,
## scalar:    as.double, as.integer, as.integer64, as.logical,
## scalar:    as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar:    bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar:    bitwXor, c, case_when, ceil, ceiling, coalesce, cos,
## scalar:    cosh, cot, coth, day, desc, exp, floor, hour, if,
## scalar:    if_else, ifelse, is.na, is.null, log, log10, mday,
## scalar:    minute, month, na_if, nchar, now, paste, paste0, pmax,
## scalar:    pmin, qday, round, second, sign, sin, sinh, sql, sqrt,
## scalar:    str_c, str_conv, str_count, str_detect, str_dup,
## scalar:    str_extract, str_extract_all, str_flatten, str_glue,
## scalar:    str_glue_data, str_interp, str_length, str_locate,
## scalar:    str_locate_all, str_match, str_match_all, str_order,
## scalar:    str_pad, str_remove, str_remove_all, str_replace,
## scalar:    str_replace_all, str_replace_na, str_sort, str_split,
## scalar:    str_split_fixed, str_squish, str_sub, str_subset,
## scalar:    str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar:    str_trunc, str_view, str_view_all, str_which,
## scalar:    str_wrap, substr, substring, switch, tan, tanh, today,
## scalar:    tolower, toupper, trimws, wday, xor, yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window:    cume_dist, cummax, cummean, cummin, cumsum,
## window:    dense_rank, first, lag, last, lead, max, mean, median,
## window:    min, min_rank, n, n_distinct, nth, ntile, order_by,
## window:    percent_rank, quantile, rank, row_number, sd, sum, var

The release of the latest versions of {d[b]plyr} destroyed a lazy, bad, hack I was using to cast columns to JSON (you’ll note the lack of a cast() function above, which is necessary for Athena since the syntax is not that of a function call). I’m _very_glad they did since it’s bad to rely on undocumented functionality and, honestly, it’s pretty straightforward to make an “official” translation for them.

First, we need the class of this Athena ODBC connection:

class(con)
## [1] "Amazon Athena"
## attr(,"package")
## [1] ".GlobalEnv"

We’ll need to write a sql_translation.Amazon Athena() function for this connection class and we’ll start with writing one that doesn’t handle our casting just to show the basic setup:

`sql_translation.Amazon Athena` <- function(x) {
  sql_variant(
    dbplyr::base_odbc_scalar,
    dbplyr::base_odbc_agg,
    dbplyr::base_odbc_win
  )
}

All that function is doing (now) is setting up the default translators you’ve seen in the above output listings.

To make it do something else, we need to add casting translator helpers, which fall under the “scalar” category. This, too, is pretty straightforward since {dbplyr} makes it possible to just extend a parent set of category translators:

sql_translator(
  .parent = dbplyr::base_odbc_scalar,
  cast_as = function(x, y) dbplyr::build_sql("CAST(", x, " AS ", y, ")"),
  try_cast_as = function(x, y) dbplyr::build_sql("TRY_CAST(", x, " AS ", y, ")")
) -> athena_scalar

`sql_translation.Amazon Athena` <- function(x) {
  sql_variant(
    athena_scalar,
    dbplyr::base_odbc_agg,
    dbplyr::base_odbc_win
  )
}

Now, let’s see if it really knows about our new casting functions:

sql_translate_env(con)
## <sql_variant>
## scalar:    -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar:    %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar:    as_date, as_datetime, as.character, as.Date,
## scalar:    as.double, as.integer, as.integer64, as.logical,
## scalar:    as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar:    bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar:    bitwXor, c, case_when, cast_as, ceil, ceiling,
## scalar:    coalesce, cos, cosh, cot, coth, day, desc, exp, floor,
## scalar:    hour, if, if_else, ifelse, is.na, is.null, log, log10,
## scalar:    mday, minute, month, na_if, nchar, now, paste, paste0,
## scalar:    pmax, pmin, qday, round, second, sign, sin, sinh, sql,
## scalar:    sqrt, str_c, str_conv, str_count, str_detect, str_dup,
## scalar:    str_extract, str_extract_all, str_flatten, str_glue,
## scalar:    str_glue_data, str_interp, str_length, str_locate,
## scalar:    str_locate_all, str_match, str_match_all, str_order,
## scalar:    str_pad, str_remove, str_remove_all, str_replace,
## scalar:    str_replace_all, str_replace_na, str_sort, str_split,
## scalar:    str_split_fixed, str_squish, str_sub, str_subset,
## scalar:    str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar:    str_trunc, str_view, str_view_all, str_which,
## scalar:    str_wrap, substr, substring, switch, tan, tanh, today,
## scalar:    tolower, toupper, trimws, try_cast_as, wday, xor,
## scalar:    yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window:    cume_dist, cummax, cummean, cummin, cumsum,
## window:    dense_rank, first, lag, last, lead, max, mean, median,
## window:    min, min_rank, n, n_distinct, nth, ntile, order_by,
## window:    percent_rank, quantile, rank, row_number, sd, sum, var

Aye! Let’s test it out.

Unfortunately, this boring, default database has no MAP columns to really show this off, but we can convert a simple character column into JSON just to get the idea:

elb_logs %>% 
  select(backendip)
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    backendip      
##    <chr>          
##  1 249.6.80.219   
##  2 248.178.189.65 
##  3 254.70.228.23  
##  4 248.178.189.65 
##  5 252.0.81.65    
##  6 248.178.189.65 
##  7 245.241.133.121
##  8 244.202.183.67 
##  9 255.226.190.127
## 10 246.22.152.210 
## # … with more rows

elb_logs %>% 
  select(backendip) %>% 
  mutate(
    backendip = cast_as(backendip, JSON)
  )
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    backendip            
##    <chr>                
##  1 "\"244.238.214.120\""
##  2 "\"248.99.214.228\"" 
##  3 "\"243.3.190.175\""  
##  4 "\"246.235.181.255\""
##  5 "\"241.112.203.216\""
##  6 "\"240.147.242.82\"" 
##  7 "\"248.99.214.228\"" 
##  8 "\"248.99.214.228\"" 
##  9 "\"253.161.243.121\""
## 10 "\"248.99.214.228\"" 
## # … with more rows

FIN

Despite the {tidyverse} documentation being written with care and clarity, this part of the R ecosystem is so extensive and evolving that watching out for all the doors and corners can be tricky. It’s easy for the short paragraph on the “untranslated function” capability to be overlooked and it may be hard to fully grok the translation concept without an IRL example.

Hopefully this helped (even if only a little) demystify these two areas of {d[b]plyr}.

(this is an unrolled Twitter thread converted to the blog since one never knows how long content will be preserved anywhere anymore)

It looks like @StackPath (NetCDN[.]com redirects to them) is enabling insurrection-mongers. They’re fronting news[.]parler[.]com .

It seems they (Parler) have a second domain dicecrm[.]com with the actual content, too.

dicecrm[.]com is hosted in @awscloud, so it looks like Parler folks are smarter than Bezos’ minions. Amazon might want to take this down before it gets going (again).

They load JS via @Google tag manager (you can see in the HTML src). The GA_MEASUREMENT_ID is “G-P76KHELPLT

BGP Info for the IPs associated with the domain

In site source screenshot in the first tweet there’s a reference to twexit[.]com. DNS for it shows they also have leftwexit[.]com (which is a very odd site).

"Twexit" is being enabled by @awscloud @GoDaddy and @WordPress/@automattic plus @StackPath.

While the main page has (unsurprisingly) busted HTML, they’re using their old sitemap[.]xml — https://carbon.now.sh/mdyJbvddCvZaGu2tOnD6 — which has a singular recent (whining) entry: http://dicecrm[.]com/updates/facebook-continues-their-confusing-hypocritical-stifling-of-free-speech-

Looks like @Shareaholic is also enabling Parler. Their “shareaholic:site_id” is “f7b53d75b2e7afdc512ea898bbbff585“.

shareaholic id capture

One of the CDN content refs is this (attached img). It’s loading content for Parler from free[ ]pressers[.]com, which is a pretty nutjob fake news site enabled by @IBMcloud (so IBM is enabling Parler as well). the free[ ]pressers Twitter is equally nutjob.

I suspect Parler is going to keep rejiggering this nutjob-fueled content network knowing that AWS, IBM (et al) won't play whack-a-mole and are rly just waiting for our collective memory and attention to fade so they can go back to making $ from divisiveness, greed, & hate.

protip: perhaps not spin up a new FQDN with such hastily-crafted garbage behind it when you know lots of very technically-resourced 👀 are on you.

Originally tweeted by (@hrbrmstr) on 2021-01-29.

The past two posts have (lightly) introduced how to use compiled Swift code in R, but they’ve involved a bunch of “scary” command line machinations and incantations.

One feature of {Rcpp} I’ve always 💙 is the cppFunction() (“r-lib” zealots have a similar cpp11::cpp_function()) which lets one experiment with C[++] code in R with as little friction as possible. To make it easier to start experimenting with Swift, I’ve built an extremely fragile swift_function() in {swiftr} that intends to replicate this functionality. Explaining it will be easier with an example.

Reading Property Lists in R With Swift

macOS relies heavily on property lists for many, many things. These can be plain text (XML) or binary files and there are command-line tools and Python libraries (usable via {reticulate}) that can read them along with the good ‘ol XML::readKeyValueDB(). We’re going to create a Swift function to read property lists and return JSON which we can use back in R via {jsonlite}.

This time around there’s no need to create extra files, just install {swiftr} and your favorite R IDE and enter the following (expository is after the code):

library(swiftr)

swift_function(
  code = '

func ignored() {
  print("""
this will be ignored by swift_function() but you could use private
functions as helpers for the main public Swift function which will be 
made available to R.
""")
}  

@_cdecl ("read_plist")
public func read_plist(path: SEXP) -> SEXP {

  var out: SEXP = R_NilValue

  do {
    // read in the raw plist
    let plistRaw = try Data(contentsOf: URL(fileURLWithPath: String(cString: R_CHAR(STRING_ELT(path, 0)))))

    // convert it to a PropertyList  
    let plist = try PropertyListSerialization.propertyList(from: plistRaw, options: [], format: nil) as! [String:Any]

    // serialize it to JSON
    let jsonData = try JSONSerialization.data(withJSONObject: plist , options: .prettyPrinted)

    // setup the JSON string return
    String(data: jsonData, encoding: .utf8)?.withCString { 
      cstr in out = Rf_mkString(cstr) 
    }

  } catch {
    debugPrint("\\(error)")
  }

  return(out)

}
')

This new swift_function() function — for the moment (the API is absolutely going to change) — is defined as:

swift_function(
  code,
  env = globalenv(),
  imports = c("Foundation"),
  cache_dir = tempdir()
)

where:

  • code is a length 1 character vector of Swift code
  • env is the environment to expose the function in (defaults to the global environment)
  • imports is a character vector of any extra Swift frameworks that need to be imported
  • cache_dir is where all the temporary files will be created and compiled dynlib will be stored. It defaults to a temporary directory so specify your own directory (that exists) if you want to keep the files around after you close the R session

Folks familiar with cppFunction() will notice some (on-purpose) similarities.

The function expects you to expose only one public Swift function which also (for the moment) needs to have the @_cdecl decorator before it. You can have as many other valid Swift helper functions as you like, but are restricted to one function that will be turned into an R function automagically.

In this example, swift_function() will see public func read_plist(path: SEXP) -> SEXP { and be able to identify

  • the function name (read_plist)
  • the number of parameters (they all need to be SEXP, for now)
  • the names of the parameters

A complete source file with all the imports will be created and a pre-built bridging header (which comes along for the ride with {swiftr}) will be included in the compilation step and a dylib will be built and loaded into the R session. Finally, an R function that wraps a .Call() will be created and will have the function name of the Swift function as well as all the parameter names (if any).

In the case of our example, above, the built R function is:

function(path) {
  .Call("read_plist", path)
}

There’s a good chance you’re using RStudio, so we can test this with it’s property list, or you can substitute any other application’s property list (or any .plist you have) to test this out:

read_plist("/Applications/RStudio.app/Contents/Info.plist") %>% 
  jsonlite::fromJSON() %>% 
  str(1)
## List of 32
##  $ NSPrincipalClass                     : chr "NSApplication"
##  $ NSCameraUsageDescription             : chr "R wants to access the camera."
##  $ CFBundleIdentifier                   : chr "org.rstudio.RStudio"
##  $ CFBundleShortVersionString           : chr "1.4.1093-1"
##  $ NSBluetoothPeripheralUsageDescription: chr "R wants to access bluetooth."
##  $ NSRemindersUsageDescription          : chr "R wants to access the reminders."
##  $ NSAppleEventsUsageDescription        : chr "R wants to run AppleScript."
##  $ NSHighResolutionCapable              : logi TRUE
##  $ LSRequiresCarbon                     : logi TRUE
##  $ NSPhotoLibraryUsageDescription       : chr "R wants to access the photo library."
##  $ CFBundleGetInfoString                : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ NSLocationWhenInUseUsageDescription  : chr "R wants to access location information."
##  $ CFBundleInfoDictionaryVersion        : chr "6.0"
##  $ NSSupportsAutomaticGraphicsSwitching : logi TRUE
##  $ CSResourcesFileMapped                : logi TRUE
##  $ CFBundleVersion                      : chr "1.4.1093-1"
##  $ OSAScriptingDefinition               : chr "RStudio.sdef"
##  $ CFBundleLongVersionString            : chr "1.4.1093-1"
##  $ CFBundlePackageType                  : chr "APPL"
##  $ NSContactsUsageDescription           : chr "R wants to access contacts."
##  $ NSCalendarsUsageDescription          : chr "R wants to access calendars."
##  $ NSMicrophoneUsageDescription         : chr "R wants to access the microphone."
##  $ CFBundleDocumentTypes                :'data.frame':  16 obs. of  8 variables:
##  $ NSPhotoLibraryAddUsageDescription    : chr "R wants write access to the photo library."
##  $ NSAppleScriptEnabled                 : logi TRUE
##  $ CFBundleExecutable                   : chr "RStudio"
##  $ CFBundleSignature                    : chr "Rstd"
##  $ NSHumanReadableCopyright             : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ CFBundleName                         : chr "RStudio"
##  $ LSApplicationCategoryType            : chr "public.app-category.developer-tools"
##  $ CFBundleIconFile                     : chr "RStudio.icns"
##  $ CFBundleDevelopmentRegion            : chr "English"

FIN

A source_swift() function is on the horizon as is adding a ton of checks/validations to swift_function(). I’ll likely be adding some of the SEXP and R Swift utility functions I’ve demonstrated in the [unfinished] book to make it fairly painless to interface Swift and R code in this new and forthcoming function.

As usual, kick the tyres, submit feature requests and bugs in any forum that’s comfortable and stay strong, wear a 😷, and socially distanced when out and about.

The previous post introduced the topic of how to compile Swift code for use in R using a useless, toy example. This one goes a bit further and makes a case for why one might want to do this by showing how to use one of Apple’s machine learning libraries, specifically the Natural Language one, focusing on extracting parts of speech from text.

I made a parts-of-speech directory to keep the code self-contained. In it are two files. The first is partsofspeech.swift (swiftc seems to dislike dashes in names of library code and I dislike underscores):

import NaturalLanguage
import CoreML

extension Array where Element == String {
  var SEXP: SEXP? {
    let charVec = Rf_protect(Rf_allocVector(SEXPTYPE(STRSXP), count))
    defer { Rf_unprotect(1) }
    for (idx, elem) in enumerated() { SET_STRING_ELT(charVec, idx, Rf_mkChar(elem)) }
    return(charVec)
  }
}

@_cdecl ("part_of_speech")
public func part_of_speech(_ x: SEXP) -> SEXP {

  let text = String(cString: R_CHAR(STRING_ELT(x, 0)))
  let tagger = NLTagger(tagSchemes: [.lexicalClass])

  tagger.string = text

  let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

  var txts = [String]()
  var tags = [String]()

  tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
    if let tag = tag {
      txts.append("\(text[tokenRange])")
      tags.append("\(tag.rawValue)")
    }
    return true
  }

  let out = Rf_protect(Rf_allocVector(SEXPTYPE(VECSXP), 2))
  SET_VECTOR_ELT(out, 0, txts.SEXP)
  SET_VECTOR_ELT(out, 1, tags.SEXP)
  Rf_unprotect(1)

  return(out!)
}

The other is bridge code that seems to be the same for every one of these (or could be) so I’ve just named it swift-r-glue.h (it’s the same as the bridge code in the previous post):

#define USE_RINTERNALS

#include <R.h>
#include <Rinternals.h>

const char* R_CHAR(SEXP x);

Let’s walk through the Swift code.

We need to two imports:

import NaturalLanguage
import CoreML

to make use of the NLP functionality provided by Apple.

The following extension to the String Array class:

extension Array where Element == String {
  var SEXP: SEXP? {
    let charVec = Rf_protect(Rf_allocVector(SEXPTYPE(STRSXP), count))
    defer { Rf_unprotect(1) }
    for (idx, elem) in enumerated() { SET_STRING_ELT(charVec, idx, Rf_mkChar(elem)) }
    return(charVec)
  }
}

will reduce the amount of code we need to type later on to turn Swift String Arrays to R character vectors.

The start of the function:

@_cdecl ("part_of_speech")
public func part_of_speech(_ x: SEXP) -> SEXP {

tells swiftc to make this a C-compatible call and notes that the function takes one parameter (in this case, it’s expecting a length 1 character vector) and returns an R-compatible value (which will be a list that we’ll turn into a data.frame in R just for brevity).

The following sets up our inputs and outputs:

  let text = String(cString: R_CHAR(STRING_ELT(x, 0)))
  let tagger = NLTagger(tagSchemes: [.lexicalClass])

  tagger.string = text

  let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

  var txts = [String]()
  var tags = [String]()

We convert the passed-in parameter to a Swift String, initialize the NLP tagger, and setup two arrays to hold the results (sentence component in txts and the part of speech that component is in tags).

The following code is mostly straight from Apple and (inefficiently) populates the previous two arrays:


tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in if let tag = tag { txts.append("\(text[tokenRange])") tags.append("\(tag.rawValue)") } return true }

Finally, we use the Swift-R bridge to make a list much like one would in C:


let out = Rf_protect(Rf_allocVector(SEXPTYPE(VECSXP), 2)) SET_VECTOR_ELT(out, 0, txts.SEXP) SET_VECTOR_ELT(out, 1, tags.SEXP) Rf_unprotect(1) return(out!)

To get a shared library we can use from R, we just need to compile this like last time:

swiftc \
  -I /Library/Frameworks/R.framework/Headers \
  -F/Library/Frameworks \
  -framework R \
  -import-objc-header swift-r-glue.h \
  -emit-library \
  partsofspeech.swift

Let’s run that on some text! First, we’ll load the new shared library into R:

dyn.load("libpartsofspeech.dylib")

Next, we’ll make a wrapper function to avoid messy .Call(…)s and to make a data.frame:

parts_of_speech <- function(x) {
  res <- .Call("part_of_speech", x)  
  as.data.frame(stats::setNames(res, c("name", "tag")))
}

Finally, let’s try this on some text!

tibble::as_tibble(
  parts_of_speech(paste0(c(
"The comm wasn't working. Feeling increasingly ridiculous, he pushed",
"the button for the 1MC channel several more times. Nothing. He opened",
"his eyes and saw that all the lights on the panel were out. Then he",
"turned around and saw that the lights on the refrigerator and the",
"ovens were out. It wasn’t just the coffeemaker; the entire galley was",
"in open revolt. Holden looked at the ship name, Rocinante, newly",
"stenciled onto the galley wall, and said, Baby, why do you hurt me",
"when I love you so much?"
  ), collapse = " "))
)
## # A tibble: 92 x 2
##    name         tag
##    <chr>        <chr>
##  1 The          Determiner
##  2 comm         Noun
##  3 was          Verb
##  4 n't          Adverb
##  5 working      Verb
##  6 Feeling      Verb
##  7 increasingly Adverb
##  8 ridiculous   Adjective
##  9 he           Pronoun
## 10 pushed       Verb
## # … with 82 more rows

FIN

If you’re playing along at home, try adding a function to this Swift file that uses Apple’s entity tagger.

The next installment of this topic will be how to wrap all this into a package (then all these examples get tweaked and go into the tome.

I’ve been on a Swift + R bender for a while now, but have been envious of the pure macOS/iOS (et al) folks who get to use Apple’s seriously ++good machine learning libraries, which are even more robust on the new M1 hardware (it’s cool having hardware components dedicated to improving the performance of built models).

Sure, it’s pretty straightforward to make a command-line utility that can take data input, run them through models, then haul the data back into R, but I figured it was about time that Swift got the “Rust” and “Go” treatment in terms of letting R call compiled Swift code directly. Thankfully, none of this involves using Xcode since it’s one of the world’s worst IDEs.

To play along at home you’ll need macOS and at least the command line tools installed (I don’t think this requires a full Xcode install, but y’all can let me know if it does in the comments). If you can enter swiftc at a terminal prompt and get back <unknown>:0: error: no input files then you’re good-to-go.

Hello, Swift!

To keep this post short (since I’ll be adding this entire concept to the SwiftR tome), we’ll be super-focused and just build a shared library we can dynamically load into R. That library will have one function which will be to let us say hello to the planet with a customized greeting.

Make a new directory for this effort (I called mine greetings) and create a greetings.swift file with the following contents:

All this code is also in this gist.

@_cdecl ("greetings_from")
public func greetings_from(_ who: SEXP) -> SEXP {
  print("Greetings, 🌎, it's \(String(cString: R_CHAR(STRING_ELT(who, 0))))!")
  return(R_NilValue)
}

Before I explain what’s going on there, also create a geetings.h file with the following contents:

#define USE_RINTERNALS

#include <R.h>
#include <Rinternals.h>

const char* R_CHAR(SEXP x);

In the Swift file, there’s a single function that takes an R SEXP and converts it into a Swift String which is then routed to stdout (not a “great” R idiom, but benign enough for an intro example). Swift functions aren’t C functions and on their own do not adhere to C calling conventions. Unfortunately R’s ability to work with dynamic library code requires such a contract to be in place. Thankfully, the Swift Language Overlords provided us with the ability to instruct the compiler to create library code that will force the calling conventions to be C-like (that’s what the @cdecl is for).

We’re using SEXP, some R C-interface functions, and even the C version of NULL in the Swift code, but we haven’t done anything in the Swift file to tell Swift about the existence of these elements. That’s what the C header file is for (I added the R_CHAR declaration since complex C macros don’t work in Swift).

Now, all we need to do is make sure the compiler knows about the header file (which is a “bridge” between C and Swift), where the R framework is, and that we want to generate a library vs a binary executable file as we compile the code. Make sure you’re in the same directory as both the .swift and .h file and execute the following at a terminal prompt:

swiftc \
  -I /Library/Frameworks/R.framework/Headers \ # where the R headers are
  -F/Library/Frameworks \                      # where the R.framework lives
  -framework R \                               # we want to link against the R framework
  -import-objc-header greetings.h \            # this is our bridging header which will make R C things available to Swift
  -emit-library \                              # we want a library, not an exe
  greetings.swift                              # our file!

If all goes well, you should have a libgreetings.dylib shared library in that directory.

Now, fire up a R console session in that directory and do:

greetings_lib <- dyn.load("libgreetings.dylib")

If there are no errors, the shared library has been loaded into your R session and we can use the function we just made! Let’s wrap it in an R function so we’re not constantly typing .Call(…):

greetings_from <- function(who = "me") {
  invisible(.Call("greetings_from", as.character(who[1])))
}

I also took the opportunity to make sure we are sending a length-1 character vector to the C/Swift function.

Now, say hello!

greetings_from("hrbrmstr")

And you should see:

Greetings, 🌎, it's hrbrmstr!

FIN

We’ll stop there for now, but hopefully this small introduction has shown how straightforward it can be to bridge Swift & R in the other direction.

I’ll have another post that shows how to extend this toy example to use one of Apple’s natural language processing libraries, and may even do one more on how to put all this into a package before I shunt all the individual posts into a few book chapters.

There was an org that didn’t see
The data exfil hacking spree.
A patch went up, our guard was down,
Oh blow, SolarWinds, blow.

Soon may the Vendorman come,
And bring us Yara rules to run.
One day when their huntin’ is done,
They’ll take their scripts and go.

There was no implant here before,
But after was a sly backdoor.
The C2 signals they did soar
And labored low and slow.

As the attack did laterally move
Our foe did get into their groove;
Stealing certs; installing ‘sploits
Wher’er they did go.

Then one day in the by and by,
Some clever folk with a firey-eye,
Spotted something amiss in Orion’s belt
And a delvin’ they did go.

The sunburst cleared and they did see
The origin of their recent misery.
A blog went up; they did proclaim
Many other orgs were laid low.

As far as I’ve heard, the fight’s still on;
The C2s blocked but the attackers aren’t done.
The Vendorman makes his regular call
To help CISO, crew and all.