Horrible puns aside, hopefully everyone saw the news, earlier this week, from @thomasp85 on the evolution of modern typographic capabilities in the R ecosystem. Thomas (and some cohorts) has been working on {systemfonts}, {ragg}, and {textshaping} for quite a while now, and the — shall we say tidyglyphs ecosystem — is super-ready for prime time.

Thomas covered a seriously large amount of ground in his post, so please take some time to digest that before continuing.

Back? 👍🏽

While it is possible to manage typographic needs with the foundry tools provided via the font-rendering package-triad, one would be hard-pressed to say that the following is “fun”, or even truly manageable, coding:

library(systemfonts)

register_variant(
  name = "some-unique-prefix Inter some-style-01",
  weight = "normal",
  features = font_feature(
    poss = 1, ibly = 1, many = 1, 
    four = 1, char = 1, open = 1,
    type = 1, code = 1, spec = 1
  )
)

# remember that name

register_variant(
  name = "some-unique-prefix Inter some-style-02",
  weight = "normal",
  features = font_feature(
    poss = 1, ibly = 1, many = 1, 
    four = 1, char = 1, open = 1,
    type = 1, code = 1, spec = 1
  )
)

# remember that name 

# add a dozen more lines ...

ggplot() +
   geom_text(family = "oops-i-just-misspelled-the-family-name-*again*", ...) 

We’ve been given the power to level up our chart typography, but we’re still working at roughly the level of literal typesetters (the ones who put blocks of type into a press). We can make our lives easier and our charts prettier with the help of a new package — {hrbragg} https://git.rud.is/hrbrmstr/hrbragg — which is something of a bridge between {ragg}, {systemfonts}, {textshaping}, and a surprisingly popular package of mine: {hrbrthemes}. {hrbragg} is separate from {hrbrthemes} since this new typographic ecosystem is fairly restricted to {ragg} graphics devices (for the moment, as Thomas alluded to the other day), and the new themes provided in {hrbragg} are a bit of a level-up from those in its sibling package.

Feature Management

At the heart of {systemfonts} lies the ability to tweak font features and bend them to your will. This somewhat old post shows why these tweaks exist and delves (but not too deeply) into the details of them, down to the four-letter codes that are used to represent and work with a given feature. But, what does calt mean? And, what is this tnum fellow you’ll be seeing a great deal of in R-land over the coming months? While one could leave the comfort of RStudio, VS Code, or vim to visit one of the reference links in Thomas’ package or {hrbragg}, I’ve included the most recent copy of the tag-code<->full-tag-name<->short-tag-description mapping in {hrbragg} as a usable data frame so you can treat it like the data it is!

library(systemfonts) # access to and tweaking OTFs!
library(textshaping) # lets us treat type as data
library(ragg)        # because it'll be lonely w/o the other two
library(hrbragg)     # remotes::install_git("https://git.rud.is/hrbrmstr/hrbragg.git")
library(tidyverse)   # nice printing, {ggplot2}, and b/c we'll do some font data wrangling

data("feature_dict")

feature_dict
## # A tibble: 122 x 3
##    tag   long_name                   description                                                                                             
##    <chr> <chr>                       <chr>                                                                                                   
##  1 aalt  Access All Alternates       Special feature: used to present user with choice all alternate forms of the character                  
##  2 abvf  Above-base Forms            Replaces the above-base part of a vowel sign. For Khmer and similar scripts.                            
##  3 abvm  Above-base Mark Positioning Positions a mark glyph above a base glyph.                                                              
##  4 abvs  Above-base Substitutions    Ligates a consonant with an above-mark.                                                                 
##  5 afrc  Alternative Fractions       Converts figures separated by slash with alternative stacked fraction form                              
##  6 akhn  Akhand                      Hindi for unbreakable.  Ligates consonant+halant+consonant, usually only for k-ss and j-ny combinations.
##  7 blwf  Below-base Forms            Replaces halant+consonant combination with a subscript form.                                            
##  8 blwm  Below-base Mark Positioning Positions a mark glyph below a base glyph                                                               
##  9 blws  Below-base Substitutions    Ligates a consonant with a below-mark.                                                                  
## 10 c2pc  Capitals to Petite Caps     Substitutes capital letters with petite caps                                                            
## # … with 112 more rows

You can also use help("opentype_typographic_features") to see an R help page with the same information. That page also has links to external resources, one of which is a detailed manual of each feature with use-cases (in the event even the short description is not as helpful as it could be).

Before one can think about using the bare-metal register_variant(..., font_feature(...)) duo, one has to know what features a particular type family supports. We can retrieve the feature codes with textshaping::get_font_features() and look them up in this data frame to get an at-a-glance view:

# old school subsetting ftw!
feature_dict[feature_dict$tag %in% textshaping::get_font_features("Inter")[[1]],]
## # A tibble: 19 x 3
##    tag   long_name                   description                                                                                                                 
##    <chr> <chr>                       <chr>                                                                                                                       
##  1 aalt  Access All Alternates       Special feature: used to present user with choice all alternate forms of the character                                      
##  2 calt  Contextual Alternates       Applies a second substitution feature based on a match of a character pattern within a context of surrounding patterns      
##  3 case  Case Sensitive Forms        Replace characters, especially punctuation, with forms better suited for all-capital text, cf. titl                         
##  4 ccmp  Glyph Composition/Decompos… Either calls a ligature replacement on a sequence of characters or replaces a character with a sequence of glyphs. Provides…
##  5 cpsp  Capital Spacing             Adjusts spacing between letters in all-capitals text                                                                        
##  6 dlig  Discretionary Ligatures     Ligatures to be applied at the user's discretion                                                                            
##  7 dnom  Denominator                 Converts to appropriate fraction denominator form, invoked by frac                                                          
##  8 frac  Fractions                   Converts figures separated by slash with diagonal fraction                                                                  
##  9 kern  Kerning                     Fine horizontal positioning of one glyph to the next, based on the shapes of the glyphs                                     
## 10 locl  Localized Forms             Substitutes character with the preferred form based on script language                                                      
## 11 mark  Mark Positioning            Fine positioning of a mark glyph to a base character                                                                        
## 12 numr  Numerator                   Converts to appropriate fraction numerator form, invoked by frac                                                            
## 13 ordn  Ordinals                    Replaces characters with ordinal forms for use after numbers                                                                
## 14 pnum  Proportional Figures        Replaces numerals with glyphs of proportional width, often also onum                                                        
## 15 salt  Stylistic Alternates        Either replaces with, or displays list of, stylistic alternatives for a character                                           
## 16 subs  Subscript                   Replaces character with subscript version, cf. numr                                                                         
## 17 sups  Superscript                 Replaces character with superscript version, cf. dnom                                                                       
## 18 tnum  Tabular Figures             Replaces numerals with glyphs of uniform width, often also lnum                                                             
## 19 zero  Slashed Zero                Replaces 0 figure with slashed 0        

Most of those will not be super-useful (yet) but there are three key features that I believe one needs when picking a font for a chart:

  • One of the *ligs (because ligatures are so gosh darn cool, pretty, and useful)
  • tnum for tabular numbers (essential in axis value display, and more)
  • kern for sweet, sweet letterspacing, or kerning
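
Wiring just those three into a one-off variant with the bare-metal {systemfonts} duo might look like this (a sketch only; the variant name is made up, and liga is the common-ligatures feature tag):

register_variant(
  name = "sketch Inter liga+tnum+kern",
  family = "Inter",
  features = font_feature(liga = 1, tnum = 1, kern = 1)
)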

Since I’ve just made up a rule, let’s see how many fonts I have that support said rule:

(fam <- unique(system_fonts()[["family"]])) %>% 
  get_font_features() %>% 
  set_names(fam) %>% 
  keep(~sum(c(
    any(grepl("kern", .)), 
    any(grepl("tnum", .)),
    any(grepl(".lig|liga", .)) 
    )) == 3
  ) %>% 
  names() %>% 
  sort()
##  [1] "Barlow"                 "Goldman Sans"           "Goldman Sans Condensed" "Grantha Sangam MN"     
##  [5] "Inter"                  "Kohinoor Devanagari"    "Mukta Mahee"            "Museo Slab"            
##  [9] "Neufile Grotesk"        "Roboto"                 "Roboto Black"           "Roboto Condensed"      
## [13] "Roboto Light"           "Roboto Medium"          "Roboto Thin"            "Tamil Sangam MN"       
## [17] "Trattatello"           

I do have more, but they’re on a different Mac 😎.

{hrbragg} comes with Inter, Goldman Sans, and Roboto Condensed, so let’s explore one of them — Inter — and see how we might be able to make it useful but not tedious. The supported features of Inter are shown above, and here are the family members:

system_fonts() %>% 
  filter(family == "Inter") %>% 
  select(name, family, style, weight, width, italic, monospace)
## # A tibble: 18 x 7
##    name                   family style              weight     width  italic monospace
##    <chr>                  <chr>  <chr>              <ord>      <ord>  <lgl>  <lgl>    
##  1 Inter-ExtraLight       Inter  Extra Light        light      normal FALSE  FALSE    
##  2 Inter-MediumItalic     Inter  Medium Italic      medium     normal TRUE   FALSE    
##  3 Inter-ExtraLightItalic Inter  Extra Light Italic light      normal TRUE   FALSE    
##  4 Inter-Bold             Inter  Bold               bold       normal FALSE  FALSE    
##  5 Inter-ThinItalic       Inter  Thin Italic        ultralight normal TRUE   FALSE    
##  6 Inter-SemiBold         Inter  Semi Bold          semibold   normal FALSE  FALSE    
##  7 Inter-BoldItalic       Inter  Bold Italic        bold       normal TRUE   FALSE    
##  8 Inter-Italic           Inter  Italic             normal     normal TRUE   FALSE    
##  9 Inter-Medium           Inter  Medium             medium     normal FALSE  FALSE    
## 10 Inter-BlackItalic      Inter  Black Italic       heavy      normal TRUE   FALSE    
## 11 Inter-Light            Inter  Light              normal     normal FALSE  FALSE    
## 12 Inter-SemiBoldItalic   Inter  Semi Bold Italic   semibold   normal TRUE   FALSE    
## 13 Inter-Regular          Inter  Regular            normal     normal FALSE  FALSE    
## 14 Inter-ExtraBoldItalic  Inter  Extra Bold Italic  ultrabold  normal TRUE   FALSE    
## 15 Inter-LightItalic      Inter  Light Italic       normal     normal TRUE   FALSE    
## 16 Inter-Thin             Inter  Thin               ultralight normal FALSE  FALSE    
## 17 Inter-ExtraBold        Inter  Extra Bold         ultrabold  normal FALSE  FALSE    
## 18 Inter-Black            Inter  Black              heavy      normal FALSE  FALSE    

Nobody. I mean, nobody wants to type eighteen+ font variant registration statements, which is why {hrbragg} comes with reconfigure_font(). Just give it the family name, the features you want supported, and it will take care of the tedium for you:

reconfigure_font(
  prefix = "hrbragg-pkg",
  family = "Inter",
  width = "normal",
  ligatures = "discretionary",
  calt = 1, tnum = 1, case = 1,
  dlig = 1, ss01 = 1, kern = 1,
  zero = 0, salt = 0
) -> customized_inter

# I'll have a proper print method for this soon

str(customized_inter, 1)
## List of 17
##  $ ultralight_italic: chr "clever-prefix Inter Thin Italic"
##  $ ultralight       : chr "clever-prefix Inter Thin"
##  $ light            : chr "clever-prefix Inter Extra Light"
##  $ light_italic     : chr "clever-prefix Inter Extra Light Italic"
##  $ normal_italic    : chr "clever-prefix Inter Light Italic"
##  $ normal_light     : chr "clever-prefix Inter Light"
##  $ normal           : chr "clever-prefix Inter Regular"
##  $ medium_italic    : chr "clever-prefix Inter Medium Italic"
##  $ medium           : chr "clever-prefix Inter Medium"
##  $ semibold         : chr "clever-prefix Inter Semi Bold"
##  $ semibold_italic  : chr "clever-prefix Inter Semi Bold Italic"
##  $ bold             : chr "clever-prefix Inter Bold"
##  $ bold_italic      : chr "clever-prefix Inter Bold Italic"
##  $ ultrabold_italic : chr "clever-prefix Inter Extra Bold Italic"
##  $ ultrabold        : chr "clever-prefix Inter Extra Bold"
##  $ heavy_italic     : chr "clever-prefix Inter Black Italic"
##  $ heavy            : chr "clever-prefix Inter Black"
##  - attr(*, "family")= chr "Inter"

The reconfigure_font() function applies the feature settings to all the family members, gives each a name with the stated prefix, and provides a return value that supports autocompletion in smart IDEs, practically negating the need to type out long, unique font names, like this:

ggplot() +
  geom_text(
    aes(1, 2, label = "Welcome to a <- customized -> Inter!"),
    size = 6, family = customized_inter$ultrabold
  ) +
  theme_void()

Note that we have a lovely emboldened font with clean ligatures without much work at all! (I should mention that if a prefix is not specified, a UUID is chosen instead since we don’t really care about the elongated names anymore).

While we’ve streamlined things a bit already, we can do even better.

Font-centric Themes

Just like {hrbrthemes}, {hrbragg} comes with some font/typographic-centric themes. We’ll focus on the one with Inter for the blog post. For the moment, you’ll need to install_inter() (you likely got prompted to do that if you already installed the package). This requirement will go away soon, but you’ll want to use Inter everywhere anyway, so I’d keep it installed.
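
That one-time setup is just the following (assuming the no-argument call; check the function’s help for options):

hrbragg::install_inter()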

Once that’s done, you’re ready to use theme_inter().

What’s that you say? Don’t we need to create a font variant first?

Would I do that to you? Never! {hrbragg} comes with a preconfigured inter_pkg font variant (which I’ll be tweaking a bit over the weekend for some edge cases) that pairs nicely with theme_inter(). Here it is in action with an old friend of ours:

ggplot() +
  geom_point(
    data = mtcars,
    aes(mpg, wt, color = factor(cyl))
  ) +
  geom_label(
    aes(
      x = 15, y = 5.48,
      label = "<- A fairly useless annotation\n       that uses the custom Inter\n          variant by default."
    ),
    label.size = 0, hjust = 0, vjust = 1
  ) +
  labs(
    x = "Fuel efficiency (mpg)", y = "Weight (tons)",
    title = "Seminal ggplot2 scatterplot example",
    subtitle = "A plot that is only useful for demonstration purposes",
    caption = "Brought to you by the letter 'g'"
  ) -> gg1

gg1 + theme_inter(grid = "XY", mode = "light") 

Wonderful kerning, a custom-built arrow due to fantastic, built-in ligatures, and spiffy tabular numbers. Gorgeous!

What was that you just asked? What’s up with that mode = "light"? Did I forget to mention that all the {hrbragg} themes come with dark-mode support built in? My sincerest apologies. Choosing mode = "dark" will use a (configurable) dark theme, and using mode = "rstudio" (if you’re an RStudio user) will have the charts take on the IDE theme setting automagically. Here’s dark mode:

gg1 + theme_inter(grid = "XY", mode = "dark") 

The font+theme pairs automatically work and reconfigure all the ggplot2 aesthetic defaults accordingly. Since this makes heavy use of update_geom_defaults() I’ve included a (very necessary) reset_ggplot2_defaults() to get things back to normal when you need to.

Note that you can use adaptive_color() to help enable dark/light-mode color switching for your own pairings, and theme_background_color() or theme_foreground_color() to utilize the (reconfigurable) default fore- and background theme colors.
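
As a hedged sketch (the light-value/dark-value argument order here is an assumption, not a verified signature), pairing a custom annotation color with the mode switching might look like:

gg1 +
  geom_hline(
    yintercept = 4, linetype = "dotted",
    color = adaptive_color("#2b2b2b", "#e0e0e0") # assumption: adaptive_color(light, dark)
  ) +
  theme_inter(grid = "XY", mode = "dark")

# the themes lean heavily on update_geom_defaults(), so restore the stock
# {ggplot2} defaults when switching back to non-{hrbragg} themes
reset_ggplot2_defaults()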

Try before you buy…into using a given font

One can’t know ahead of time whether a font is going to work well, and you might want to get a feel for how a given set of family variants works for you. To that end, I’ve made it possible to preview any font you’ve reconfigured with reconfigure_font() via preview_variant(). It uses some pre-set text that exercises the key features I’ve outlined, but you can sub in your own if you want to look at something in particular. Let’s give inter_pkg a complete look:

preview_variant(inter_pkg)

We can look at another one that we’ll create now (I did not realize this font had tabular numbers until Thomas built all these wonderful toys to play with!):

reconfigure_font(
  family = "Trattatello",
  width = "normal",
  ligatures = "discretionary",
  calt = 1, tnum = 1, case = 1,
  dlig = 1, kern = 1,
  zero = 0, salt = 0
) -> trat

preview_variant(trat)

FIN

The {hrbragg} package is not even 24 hours old yet, so there are breaking changes and many new, heh, features still to come, but please — as usual — kick the tyres and post questions, feedback, contributions, or suggestions wherever you’re most comfortable (the package is on most of the popular social coding sites).

💙 Expand for EKG code
library(hrbrthemes)
library(elementalist) # remotes::install_github("teunbrand/elementalist")
library(tidyverse)    # read_csv(), mutate(), and %>% are used below

read_csv(
  file = "~/Data/apple_health_export/electrocardiograms/ecg_2020-09-24.csv", # this is extracted below
  skip = 12,
  col_names = "µV"
) %>% 
  mutate(
    idx = 1:n()
  ) -> ekg

ggplot() +
  geom_line_theme(
    data = ekg %>% tail(3000) %>% head(2500),
    aes(idx, µV),
    size = 0.125, color = "#cb181d"
  ) +
  labs(x = NULL, y = NULL) +
  theme_ipsum_inter(grid="") +
  theme(
    panel.background = element_rect(color = NA, fill = "#141414"),
    plot.background = element_rect(color = NA, fill = "#141414")
  ) +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    elementalist.geom_line = element_line_glow()
  )

Apple Watch owners have the ability to export their tracked data and do whatever they like with it. Since it’s Valentine’s Day, I thought it might be fun to show two ways to read heart rate data from these exports.

Why two ways? Well, I’ve owned an Apple Watch off-and-on ever since the first generation device, and when Apple says you can export all your data, they mean all. The apple_health_export.zip archive is generated by going to the “Health” iOS app, tapping your avatar in the upper left, then scrolling down and tapping the export button:

apple health data export screenshot

(NOTE: I suggest saving it to and then downloading it from iCloud vs using local AirDrop to your system.)

This compressed file is a deceiving ~58 MB in size. Opening it up results in a directory tree consuming nearly 3 GB of drive space O_o. That tree has the following structure:

fs::dir_tree("~/Data/apple_health_export", recurse = 1)
## ~/Data/apple_health_export
## ├── electrocardiograms
## │   └── ecg_2020-09-24.csv             # 122 KB
## ├── export.xml                         # 882 MB
## ├── export_cda.xml                     # 950 MB
## └── workout-routes                     #  81 MB
##     ├── ...
##     ├── route_2021-01-28_5.21pm.gpx
##     ├── route_2021-01-31_4.28pm.gpx
##     ├── route_2021-02-02_1.26pm.gpx
##     ├── route_2021-02-04_3.52pm.gpx
##     ├── route_2021-02-06_2.24pm.gpx
##     └── route_2021-02-10_4.54pm.gpx

The heart rate data is in the just-under 1 GB export.xml and is mixed in with all the other data points Apple records. They look like this:

<Record 
  type="HKQuantityTypeIdentifierHeartRate" 
  sourceName="Apple Watch" 
  sourceVersion="3.2" 
  device="<<HKDevice: 0x2812d8a00>, name:Apple Watch, manufacturer:Apple, model:Watch, hardware:Watch1,2, software:3.2>" 
  unit="count/min" 
  creationDate="2017-04-29 12:21:15 -0500" 
  startDate="2017-04-29 12:21:15 -0500" 
  endDate="2017-04-29 12:21:15 -0500" 
  value="102"
/>

Note that newer records of this type are not empty tags.

While dealing with gigabyte+ XML files is not nearly as untenable in R as it used to be, building a parsed XML tree in memory for all of those records will take up a non-insignificant amount of RAM (we’ll see how much below). Since I want to start playing with this data more often, I decided to try two approaches: one that processes the XML in streaming “chunks” and one that does it the way you’re likely used to (if you’re unfortunate enough to have to work with XML regularly).

Streaming 💙 Beats

We’ll start with the streaming approach, which means using the venerable {XML} package and its xmlEventParse(), an event-driven or SAX (Simple API for XML) style parser that processes XML without building a tree; instead, it identifies tokens in the stream of characters and passes them to handlers that can make sense of them in context. Since we’re going old-school, we’ll also use {data.table} to get a tidy dataset to work with.

We’re going to be finding heart rate records and storing the data from them in a list, so we’ll need to make room for them and use index-based value assignments to avoid making thousands of copies with append(). To figure out how much room we’ll need, I’m going to “cheat” a bit and use ripgrep to count how many HKQuantityTypeIdentifierHeartRate records exist and use that result to reserve list space:

library(XML)
library(data.table)

nl <- system("rg -c 'type=\"HKQuantityTypeIdentifierHeartRate' ~/Data/apple_health_export/export.xml", intern = TRUE)
records <- vector(mode = "list", as.numeric(nl))
idx <- 1

There are just under 790K records buried in that file. The xmlEventParse() function has a handlers parameter which takes a named list of functions for various events. The event we care about is the one fired when the parser starts processing an XML element, which is unsurprisingly called startElement. In it, we’ll only process HKQuantityTypeIdentifierHeartRate records and further only care about data since 2019:

invisible(xmlEventParse(
  file = "~/Data/apple_health_export/export.xml",
  handlers = list(

    # process at element start

    startElement = function(name, attrs) {

      # only care about the heart rate recs

      if ((name == "Record") && (attrs["type"] == "HKQuantityTypeIdentifierHeartRate")) {

        # only care about records >= the year 2019

        if (substr(attrs["endDate"], 1, 4) >= 2019) {

          # if we find them, add them to the list (note the <<-)
          records[idx] <<- list(as.list(unname(attrs[c("endDate", "value")]))) # not using names reduces memory
          idx <<- idx + 1

        }
      }
    }
  )
))

At this point we have a list of all those records and have taken the R session memory from 131 MiB to 629 MiB (so, we’re eating about ~500 MiB of RAM with that call), and it took around 34 painful seconds to process the XML file.

Now, we’ll use {data.table} to tidy it up:

records <- records[lengths(records) != 0]         # get rid of any list elements we didn't use

records <- rbindlist(records, use.names = FALSE)  # make a data frame
setattr(records, 'names', c("ts", "rate"))

records[, c("ts", "rate") := list(
  as.POSIXct(ts, format = "%Y-%m-%d %H:%M:%S %z"),
  as.integer(rate)
)]  
##                          ts rate
##      1: 2019-02-12 15:19:54   69
##      2: 2019-02-12 15:26:11   90
##      3: 2019-02-12 15:31:33   92
##      4: 2019-02-12 15:34:24   89
##      5: 2019-02-12 15:57:33  120
##     ---                         
## 734526: 2021-02-13 10:17:08  118
## 734527: 2021-02-13 10:26:50  124
## 734528: 2021-02-13 10:22:56  110
## 734529: 2021-02-13 10:34:56   98
## 734530: 2021-02-13 10:39:34   99

That took around 4.5 seconds, and when the R garbage collector kicks in we’re now consuming ~695 MiB, so not much more than the previous step.

So, ~38s for the ingestion & conversion, and a maximum of ~695 MiB in play at any time during the R session. Let’s see how the new/modern way (i.e. {xml2}) compares.

Modern 💙

Unless I missed something in the {xml2} index page, there is no equivalent streaming processor, so we have to read the entire document into active RAM:

library(xml2)
library(tidyverse)

records <- xml2::read_xml("~/Data/apple_health_export/export.xml")

This operation takes 15.7s and the R session now consumes ~5.8 GiB of RAM. That is a “G”, as in gigabyte.

Now, we’ll find all the records that we care about (as above). We’ll do this via a modest XPath selector:

xml_find_all(
  records,
  xpath = "
    .//Record[
         @type = 'HKQuantityTypeIdentifierHeartRate' and
         (starts-with(@endDate, '2019') or 
          starts-with(@endDate, '2020') or 
          starts-with(@endDate, '2021'))
      ]"
) -> records

That operation took around ~6.5s and we’re still consuming around 6.23 GiB of RAM.

Now, we’ll tidy that up:

tibble(
  ts = records %>% 
    xml_attr("endDate") %>% 
    as.POSIXct(format = "%Y-%m-%d %H:%M:%S %z"),  
  rate = records %>% 
    xml_attr("value") %>% 
    as.integer()
) -> records

records
## # A tibble: 734,530 x 2
##    ts                   rate
##    <dttm>              <int>
##  1 2019-02-12 15:19:54    69
##  2 2019-02-12 15:26:11    90
##  3 2019-02-12 15:31:33    92
##  4 2019-02-12 15:34:24    89
##  5 2019-02-12 15:57:33   120
##  6 2019-02-12 15:44:09    80
##  7 2019-02-12 16:03:24   110
##  8 2019-02-12 16:13:08   118
##  9 2019-02-12 16:08:10   100
## 10 2019-02-12 16:15:04    95
## # … with 734,520 more rows

That took around 10.4s and, after garbage collection happens, we’re back to a much more reasonable ~890 MiB of consumed RAM after a workflow maximum of over 6 GiB, taking a total of ~32.6 seconds.

FIN 💙

If/when memory is tight, it’s nice to have some alternatives besides “get a bigger box”, and this is one approach (there are others) for performing this type of XML surgery in R.

Stay safe/strong, folks.

Presented without much commentary since I stopped once {ggrepel} and {graphlayouts} failed to install (RStudio doesn’t support ARM64 R yet, either, which I knew).

The following steps will get you a fully working and STUPID FAST fully native ARM64 M1/Apple Silicon R setup with {tidyverse} and {rJava}.

Just remember, that if you need RStudio (or anything that links against the x86_64 R dylib) you’re going to be reverting this to get stuff done.

# Setting up ARM64 R on Apple Silicon/M1

# I'd run each section by hand, but feel free to live dangerously

# save off what you have installed in homebrew
brew list > ~/Documents/currently-installed-homebrew-formulas.txt

# uninstall x86_64 homebrew
# NOTE that in theory x86_64 and arm64 homebrew can live happily together but
# I went cheap on the SSD in the M1 Mini and wld like the space back
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/uninstall.sh)"

# make sure you unalias "brew" if you aliased it with "arch"
# need to do this in whatever shell startup script(s) you use, too
unalias brew

# install arm64 homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

# do what it says re: paths

# install wget (to make sure stuff is working)
brew install wget 

# place for r-libs; ty R Core & Prof Ripley!
mkdir ~/Downloads/libs-arm64/

# go there
cd ~/Downloads/libs-arm64

# grab'em from https://mac.r-project.org/libs-arm64/
for dl in $(curl -sS "https://mac.r-project.org/libs-arm64/" \
  | xmllint --html --xpath "//td/a[contains(@href, 'tar.gz')]/@href" 2>/dev/null - \
  | sed -e 's/ href="//g' -e 's/"/\n/g'); do wget "https://mac.r-project.org/libs-arm64/${dl}" ; done

# prime sudo (not rly necessary but I dislike having to enter sudo passwords in a for loop)
sudo ls ~/Downloads/libs-arm64

# extract'em
for gz in $(ls ~/Downloads/libs-arm64/*gz); do  
  sudo tar fvxz ${gz} -C /
done

# grab r-devel
cd ~/Downloads
wget https://mac.r-project.org/big-sur/R-devel/arm64/R-devel.tar.gz

# extract it
sudo tar fvxz R*.tar.gz -C /

# install libxml2 and more (to prime libraries)
brew install libxml2 ccache libgit2 unixodbc poppler coreutils

# review ~/Documents/currently-installed-homebrew-formulas.txt and add what you need from there

# put this in your shell startup (macOS folks shld get used to zsh, so ~/.zshrc is a good place to stick it at the end)
export PATH=/opt/R/arm64/bin:$PATH

# and also run it at the command prompt
export PATH=/opt/R/arm64/bin:$PATH

# go for broke!
Rscript -e "install.packages('tidyverse')"

# throw caution to the wind!
Rscript -e "install.packages('devtools')"

# shoot for the moon!
Rscript -e "install.packages(c('DBI', 'odbc'))"

# it's crazytown
Rscript -e "install.packages('ggraph')"

# ARGH! {ggrepel} and {graphlayouts} fail (RStudio won't work anyway so this whole thing was just an exercise)

# setup java; open and run the pkg installer; they have a tar.gz as well 
wget https://cdn.azul.com/zulu/bin/zulu11.45.27-ca-jdk11.0.10-macosx_aarch64.dmg

# tell R about it
R CMD javareconf

# setup JAVA_HOME like you would in your shell

# this is necessary as the Java framework is gone from macOS in Big Sur
Rscript -e 'install.packages("rJava",,"http://rforge.net")'

##################################
# TO ***UNDO*** ALL OF THE ABOVE #
##################################

# uninstall arm64 homebrew
# NOTE that in theory x86_64 and arm64 homebrew can live happily together but
# I went cheap on the SSD in the M1 Mini and wld like the space back
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/uninstall.sh)"

# make sure you re-alias "brew" 
# need to do this in whatever shell startup script(s) you use, too
alias brew='arch --x86_64 brew'

# install x86_64 homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

# install libxml2 and more (to prime libraries)
brew install libxml2 ccache libgit2 unixodbc poppler coreutils

# review ~/Documents/currently-installed-homebrew-formulas.txt and add what you need from there

# remove the following from your shell startup script(s)
export PATH=/opt/R/arm64/bin:$PATH

# re-setup java for x86_64; open and run the pkg installer; they have a tar.gz as well 
wget "https://cdn.azul.com/zulu/bin/zulu11.45.27-ca-jdk11.0.10-macosx_x64.dmg" 

# re-do the parts of the rest of the setup that you need to

Until CRAN has {sf} binaries, use this recipe to build it from source.
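
In the meantime, a minimal sketch of that source install (assuming the arm64 gdal/geos/proj libraries are already installed and discoverable, e.g. via Homebrew):

# {sf} from source; assumes gdal/geos/proj dev libraries are on the box
install.packages("sf", type = "source")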

If you’ve been following me around the internets for a while you’ve likely heard me pontificate about the need to be aware of and reduce — when possible — your personal “cyber” attack surface. One of the ways you can do that is to install as few applications as possible onto your devices and make sure you have a decent handle on what those you’ve kept around are doing or are capable of doing.

On macOS, one application attribute you can look at is the set of “entitlements” apps have asked for and that you have actioned on (i.e. either granted or denied the entitlement request). If you have Developer Tools or Xcode installed you can use the codesign utility (it may be usable w/o the developer tools, but I never run without them so drop a note in the comments if you can confirm this) to see them:

$ codesign -d --entitlements :- /Applications/RStudio.app
Executable=/Applications/RStudio.app/Contents/MacOS/RStudio
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>

  <!-- Required by R packages which want to access the camera. -->
  <key>com.apple.security.device.camera</key>
  <true/>

  <!-- Required by R packages which want to access the microphone. -->
  <key>com.apple.security.device.audio-input</key>
  <true/>

  <!-- Required by Qt / Chromium. -->
  <key>com.apple.security.cs.disable-library-validation</key>
  <true/>
  <key>com.apple.security.cs.disable-executable-page-protection</key>
  <true/>

  <!-- We use DYLD_INSERT_LIBRARIES to load the libR.dylib dynamically. -->
  <key>com.apple.security.cs.allow-dyld-environment-variables</key>
  <true/>

  <!-- Required to knit to Word -->
  <key>com.apple.security.automation.apple-events</key>
  <true/>

</dict>
</plist>

The output is (ugh) XML, and don’t think that all app developers are as awesome as RStudio ones since those comments are pseudo-optional (i.e. you can put junk in them). I’ll continue to use RStudio throughout this example just for consistency.

Since you likely have better things to do than execute a command line tool multiple times and do significant damage to your eyes with all those pointy tags we can use R to turn the apps on our filesystem into data and examine the entitlements in a much more dignified manner.

First, we’ll write a function to wrap the codesign tool execution (and, I’ve leaked how we’re going to eventually look at the results by putting all the library calls up front):

library(XML)
library(stringi)   # stri_count_fixed() & stri_replace_last_regex() are used further down
library(tidyverse)
library(igraph)
library(tidygraph)
library(ggraph)

# rewriting this to also grab the text from the comments is an exercise left to the reader

read_entitlements <- function(app) { 

  system2(
    command = "codesign",
    args = c(
      "-d",
      "--entitlements",
      ":-",
      gsub(" ", "\\\\ ", app)
    ),
    stdout = TRUE
  ) -> x

  x <- paste0(x, collapse = "\n")

  if (nchar(x) == 0) return(tibble())

  x <- XML::xmlParse(x, asText=TRUE)
  x <- try(XML::readKeyValueDB(x), silent = TRUE)

  if (inherits(x, "try-error")) return(tibble())

  x <- sapply(x, function(.x) paste0(.x, collapse=";"))

  if (length(x) == 0) return(tibble())

  data.frame(
    app = basename(app),
    entitlement = make.unique(names(x)),
    value = I(x)
  ) -> x

  x <- tibble::as_tibble(x)

  x

} 

Now, we can slurp up all the entitlements with just a few lines of code:

my_apps <- list.files("/Applications", pattern = "\\.app$", full.names = TRUE)

my_apps_entitlements <- map_df(my_apps, read_entitlements)

my_apps_entitlements %>% 
  filter(grepl("RStudio", app))
## # A tibble: 6 x 3
##   app         entitlement                                              value   
##   <chr>       <chr>                                                    <I<chr>>
## 1 RStudio.app com.apple.security.device.camera                         TRUE    
## 2 RStudio.app com.apple.security.device.audio-input                    TRUE    
## 3 RStudio.app com.apple.security.cs.disable-library-validation         TRUE    
## 4 RStudio.app com.apple.security.cs.disable-executable-page-protection TRUE    
## 5 RStudio.app com.apple.security.cs.allow-dyld-environment-variables   TRUE    
## 6 RStudio.app com.apple.security.automation.apple-events               TRUE 

Having these entitlement strings is great, but what do they mean? Unfortunately, Apple, frankly, sucks at developer documentation, and this suckage shines especially bright when it comes to documenting all the possible entitlements. We can retrieve some of them from the online documentation, so let’s do that and re-look at RStudio:

# a handful of fairly ok json URLs that back the online dev docs; they have ok, but scant entitlement definitions
c(
  "https://developer.apple.com/tutorials/data/documentation/bundleresources/entitlements.json",
  "https://developer.apple.com/tutorials/data/documentation/security/app_sandbox.json",
  "https://developer.apple.com/tutorials/data/documentation/security/hardened_runtime.json",
  "https://developer.apple.com/tutorials/data/documentation/bundleresources/entitlements/system_extensions.json"
) -> entitlements_info_urls

extract_entitlements_info <- function(x) {

  apple_ents_pg <- jsonlite::fromJSON(x)

  apple_ents_pg$references %>% 
    map_df(~{

      if (!hasName(.x, "role")) return(tibble())
      if (.x$role != "symbol") return(tibble())

      tibble(
        title = .x$title,
        entitlement = .x$name,
        description = .x$abstract$text %||% NA_character_
      )

    })

}

entitlements_info_urls %>% 
  map(extract_entitlements_info) %>% 
  bind_rows() %>% 
  distinct() -> apple_entitlements_definitions

# look at rstudio again ---------------------------------------------------

my_apps_entitlements %>% 
  left_join(apple_entitlements_definitions) %>% 
  filter(grepl("RStudio", app)) %>% 
  select(title, description)
## Joining, by = "entitlement"
## # A tibble: 6 x 2
##   title                            description                                                       
##   <chr>                            <chr>                                                             
## 1 Camera Entitlement               A Boolean value that indicates whether the app may capture movies…
## 2 Audio Input Entitlement          A Boolean value that indicates whether the app may record audio u…
## 3 Disable Library Validation Enti… A Boolean value that indicates whether the app may load arbitrary…
## 4 Disable Executable Memory Prote… A Boolean value that indicates whether to disable all code signin…
## 5 Allow DYLD Environment Variable… A Boolean value that indicates whether the app may be affected by…
## 6 Apple Events Entitlement         A Boolean value that indicates whether the app may prompt the use…

It might be interesting to see what the most requested entitlements are:


my_apps_entitlements %>% 
  filter(
    grepl("security", entitlement)
  ) %>% 
  count(entitlement, sort = TRUE)
## # A tibble: 60 x 2
##    entitlement                                            n
##    <chr>                                              <int>
##  1 com.apple.security.app-sandbox                        51
##  2 com.apple.security.network.client                     44
##  3 com.apple.security.files.user-selected.read-write     35
##  4 com.apple.security.application-groups                 29
##  5 com.apple.security.automation.apple-events            26
##  6 com.apple.security.device.audio-input                 19
##  7 com.apple.security.device.camera                      17
##  8 com.apple.security.files.bookmarks.app-scope          16
##  9 com.apple.security.network.server                     16
## 10 com.apple.security.cs.disable-library-validation      15
## # … with 50 more rows

Playing in an app sandbox, talking to the internet, and handling files are unsurprising in the top three slots since that’s how most apps get stuff done for you.

There are a few entitlements which increase your attack surface, one of which is apps that use untrusted third-party libraries:

my_apps_entitlements %>% 
  filter(
    entitlement == "com.apple.security.cs.disable-library-validation"
  ) %>% 
  select(app)
## # A tibble: 15 x 1
##    app                      
##    <chr>                    
##  1 Epic Games Launcher.app  
##  2 GarageBand.app           
##  3 HandBrake.app            
##  4 IINA.app                 
##  5 iStat Menus.app          
##  6 krisp.app                
##  7 Microsoft Excel.app      
##  8 Microsoft PowerPoint.app 
##  9 Microsoft Word.app       
## 10 Mirror for Chromecast.app
## 11 Overflow.app             
## 12 R.app                    
## 13 RStudio.app              
## 14 RSwitch.app              
## 15 Wireshark.app 

(‘Tis ironic that one of Apple’s own apps is in that list.)

What about apps that listen on the network (i.e. are also servers)?
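
A filter like the following (the same pattern as above, just keyed on the server entitlement) produces that list:

my_apps_entitlements %>% 
  filter(
    entitlement == "com.apple.security.network.server"
  ) %>% 
  select(app)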

## # A tibble: 16 x 1
##    app                          
##    <chr>                        
##  1 1Blocker.app                 
##  2 1Password 7.app              
##  3 Adblock Plus.app             
##  4 Divinity - Original Sin 2.app
##  5 Fantastical.app              
##  6 feedly.app                   
##  7 GarageBand.app               
##  8 iMovie.app                   
##  9 Keynote.app                  
## 10 Kindle.app                   
## 11 Microsoft Remote Desktop.app 
## 12 Mirror for Chromecast.app    
## 13 Slack.app                    
## 14 Tailscale.app                
## 15 Telegram.app                 
## 16 xScope.app 

You should read through the retrieved definitions to see what else you may want to observe to be an informed macOS app user.

The Big Picture

Looking at individual apps is great, but why not look at them all? We can build a large, but searchable, network graph hierarchy if we output it as a PDF, so let’s do that:

# this is just some brutish force code to build a hierarchical edge list

my_apps_entitlements %>% 
  distinct(entitlement) %>% 
  pull(entitlement) %>% 
  stri_count_fixed(".") %>% 
  max() -> max_dots

my_apps_entitlements %>% 
  distinct(entitlement, app) %>% 
  separate(
    entitlement,
    into = sprintf("level_%02d", 1:(max_dots+1)),
    fill = "right",
    sep = "\\."
  ) %>% 
  select(
    starts_with("level"), app
  ) -> wide_hierarchy

bind_rows(

  distinct(wide_hierarchy, level_01) %>%
    rename(to = level_01) %>%
    mutate(from = ".") %>%
    select(from, to) %>% 
    mutate(to = sprintf("%s_1", to)),

  map_df(1:nrow(wide_hierarchy), ~{

    wide_hierarchy[.x,] %>% 
      unlist(use.names = FALSE) %>% 
      na.exclude() -> tmp

    tibble(
      from = tail(lag(tmp), -1),
      to = head(lead(tmp), -1),
      lvl = 1:length(from)
    ) %>% 
      mutate(
        from = sprintf("%s_%d", from, lvl),
        to = sprintf("%s_%d", to, lvl+1)
      )

  }) %>% 
    distinct()

) -> long_hierarchy

# all that so we can make a pretty graph! ---------------------------------

g <- graph_from_data_frame(long_hierarchy, directed = TRUE)

ggraph(g, 'partition', circular = FALSE) + 
  geom_edge_diagonal2(
    width = 0.125, color = "gray70"
  ) + 
  geom_node_text(
    aes(
      label = stri_replace_last_regex(name, "_[[:digit:]]+$", "")
    ),
    hjust = 0, size = 3, family = font_gs # font_gs: a font family name set elsewhere in my session
  ) +
  coord_flip() -> gg

# saving as PDF b/c it is ginormous, but very searchable

quartz(
  file = "~/output-tmp/.pdf", # put it where you want
  width = 21,
  height = 88,
  type = "pdf",
  family = font_gs
)
print(gg)
dev.off()

The above generates a large (dimension-wise; it’s <5 MB on disk for me) PDF graph that is barely viewable in thumbnail mode:

Here are some screen captures of portions of it. First are all network servers and clients:

Last are seekrit entitlements only for Apple:

FIN

I’ll likely put a few of these functions into {mactheknife} for easier usage.

After going through this exercise I deleted 11 apps, some for their entitlements and others that I just never use anymore. Hopefully this will help you do some early Spring cleaning as well.

I was chatting with a fellow Amazon Athena user and the topic of using Presto functions such as approx_distinct() via {d[b]plyr} came up. It seems it might not be fully common knowledge that any function without an existing translation is passed to the destination intact. That means you can just “use” approx_distinct() and it will work just fine. Here’s an example using the ODBC {DBI} interface:

library(dbplyr)
library(tidyverse)

# My personal Athena workgroup has been upgraded to "engine 2"
# so Presto 0.217 functions are available. Only noting that for
# folks who may not keep up with AWS announcements.
#
# https://prestodb.io/docs/0.217/index.html

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AuthenticationType = "IAM Profile",
  AWSProfile = "personal",
  MaxCatalogNameLen = 0L,
  MaxSchemaNameLen = 0L,
  MaxColumnNameLen = 0L,
  MaxTableNameLen = 0L,
  UseResultsetStreaming = 1L,
  StringColumnLength = 32 * 1024L,
  S3OutputLocation = "s3://accessible-bucket/"
) -> con

# this comes with Athena
elb_logs <- tbl(con, "elb_logs")

elb_logs
## # Source:   table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    timestamp elbname requestip requestport backendip backendport
##    <chr>     <chr>   <chr>           <int> <chr>           <int>
##  1 2014-09-… lb-demo 251.51.8…       17141 251.111.…        8000
##  2 2014-09-… lb-demo 244.201.…       17141 244.140.…        8888
##  3 2014-09-… lb-demo 242.204.…       17141 255.196.…        8888
##  4 2014-09-… lb-demo 251.51.8…       17141 255.129.…        8888
##  5 2014-09-… lb-demo 242.241.…       17141 255.129.…        8899
##  6 2014-09-… lb-demo 243.198.…       17141 255.129.…        8888
##  7 2014-09-… lb-demo 244.119.…       17141 242.89.1…          80
##  8 2014-09-… lb-demo 254.173.…       17141 251.51.8…        8000
##  9 2014-09-… lb-demo 243.198.…       17141 254.149.…        8888
## 10 2014-09-… lb-demo 249.185.…       17141 241.36.2…        8888
## # … with more rows, and 10 more variables: requestprocessingtime <dbl>,
## #   backendprocessingtime <dbl>, clientresponsetime <dbl>,
## #   elbresponsecode <chr>, backendresponsecode <chr>,
## #   receivedbytes <int64>, sentbytes <int64>, requestverb <chr>,
## #   url <chr>, protocol <chr>

elb_logs %>% 
  summarise(d = n_distinct(backendip)) # 0.62 seconds
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##         d
##   <int64>
## 1    2311

# https://prestodb.io/docs/0.217/functions/aggregate.html#approx_distinct

elb_logs %>% 
  summarise(d = approx_distinct(backendip)) # 0.49 seconds
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##         d
##   <int64>
## 1    2386

In this toy example there’s no real reason to use this alternate function, but on my datasets using the approximator version dramatically reduces query time, reduces query cost, and produces results that by default have a standard error of 2.3% (which is fine for the use-cases I apply this to). There’s an alternate signature which lets you supply the standard error, as well.
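
That two-argument form passes through untranslated as well; for instance (the exact error value is illustrative, within the bounds the Presto docs allow):

elb_logs %>% 
  summarise(d = approx_distinct(backendip, 0.01)) # max standard error of 1%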

If you’re curious as to what functions are translated by default, just use sql_translate_env() on the connection object:

sql_translate_env(con)
## <sql_variant>
## scalar:    -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar:    %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar:    as_date, as_datetime, as.character, as.Date,
## scalar:    as.double, as.integer, as.integer64, as.logical,
## scalar:    as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar:    bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar:    bitwXor, c, case_when, ceil, ceiling, coalesce, cos,
## scalar:    cosh, cot, coth, day, desc, exp, floor, hour, if,
## scalar:    if_else, ifelse, is.na, is.null, log, log10, mday,
## scalar:    minute, month, na_if, nchar, now, paste, paste0, pmax,
## scalar:    pmin, qday, round, second, sign, sin, sinh, sql, sqrt,
## scalar:    str_c, str_conv, str_count, str_detect, str_dup,
## scalar:    str_extract, str_extract_all, str_flatten, str_glue,
## scalar:    str_glue_data, str_interp, str_length, str_locate,
## scalar:    str_locate_all, str_match, str_match_all, str_order,
## scalar:    str_pad, str_remove, str_remove_all, str_replace,
## scalar:    str_replace_all, str_replace_na, str_sort, str_split,
## scalar:    str_split_fixed, str_squish, str_sub, str_subset,
## scalar:    str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar:    str_trunc, str_view, str_view_all, str_which,
## scalar:    str_wrap, substr, substring, switch, tan, tanh, today,
## scalar:    tolower, toupper, trimws, wday, xor, yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window:    cume_dist, cummax, cummean, cummin, cumsum,
## window:    dense_rank, first, lag, last, lead, max, mean, median,
## window:    min, min_rank, n, n_distinct, nth, ntile, order_by,
## window:    percent_rank, quantile, rank, row_number, sd, sum, var

The release of the latest versions of {d[b]plyr} destroyed a lazy, bad hack I was using to cast columns to JSON (you’ll note the lack of a cast() function above, which is necessary for Athena since the casting syntax is not that of a function call). I’m very glad they did, since it’s bad to rely on undocumented functionality and, honestly, it’s pretty straightforward to make an “official” translation for them.

First, we need the class of this Athena ODBC connection:

class(con)
## [1] "Amazon Athena"
## attr(,"package")
## [1] ".GlobalEnv"

We’ll need to write a sql_translation.Amazon Athena() function for this connection class, and we’ll start by writing one that doesn’t handle our casting, just to show the basic setup:

`sql_translation.Amazon Athena` <- function(x) {
  sql_variant(
    dbplyr::base_odbc_scalar,
    dbplyr::base_odbc_agg,
    dbplyr::base_odbc_win
  )
}

All that function is doing (now) is setting up the default translators you’ve seen in the above output listings.

To make it do something else, we need to add casting translator helpers, which fall under the “scalar” category. This, too, is pretty straightforward since {dbplyr} makes it possible to just extend a parent set of category translators:

sql_translator(
  .parent = dbplyr::base_odbc_scalar,
  cast_as = function(x, y) dbplyr::build_sql("CAST(", x, " AS ", y, ")"),
  try_cast_as = function(x, y) dbplyr::build_sql("TRY_CAST(", x, " AS ", y, ")")
) -> athena_scalar

`sql_translation.Amazon Athena` <- function(x) {
  sql_variant(
    athena_scalar,
    dbplyr::base_odbc_agg,
    dbplyr::base_odbc_win
  )
}

Now, let’s see if it really knows about our new casting functions:

sql_translate_env(con)
## <sql_variant>
## scalar:    -, :, !, !=, (, [, [[, {, *, /, &, &&, %/%, %%, %>%,
## scalar:    %in%, ^, +, <, <=, ==, >, >=, |, ||, $, abs, acos,
## scalar:    as_date, as_datetime, as.character, as.Date,
## scalar:    as.double, as.integer, as.integer64, as.logical,
## scalar:    as.numeric, as.POSIXct, asin, atan, atan2, between,
## scalar:    bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR,
## scalar:    bitwXor, c, case_when, cast_as, ceil, ceiling,
## scalar:    coalesce, cos, cosh, cot, coth, day, desc, exp, floor,
## scalar:    hour, if, if_else, ifelse, is.na, is.null, log, log10,
## scalar:    mday, minute, month, na_if, nchar, now, paste, paste0,
## scalar:    pmax, pmin, qday, round, second, sign, sin, sinh, sql,
## scalar:    sqrt, str_c, str_conv, str_count, str_detect, str_dup,
## scalar:    str_extract, str_extract_all, str_flatten, str_glue,
## scalar:    str_glue_data, str_interp, str_length, str_locate,
## scalar:    str_locate_all, str_match, str_match_all, str_order,
## scalar:    str_pad, str_remove, str_remove_all, str_replace,
## scalar:    str_replace_all, str_replace_na, str_sort, str_split,
## scalar:    str_split_fixed, str_squish, str_sub, str_subset,
## scalar:    str_to_lower, str_to_title, str_to_upper, str_trim,
## scalar:    str_trunc, str_view, str_view_all, str_which,
## scalar:    str_wrap, substr, substring, switch, tan, tanh, today,
## scalar:    tolower, toupper, trimws, try_cast_as, wday, xor,
## scalar:    yday, year
## aggregate: cume_dist, cummax, cummean, cummin, cumsum,
## aggregate: dense_rank, first, lag, last, lead, max, mean, median,
## aggregate: min, min_rank, n, n_distinct, nth, ntile, order_by,
## aggregate: percent_rank, quantile, rank, row_number, sd, sum, var
## window:    cume_dist, cummax, cummean, cummin, cumsum,
## window:    dense_rank, first, lag, last, lead, max, mean, median,
## window:    min, min_rank, n, n_distinct, nth, ntile, order_by,
## window:    percent_rank, quantile, rank, row_number, sd, sum, var

Aye! Let’s test it out.

Unfortunately, this boring, default database has no MAP columns to really show this off, but we can convert a simple character column into JSON just to get the idea:

elb_logs %>% 
  select(backendip)
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    backendip      
##    <chr>          
##  1 249.6.80.219   
##  2 248.178.189.65 
##  3 254.70.228.23  
##  4 248.178.189.65 
##  5 252.0.81.65    
##  6 248.178.189.65 
##  7 245.241.133.121
##  8 244.202.183.67 
##  9 255.226.190.127
## 10 246.22.152.210 
## # … with more rows

elb_logs %>% 
  select(backendip) %>% 
  mutate(
    backendip = cast_as(backendip, JSON)
  )
## # Source:   lazy query [?? x 1]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    backendip            
##    <chr>                
##  1 "\"244.238.214.120\""
##  2 "\"248.99.214.228\"" 
##  3 "\"243.3.190.175\""  
##  4 "\"246.235.181.255\""
##  5 "\"241.112.203.216\""
##  6 "\"240.147.242.82\"" 
##  7 "\"248.99.214.228\"" 
##  8 "\"248.99.214.228\"" 
##  9 "\"253.161.243.121\""
## 10 "\"248.99.214.228\"" 
## # … with more rows

FIN

Despite the {tidyverse} documentation being written with care and clarity, this part of the R ecosystem is so extensive and evolving that watching out for all the doors and corners can be tricky. It’s easy for the short paragraph on the “untranslated function” capability to be overlooked and it may be hard to fully grok the translation concept without an IRL example.

Hopefully this helped (even if only a little) demystify these two areas of {d[b]plyr}.

(this is an unrolled Twitter thread converted to the blog since one never knows how long content will be preserved anywhere anymore)

It looks like @StackPath (NetCDN[.]com redirects to them) is enabling insurrection-mongers. They’re fronting news[.]parler[.]com .

It seems they (Parler) have a second domain dicecrm[.]com with the actual content, too.

dicecrm[.]com is hosted in @awscloud, so it looks like Parler folks are smarter than Bezos’ minions. Amazon might want to take this down before it gets going (again).

They load JS via @Google tag manager (you can see it in the HTML src). The GA_MEASUREMENT_ID is “G-P76KHELPLT”.

BGP Info for the IPs associated with the domain

In site source screenshot in the first tweet there’s a reference to twexit[.]com. DNS for it shows they also have leftwexit[.]com (which is a very odd site).

"Twexit" is being enabled by @awscloud @GoDaddy and @WordPress/@automattic plus @StackPath.

While the main page has (unsurprisingly) busted HTML, they’re using their old sitemap[.]xml — https://carbon.now.sh/mdyJbvddCvZaGu2tOnD6 — which has a singular recent (whining) entry: http://dicecrm[.]com/updates/facebook-continues-their-confusing-hypocritical-stifling-of-free-speech-

Looks like @Shareaholic is also enabling Parler. Their “shareaholic:site_id” is “f7b53d75b2e7afdc512ea898bbbff585“.

shareaholic id capture

One of the CDN content refs is this (attached img). It’s loading content for Parler from free[ ]pressers[.]com, which is a pretty nutjob fake news site enabled by @IBMcloud (so IBM is enabling Parler as well). the free[ ]pressers Twitter is equally nutjob.

I suspect Parler is going to keep rejiggering this nutjob-fueled content network knowing that AWS, IBM (et al) won't play whack-a-mole and are rly just waiting for our collective memory and attention to fade so they can go back to making $ from divisiveness, greed, & hate.

protip: perhaps not spin up a new FQDN with such hastily-crafted garbage behind it when you know lots of very technically-resourced 👀 are on you.

Originally tweeted by (@hrbrmstr) on 2021-01-29.

The past two posts have (lightly) introduced how to use compiled Swift code in R, but they’ve involved a bunch of “scary” command line machinations and incantations.

One feature of {Rcpp} I’ve always 💙 is the cppFunction() (“r-lib” zealots have a similar cpp11::cpp_function()) which lets one experiment with C[++] code in R with as little friction as possible. To make it easier to start experimenting with Swift, I’ve built an extremely fragile swift_function() in {swiftr} that intends to replicate this functionality. Explaining it will be easier with an example.
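
For reference, here’s the kind of friction-free, define-compile-bind-in-one-call workflow cppFunction() enables (a trivial example of my own, not from the package docs):

Rcpp::cppFunction('
int add_one(int x) {
  return x + 1;
}
')

add_one(41)
## [1] 42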

Reading Property Lists in R With Swift

macOS relies heavily on property lists for many, many things. These can be plain text (XML) or binary files, and there are command-line tools and Python libraries (usable via {reticulate}) that can read them, along with the good ol’ XML::readKeyValueDB(). We’re going to create a Swift function to read property lists and return JSON which we can use back in R via {jsonlite}.
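For comparison, the pure-R route just mentioned looks something like this (XML::readKeyValueDB() handles the plain-text/XML flavor of plists; the RStudio path is just an example):

# read an application's (XML) property list with the venerable {XML} package
rstudio_info <- XML::readKeyValueDB("/Applications/RStudio.app/Contents/Info.plist")

str(rstudio_info, 1)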

This time around there’s no need to create extra files: just install {swiftr}, fire up your favorite R IDE, and enter the following (expository comes after the code):

library(swiftr)

swift_function(
  code = '

func ignored() {
  print("""
this will be ignored by swift_function() but you could use private
functions as helpers for the main public Swift function which will be 
made available to R.
""")
}  

@_cdecl ("read_plist")
public func read_plist(path: SEXP) -> SEXP {

  var out: SEXP = R_NilValue

  do {
    // read in the raw plist
    let plistRaw = try Data(contentsOf: URL(fileURLWithPath: String(cString: R_CHAR(STRING_ELT(path, 0)))))

    // convert it to a PropertyList  
    let plist = try PropertyListSerialization.propertyList(from: plistRaw, options: [], format: nil) as! [String:Any]

    // serialize it to JSON
    let jsonData = try JSONSerialization.data(withJSONObject: plist , options: .prettyPrinted)

    // setup the JSON string return
    String(data: jsonData, encoding: .utf8)?.withCString { 
      cstr in out = Rf_mkString(cstr) 
    }

  } catch {
    debugPrint("\\(error)")
  }

  return(out)

}
')

This new swift_function() — for the moment (the API is absolutely going to change) — is defined as:

swift_function(
  code,
  env = globalenv(),
  imports = c("Foundation"),
  cache_dir = tempdir()
)

where:

  • code is a length 1 character vector of Swift code
  • env is the environment to expose the function in (defaults to the global environment)
  • imports is a character vector of any extra Swift frameworks that need to be imported
  • cache_dir is where all the temporary files will be created and the compiled dylib will be stored. It defaults to a temporary directory, so specify your own directory (that exists) if you want to keep the files around after you close the R session (see the short sketch after this list)
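
As promised above, here’s a minimal sketch of those last two parameters in action. It assumes the Swift source from the example is stored in a plist_swift_code variable and that the generated function is assigned into the environment you hand it; the cache directory name is hypothetical and must already exist:

swift_env <- new.env()

dir.create("~/swift-build-cache", showWarnings = FALSE) # hypothetical directory

swift_function(
  code = plist_swift_code, # the Swift source from the example, as a length 1 character vector
  env = swift_env,         # expose read_plist() here instead of the global environment
  cache_dir = path.expand("~/swift-build-cache") # keep the generated files + dylib around
)

swift_env$read_plist("/Applications/RStudio.app/Contents/Info.plist")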

Folks familiar with cppFunction() will notice some (on-purpose) similarities.

The function expects you to expose only one public Swift function, which also (for the moment) needs to have the @_cdecl attribute before it. You can have as many other valid Swift helper functions as you like, but you are restricted to one function that will be turned into an R function automagically.

In this example, swift_function() will see public func read_plist(path: SEXP) -> SEXP { and be able to identify

  • the function name (read_plist)
  • the number of parameters (they all need to be SEXP, for now)
  • the names of the parameters

A complete source file with all the imports will be created, a pre-built bridging header (which comes along for the ride with {swiftr}) will be included in the compilation step, and a dylib will be built and loaded into the R session. Finally, an R function that wraps a .Call() will be created, with the same name as the Swift function and all of its parameter names (if any).

In the case of our example, above, the built R function is:

function(path) {
  .Call("read_plist", path)
}

There’s a good chance you’re using RStudio, so we can test this with its property list, or you can substitute any other application’s property list (or any .plist you have) to test this out:

read_plist("/Applications/RStudio.app/Contents/Info.plist") %>% 
  jsonlite::fromJSON() %>% 
  str(1)
## List of 32
##  $ NSPrincipalClass                     : chr "NSApplication"
##  $ NSCameraUsageDescription             : chr "R wants to access the camera."
##  $ CFBundleIdentifier                   : chr "org.rstudio.RStudio"
##  $ CFBundleShortVersionString           : chr "1.4.1093-1"
##  $ NSBluetoothPeripheralUsageDescription: chr "R wants to access bluetooth."
##  $ NSRemindersUsageDescription          : chr "R wants to access the reminders."
##  $ NSAppleEventsUsageDescription        : chr "R wants to run AppleScript."
##  $ NSHighResolutionCapable              : logi TRUE
##  $ LSRequiresCarbon                     : logi TRUE
##  $ NSPhotoLibraryUsageDescription       : chr "R wants to access the photo library."
##  $ CFBundleGetInfoString                : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ NSLocationWhenInUseUsageDescription  : chr "R wants to access location information."
##  $ CFBundleInfoDictionaryVersion        : chr "6.0"
##  $ NSSupportsAutomaticGraphicsSwitching : logi TRUE
##  $ CSResourcesFileMapped                : logi TRUE
##  $ CFBundleVersion                      : chr "1.4.1093-1"
##  $ OSAScriptingDefinition               : chr "RStudio.sdef"
##  $ CFBundleLongVersionString            : chr "1.4.1093-1"
##  $ CFBundlePackageType                  : chr "APPL"
##  $ NSContactsUsageDescription           : chr "R wants to access contacts."
##  $ NSCalendarsUsageDescription          : chr "R wants to access calendars."
##  $ NSMicrophoneUsageDescription         : chr "R wants to access the microphone."
##  $ CFBundleDocumentTypes                :'data.frame':  16 obs. of  8 variables:
##  $ NSPhotoLibraryAddUsageDescription    : chr "R wants write access to the photo library."
##  $ NSAppleScriptEnabled                 : logi TRUE
##  $ CFBundleExecutable                   : chr "RStudio"
##  $ CFBundleSignature                    : chr "Rstd"
##  $ NSHumanReadableCopyright             : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ CFBundleName                         : chr "RStudio"
##  $ LSApplicationCategoryType            : chr "public.app-category.developer-tools"
##  $ CFBundleIconFile                     : chr "RStudio.icns"
##  $ CFBundleDevelopmentRegion            : chr "English"

FIN

A source_swift() function is on the horizon, as is adding a ton of checks/validations to swift_function(). I’ll likely also be adding some of the SEXP and R Swift utility functions I’ve demonstrated in the [unfinished] book to make it fairly painless to interface Swift and R code via this new and forthcoming function.

As usual, kick the tyres, submit feature requests and bugs in any forum that’s comfortable, and stay strong, wear a 😷, and stay socially distanced when out and about.

The previous post introduced the topic of how to compile Swift code for use in R using a useless, toy example. This one goes a bit further and makes a case for why one might want to do this by showing how to use one of Apple’s machine learning libraries, specifically the Natural Language one, focusing on extracting parts of speech from text.

I made a parts-of-speech directory to keep the code self-contained. In it are two files. The first is partsofspeech.swift (swiftc seems to dislike dashes in names of library code and I dislike underscores):

import NaturalLanguage
import CoreML

extension Array where Element == String {
  var SEXP: SEXP? {
    let charVec = Rf_protect(Rf_allocVector(SEXPTYPE(STRSXP), count))
    defer { Rf_unprotect(1) }
    for (idx, elem) in enumerated() { SET_STRING_ELT(charVec, idx, Rf_mkChar(elem)) }
    return(charVec)
  }
}

@_cdecl ("part_of_speech")
public func part_of_speech(_ x: SEXP) -> SEXP {

  let text = String(cString: R_CHAR(STRING_ELT(x, 0)))
  let tagger = NLTagger(tagSchemes: [.lexicalClass])

  tagger.string = text

  let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

  var txts = [String]()
  var tags = [String]()

  tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
    if let tag = tag {
      txts.append("\(text[tokenRange])")
      tags.append("\(tag.rawValue)")
    }
    return true
  }

  let out = Rf_protect(Rf_allocVector(SEXPTYPE(VECSXP), 2))
  SET_VECTOR_ELT(out, 0, txts.SEXP)
  SET_VECTOR_ELT(out, 1, tags.SEXP)
  Rf_unprotect(1)

  return(out!)
}

The other is bridge code that seems to be the same for every one of these (or could be), so I’ve just named it swift-r-glue.h (it’s the same as the bridge code in the previous post):

#define USE_RINTERNALS

#include <R.h>
#include <Rinternals.h>

const char* R_CHAR(SEXP x);

Let’s walk through the Swift code.

We need two imports:

import NaturalLanguage
import CoreML

to make use of the NLP functionality provided by Apple.

The following extension on String Arrays:

extension Array where Element == String {
  var SEXP: SEXP? {
    let charVec = Rf_protect(Rf_allocVector(SEXPTYPE(STRSXP), count))
    defer { Rf_unprotect(1) }
    for (idx, elem) in enumerated() { SET_STRING_ELT(charVec, idx, Rf_mkChar(elem)) }
    return(charVec)
  }
}

will reduce the amount of code we need to type later on to turn Swift String Arrays into R character vectors.

The start of the function:

@_cdecl ("part_of_speech")
public func part_of_speech(_ x: SEXP) -> SEXP {

tells swiftc to make this a C-compatible call and notes that the function takes one parameter (in this case, it’s expecting a length 1 character vector) and returns an R-compatible value (which will be a list that we’ll turn into a data.frame in R just for brevity).

The following sets up our inputs and outputs:

  let text = String(cString: R_CHAR(STRING_ELT(x, 0)))
  let tagger = NLTagger(tagSchemes: [.lexicalClass])

  tagger.string = text

  let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

  var txts = [String]()
  var tags = [String]()

We convert the passed-in parameter to a Swift String, initialize the NLP tagger, and set up two arrays to hold the results (sentence components go in txts and the part of speech for each component goes in tags).

The following code is mostly straight from Apple and (inefficiently) populates the previous two arrays:

tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
  if let tag = tag {
    txts.append("\(text[tokenRange])")
    tags.append("\(tag.rawValue)")
  }
  return true
}

Finally, we use the Swift-R bridge to make a list much like one would in C:

let out = Rf_protect(Rf_allocVector(SEXPTYPE(VECSXP), 2))
SET_VECTOR_ELT(out, 0, txts.SEXP)
SET_VECTOR_ELT(out, 1, tags.SEXP)
Rf_unprotect(1)

return(out!)

To get a shared library we can use from R, we just need to compile this like last time:

swiftc \
  -I /Library/Frameworks/R.framework/Headers \
  -F/Library/Frameworks \
  -framework R \
  -import-objc-header swift-r-glue.h \
  -emit-library \
  partsofspeech.swift

Let’s run that on some text! First, we’ll load the new shared library into R:

dyn.load("libpartsofspeech.dylib")

Next, we’ll make a wrapper function to avoid messy .Call(…)s and to make a data.frame:

parts_of_speech <- function(x) {
  res <- .Call("part_of_speech", x)  
  as.data.frame(stats::setNames(res, c("name", "tag")))
}

Finally, let’s try this on some text!

tibble::as_tibble(
  parts_of_speech(paste0(c(
"The comm wasn't working. Feeling increasingly ridiculous, he pushed",
"the button for the 1MC channel several more times. Nothing. He opened",
"his eyes and saw that all the lights on the panel were out. Then he",
"turned around and saw that the lights on the refrigerator and the",
"ovens were out. It wasn’t just the coffeemaker; the entire galley was",
"in open revolt. Holden looked at the ship name, Rocinante, newly",
"stenciled onto the galley wall, and said, Baby, why do you hurt me",
"when I love you so much?"
  ), collapse = " "))
)
## # A tibble: 92 x 2
##    name         tag
##    <chr>        <chr>
##  1 The          Determiner
##  2 comm         Noun
##  3 was          Verb
##  4 n't          Adverb
##  5 working      Verb
##  6 Feeling      Verb
##  7 increasingly Adverb
##  8 ridiculous   Adjective
##  9 he           Pronoun
## 10 pushed       Verb
## # … with 82 more rows

FIN

If you’re playing along at home, try adding a function to this Swift file that uses Apple’s entity tagger.

The next installment of this topic will be how to wrap all this into a package (then all these examples get tweaked and go into the tome).