The last post showed how to work with the macOS mdls command line XML output, but with {swiftr} we can avoid the command line round trip by bridging the low-level Spotlight API (which mdls uses) directly in R via Swift.

If you’ve played with {swiftr} before but were annoyed by the boilerplate you had to drag along every time you used swift_function(), you’ll be pleased that I’ve added some SEXP conversion helpers to the package, so there’s less cruft.

Let’s add an R↔Swift bridge function to retrieve all available Spotlight attributes for a macOS file:

library(swiftr)

swift_function('

// Add an extension to URL which will retrieve the spotlight
// attributes as an array of Swift Strings
extension URL {

  var mdAttributes: [String]? {

    get {
      guard isFileURL else { return nil }
      let item = MDItemCreateWithURL(kCFAllocatorDefault, self as CFURL)
      let attrs = MDItemCopyAttributeNames(item)!
      return(attrs as? [String])
    }

  }

}

@_cdecl ("file_attrs")
public func file_attrs(path: SEXP) -> SEXP {

  // Grab the attributes
  let outAttr = URL(fileURLWithPath: String(path)!).mdAttributes!

  // send them to R
  return(outAttr.SEXP!)

}
')

And, then try it out:

fil <-  "/Applications/RStudio.app"

file_attrs(fil)
##  [1] "kMDItemContentTypeTree"                 "kMDItemContentType"                    
##  [3] "kMDItemPhysicalSize"                    "kMDItemCopyright"                      
##  [5] "kMDItemAppStoreCategory"                "kMDItemKind"                           
##  [7] "kMDItemDateAdded_Ranking"               "kMDItemDocumentIdentifier"             
##  [9] "kMDItemContentCreationDate"             "kMDItemAlternateNames"                 
## [11] "kMDItemContentModificationDate_Ranking" "kMDItemDateAdded"                      
## [13] "kMDItemContentCreationDate_Ranking"     "kMDItemContentModificationDate"        
## [15] "kMDItemExecutableArchitectures"         "kMDItemAppStoreCategoryType"           
## [17] "kMDItemVersion"                         "kMDItemCFBundleIdentifier"             
## [19] "kMDItemInterestingDate_Ranking"         "kMDItemDisplayName"                    
## [21] "_kMDItemDisplayNameWithExtensions"      "kMDItemLogicalSize"                    
## [23] "kMDItemUsedDates"                       "kMDItemLastUsedDate"                   
## [25] "kMDItemLastUsedDate_Ranking"            "kMDItemUseCount"                       
## [27] "kMDItemFSName"                          "kMDItemFSSize"                         
## [29] "kMDItemFSCreationDate"                  "kMDItemFSContentChangeDate"            
## [31] "kMDItemFSOwnerUserID"                   "kMDItemFSOwnerGroupID"                 
## [33] "kMDItemFSNodeCount"                     "kMDItemFSInvisible"                    
## [35] "kMDItemFSTypeCode"                      "kMDItemFSCreatorCode"                  
## [37] "kMDItemFSFinderFlags"                   "kMDItemFSHasCustomIcon"                
## [39] "kMDItemFSIsExtensionHidden"             "kMDItemFSIsStationery"                 
## [41] "kMDItemFSLabel"   

No system() (et al.) round trip!

Now, let’s make an R↔Swift bridge function to retrieve the value of an attribute.

Before we do that, let me be up-front that relying on debugDescription (which makes a string representation of a Swift object) is a terrible hack that I’m using just to make the example as short as possible. We should do far more error checking and then further check the type of the object coming from the Spotlight API call and return an R-compatible version of that type. This mdAttr() method will almost certainly break depending on the item being returned.

swift_function('
extension URL {

  // Add an extension to URL which will retrieve the spotlight 
  // attribute value as a String. This will almost certainly die 
  // under various value conditions.

  func mdAttr(_ attr: String) -> String? {
    guard isFileURL else { return nil }
    let item = MDItemCreateWithURL(kCFAllocatorDefault, self as CFURL)
    return(MDItemCopyAttribute(item, attr as CFString).debugDescription!)
  }

}

@_cdecl ("file_attr")
public func file_attr(path: SEXP, attr: SEXP) -> SEXP {

  // file path as Swift String
  let xPath = String(cString: R_CHAR(Rf_asChar(path)))

  // attribute we want as a Swift String
  let xAttr = String(cString: R_CHAR(Rf_asChar(attr)))

  // the Swift debug string value of the attribute
  let outAttr = URL(fileURLWithPath: xPath).mdAttr(xAttr)

  // returned as an R string
  return(Rf_mkString(outAttr))
}
')

And try this out on some carefully selected attributes:

file_attr(fil, "kMDItemDisplayName")
## [1] "RStudio.app"

file_attr(fil, "kMDItemAppStoreCategory")
## [1] "Developer Tools"

file_attr(fil, "kMDItemVersion")
## [1] "1.4.1651"

Note that if we try to get fancy and retrieve an attribute value that is something like an array of strings, it doesn’t work so well:

file_attr(fil, "kMDItemExecutableArchitectures")
## [1] "<__NSSingleObjectArrayI 0x7fe1f6d19bf0>(\nx86_64\n)\n"

Again, ideally, we’d make a small package wrapper vs use swift_function() for this in production, but I wanted to show how straightforward it can be to get access to some fun and potentially powerful features of macOS right in R with just a tiny bit of Swift glue code.

Also, I hadn’t tried {swiftr} on the M1 Mini before and it seems I need to poke a bit to see what needs doing to get it to work properly in the arm64 RStudio rsession.

UPDATE (2021-04-14 a bit later)

It dawned on me that a minor tweak to the Swift mdAttr() function would make the method more resilient (but still hacky):

  func mdAttr(_ attr: String) -> String {
    guard isFileURL else { return "" }
    let item = MDItemCreateWithURL(kCFAllocatorDefault, self as CFURL)
    let x = MDItemCopyAttribute(item, attr as CFString)
    if (x == nil) {
      return("")
    } else {
      return("\(x!)")
    }
  }

Now we can (more) safely do something like this:

str(as.list(sapply(
  file_attrs(fil),
  function(attr) {
    file_attr(fil, attr)
  }
)), 1)
## List of 41
##  $ kMDItemContentTypeTree                : chr "(\n    \"com.apple.application-bundle\",\n    \"com.apple.application\",\n    \"public.executable\",\n    \"com"| __truncated__
##  $ kMDItemContentType                    : chr "com.apple.application-bundle"
##  $ kMDItemPhysicalSize                   : chr "767619072"
##  $ kMDItemCopyright                      : chr "RStudio 1.4.1651, © 2009-2021 RStudio, PBC"
##  $ kMDItemAppStoreCategory               : chr "Developer Tools"
##  $ kMDItemKind                           : chr "Application"
##  $ kMDItemDateAdded_Ranking              : chr "2021-04-09 00:00:00 +0000"
##  $ kMDItemDocumentIdentifier             : chr "0"
##  $ kMDItemContentCreationDate            : chr "2021-03-25 23:08:34 +0000"
##  $ kMDItemAlternateNames                 : chr "(\n    \"RStudio.app\"\n)"
##  $ kMDItemContentModificationDate_Ranking: chr "2021-03-25 00:00:00 +0000"
##  $ kMDItemDateAdded                      : chr "2021-04-09 13:25:11 +0000"
##  $ kMDItemContentCreationDate_Ranking    : chr "2021-03-25 00:00:00 +0000"
##  $ kMDItemContentModificationDate        : chr "2021-03-25 23:08:34 +0000"
##  $ kMDItemExecutableArchitectures        : chr "(\n    \"x86_64\"\n)"
##  $ kMDItemAppStoreCategoryType           : chr "public.app-category.developer-tools"
##  $ kMDItemVersion                        : chr "1.4.1651"
##  $ kMDItemCFBundleIdentifier             : chr "org.rstudio.RStudio"
##  $ kMDItemInterestingDate_Ranking        : chr "2021-04-15 00:00:00 +0000"
##  $ kMDItemDisplayName                    : chr "RStudio.app"
##  $ _kMDItemDisplayNameWithExtensions     : chr "RStudio.app"
##  $ kMDItemLogicalSize                    : chr "763253198"
##  $ kMDItemUsedDates                      : chr "(\n    \"2021-03-26 04:00:00 +0000\",\n    \"2021-03-30 04:00:00 +0000\",\n    \"2021-04-02 04:00:00 +0000\",\n"| __truncated__
##  $ kMDItemLastUsedDate                   : chr "2021-04-15 00:21:45 +0000"
##  $ kMDItemLastUsedDate_Ranking           : chr "2021-04-15 00:00:00 +0000"
##  $ kMDItemUseCount                       : chr "12"
##  $ kMDItemFSName                         : chr "RStudio.app"
##  $ kMDItemFSSize                         : chr "763253198"
##  $ kMDItemFSCreationDate                 : chr "2021-03-25 23:08:34 +0000"
##  $ kMDItemFSContentChangeDate            : chr "2021-03-25 23:08:34 +0000"
##  $ kMDItemFSOwnerUserID                  : chr "501"
##  $ kMDItemFSOwnerGroupID                 : chr "80"
##  $ kMDItemFSNodeCount                    : chr "1"
##  $ kMDItemFSInvisible                    : chr "0"
##  $ kMDItemFSTypeCode                     : chr "0"
##  $ kMDItemFSCreatorCode                  : chr "0"
##  $ kMDItemFSFinderFlags                  : chr "0"
##  $ kMDItemFSHasCustomIcon                : chr ""
##  $ kMDItemFSIsExtensionHidden            : chr "1"
##  $ kMDItemFSIsStationery                 : chr ""
##  $ kMDItemFSLabel                        : chr "0"

We’re still better off (in the long run) checking for and using proper types.
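Until proper type handling exists on the Swift side, the array-like strings in that output can at least be unpacked in R. This is only a minimal sketch: parse_md_array() is a hypothetical helper (not part of {swiftr}), and it assumes no value contains an embedded double quote.

```r
# Hypothetical helper: split the printed form of a Spotlight array,
# e.g. "(\n    \"x86_64\"\n)", into an R character vector.
# Assumes values never contain an embedded double quote.
parse_md_array <- function(x) {
  quoted <- regmatches(x, gregexpr('"[^"]*"', x))[[1]]  # every "..." run
  gsub('^"|"$', "", quoted)                             # strip the quotes
}

arch <- parse_md_array("(\n    \"x86_64\"\n)")
type_tree <- parse_md_array(
  "(\n    \"com.apple.application-bundle\",\n    \"com.apple.application\"\n)"
)
```

Scalar values such as "12" contain no quoted runs, so the helper returns an empty character vector for them; callers would need to keep the original string in that case.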

FIN

I hope to be able to carve out some more time in the not-too-distant-future for both {swiftr} and the in-progress guide on using Swift and R, but hopefully this post [re-]piqued interest in this topic for some R and/or Swift users.

(reminder: Quick Hits have minimal explanatory blathering, but I can elaborate on anything if folks submit a comment).

I’m playing around with Screen Time on xOS again and noticed mdls (the macOS command line utility for getting file metadata) has a -plist option (it probably has for a while & I just never noticed it). I further noticed there’s a kMDItemExecutableArchitectures key (which may also have been “a thing” for a while). Application metadata would be handy for the utility functions I’m putting together for Rmd-based Screen Time reports, so I threw together some quick code to show how to work with it in R.

Running mdls -plist /some/file.plist ...path-to-apps... will generate a giant property list file with all metadata for all the apps specified. It’s a wicked fast command even when grabbing and outputting metadata for all apps on a system.

Each entry looks like this:

<dict>
    <key>_kMDItemDisplayNameWithExtensions</key>
    <string>RStudio — tycho.app</string>
    <key>kMDItemAlternateNames</key>
    <array>
      <string>RStudio — tycho.app</string>
    </array>
    <key>kMDItemCFBundleIdentifier</key>
    <string>com.RStudio_—_tycho</string>
    <key>kMDItemContentCreationDate</key>
    <date>2021-01-31T17:56:46Z</date>
    <key>kMDItemContentCreationDate_Ranking</key>
    <date>2021-01-31T00:00:00Z</date>
    <key>kMDItemContentModificationDate</key>
    <date>2021-01-31T17:56:46Z</date>
    <key>kMDItemContentModificationDate_Ranking</key>
    <date>2021-01-31T00:00:00Z</date>
    <key>kMDItemContentType</key>
    <string>com.apple.application-bundle</string>
    <key>kMDItemContentTypeTree</key>
    <array>
      <string>com.apple.application-bundle</string>
      <string>com.apple.application</string>
      <string>public.executable</string>
      <string>com.apple.localizable-name-bundle</string>
      <string>com.apple.bundle</string>
      <string>public.directory</string>
      <string>public.item</string>
      <string>com.apple.package</string>
    </array>
    <key>kMDItemCopyright</key>
    <string>Copyright © 2017-2020 BZG Inc. All rights reserved.</string>
    <key>kMDItemDateAdded</key>
    <date>2021-04-09T18:29:52Z</date>
    <key>kMDItemDateAdded_Ranking</key>
    <date>2021-04-09T00:00:00Z</date>
    <key>kMDItemDisplayName</key>
    <string>RStudio — tycho.app</string>
    <key>kMDItemDocumentIdentifier</key>
    <integer>0</integer>
    <key>kMDItemExecutableArchitectures</key>
    <array>
      <string>x86_64</string>
    </array>
    <key>kMDItemFSContentChangeDate</key>
    <date>2021-01-31T17:56:46Z</date>
    <key>kMDItemFSCreationDate</key>
    <date>2021-01-31T17:56:46Z</date>
    <key>kMDItemFSCreatorCode</key>
    <integer>0</integer>
    <key>kMDItemFSFinderFlags</key>
    <integer>0</integer>
    <key>kMDItemFSInvisible</key>
    <false/>
    <key>kMDItemFSIsExtensionHidden</key>
    <true/>
    <key>kMDItemFSLabel</key>
    <integer>0</integer>
    <key>kMDItemFSName</key>
    <string>RStudio — tycho.app</string>
    <key>kMDItemFSNodeCount</key>
    <integer>1</integer>
    <key>kMDItemFSOwnerGroupID</key>
    <integer>20</integer>
    <key>kMDItemFSOwnerUserID</key>
    <integer>501</integer>
    <key>kMDItemFSSize</key>
    <integer>37451395</integer>
    <key>kMDItemFSTypeCode</key>
    <integer>0</integer>
    <key>kMDItemInterestingDate_Ranking</key>
    <date>2021-04-13T00:00:00Z</date>
    <key>kMDItemKind</key>
    <string>Application</string>
    <key>kMDItemLastUsedDate</key>
    <date>2021-04-13T12:47:12Z</date>
    <key>kMDItemLastUsedDate_Ranking</key>
    <date>2021-04-13T00:00:00Z</date>
    <key>kMDItemLogicalSize</key>
    <integer>37451395</integer>
    <key>kMDItemPhysicalSize</key>
    <integer>38092800</integer>
    <key>kMDItemUseCount</key>
    <integer>20</integer>
    <key>kMDItemUsedDates</key>
    <array>
      <date>2021-03-15T04:00:00Z</date>
      <date>2021-03-17T04:00:00Z</date>
      <date>2021-03-18T04:00:00Z</date>
      <date>2021-03-19T04:00:00Z</date>
      <date>2021-03-22T04:00:00Z</date>
      <date>2021-03-25T04:00:00Z</date>
      <date>2021-03-30T04:00:00Z</date>
      <date>2021-04-01T04:00:00Z</date>
      <date>2021-04-03T04:00:00Z</date>
      <date>2021-04-05T04:00:00Z</date>
      <date>2021-04-07T04:00:00Z</date>
      <date>2021-04-08T04:00:00Z</date>
      <date>2021-04-12T04:00:00Z</date>
      <date>2021-04-13T04:00:00Z</date>
    </array>
    <key>kMDItemVersion</key>
    <string>4.0.1</string>
  </dict>

We can get all the metadata for all installed apps in R via:

library(sys)
library(xml2)
library(tidyverse)

# get full paths to all the apps
list.files(
  c("/Applications", "/System/Library/CoreServices", "/Applications/Utilities", "/System/Applications"), 
  pattern = "\\.app$", 
  full.names = TRUE
) -> apps

# generate a giant property list with all the app attributes
tf <- tempfile(fileext = ".plist")
sys::exec_internal("mdls", c("-plist", tf, apps))

Unfortunately, some companies (COUGH Logitech COUGH) stick illegal characters in some values (a raw 0x03 control byte in this case), so we have to take care of those (I used xmllint to see which one(s) were bad):

# read it in and clean up CDATA error (Logitech has a bad value in one field)
fil <- readr::read_file_raw(tf)
fil[fil == as.raw(0x03)] <- charToRaw(" ")
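The single-byte fix above handles the one bad value I hit. A slightly more general sketch follows, assuming you want to blank out every control byte that XML 1.0 forbids (everything below 0x20 except tab, newline, and carriage return). scrub_xml_controls() is a made-up name and the sample bytes are invented for illustration.

```r
# Hypothetical helper: replace XML 1.0-illegal control bytes with spaces.
# Tab (0x09), newline (0x0A) and carriage return (0x0D) are legal and kept.
scrub_xml_controls <- function(bytes) {
  illegal <- as.raw(c(0x00:0x08, 0x0B, 0x0C, 0x0E:0x1F))
  bytes[bytes %in% illegal] <- charToRaw(" ")
  bytes
}

# Invented sample: a <string> value with a stray 0x03 in the middle
dirty <- c(charToRaw("<string>bad"), as.raw(0x03), charToRaw("value</string>"))
clean <- scrub_xml_controls(dirty)
cleaned_text <- rawToChar(clean)
```

Because UTF-8 multi-byte sequences only use bytes at or above 0x80, working at the raw-byte level like this should leave legitimate non-ASCII text alone.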

Now, we can read in the XML without errors:

# now parse it and get the top of each app entry
applist <- xml2::read_xml(fil)
(applist <- xml_find_all(applist, "//array/dict"))
## {xml_nodeset (196)}
##  [1] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>1Blocker (Old).app</string>\n  <key>kMDItemAlternateNames</key>\n ...
##  [2] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>1Password 7.app</string>\n  <key>_kMDItemEngagementData</key>\n   ...
##  [3] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Adblock Plus.app</string>\n  <key>kMDItemAlternateNames</key>\n   ...
##  [4] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>AdBlock.app</string>\n  <key>kMDItemAlternateNames</key>\n  <arra ...
##  [5] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>AdGuard for Safari.app</string>\n  <key>kMDItemAlternateNames</ke ...
##  [6] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Agenda.app</string>\n  <key>kMDItemAlternateNames</key>\n  <array ...
##  [7] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Alfred 4.app</string>\n  <key>kMDItemAlternateNames</key>\n  <arr ...
##  [8] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Android File Transfer.app</string>\n  <key>kMDItemAlternateNames< ...
##  [9] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Asset Catalog Creator Pro.app</string>\n  <key>kMDItemAlternateNa ...
## [10] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Awsaml.app</string>\n  <key>kMDItemAlternateNames</key>\n  <array ...
## [11] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Boop.app</string>\n  <key>kMDItemAlternateNames</key>\n  <array>\ ...
## [12] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Buffer.app</string>\n  <key>kMDItemAlternateNames</key>\n  <array ...
## [13] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Burp Suite Community Edition.app</string>\n  <key>kMDItemAlternat ...
## [14] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Camera Settings.app</string>\n  <key>kMDItemAlternateNames</key>\ ...
## [15] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Cisco Webex Meetings.app</string>\n  <key>kMDItemAlternateNames</ ...
## [16] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Claquette.app</string>\n  <key>kMDItemAlternateNames</key>\n  <ar ...
## [17] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Discord.app</string>\n  <key>kMDItemAlternateNames</key>\n  <arra ...
## [18] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Elgato Control Center.app</string>\n  <key>kMDItemAlternateNames< ...
## [19] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>F5 Weather.app</string>\n  <key>kMDItemAlternateNames</key>\n  <a ...
## [20] <dict>\n  <key>_kMDItemDisplayNameWithExtensions</key>\n  <string>Fantastical.app</string>\n  <key>kMDItemAlternateNames</key>\n  < ...
## ...

I really dislike property lists as I’m not a fan of position-dependent records in XML files. To get values for keys, we have to find the key, then go to the next sibling, figure out its type, and handle it accordingly. This is a verbose enough process to warrant creating a small helper function:

# helper function to retrieve the values for a given key
kval <- function(doc, key) {

  val <- xml_find_first(doc, sprintf(".//key[contains(., '%s')]/following-sibling::*", key))

  switch(
    unique(na.omit(xml_name(val))),
    "array" = as_list(val) |> map(unlist, use.names = FALSE) |> map(unique),
    "integer" = xml_integer(val),
    "true" = TRUE,
    "false" = FALSE,
    "string" = xml_text(val, trim = TRUE)
  )

}

This is nowhere near as robust as XML::readKeyValueDB() but it doesn’t have to be for this particular use case.
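To make the "find the key, then look at the next sibling" idea concrete, here is a small sketch on a made-up plist fragment (Example.app and its values are invented). It differs from kval() in two assumed-safer ways: it matches the key text exactly rather than with contains(), and it takes only the element immediately after the key. value_node() is a hypothetical name.

```r
library(xml2)

# A tiny, invented plist fragment in the same shape as the mdls output
plist <- read_xml(paste0(
  "<plist><array><dict>",
  "<key>kMDItemDisplayName</key><string>Example.app</string>",
  "<key>kMDItemUseCount</key><integer>12</integer>",
  "<key>kMDItemFSInvisible</key><false/>",
  "</dict></array></plist>"
))

# Exact key match, then only the element immediately after that key
value_node <- function(doc, key) {
  xml_find_first(doc, sprintf(".//key[. = '%s']/following-sibling::*[1]", key))
}

use_count_type <- xml_name(value_node(plist, "kMDItemUseCount"))
use_count      <- xml_integer(value_node(plist, "kMDItemUseCount"))
invisible_type <- xml_name(value_node(plist, "kMDItemFSInvisible"))
```

The element name (integer, string, true, false, array, ...) is what tells you how to convert the value, which is why kval() switches on xml_name().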

We can build up a data frame with certain fields (I wanted to know how many apps still aren’t Universal):

tibble(
  category = kval(applist, "kMDItemAppStoreCategory"),
  bundle_id = kval(applist, "kMDItemCFBundleIdentifier"),
  display_name = kval(applist, "kMDItemDisplayName"),
  arch = kval(applist, "kMDItemExecutableArchitectures"),
) |> 
  print() -> app_info
## # A tibble: 196 x 4
##    category        bundle_id                            display_name                  arch     
##    <chr>           <chr>                                <chr>                         <list>   
##  1 Productivity    com.khanov.BlockerMac                1Blocker (Old).app            <chr [2]>
##  2 Productivity    com.agilebits.onepassword7           1Password 7.app               <chr [2]>
##  3 Productivity    org.adblockplus.adblockplussafarimac Adblock Plus.app              <chr [2]>
##  4 Productivity    com.betafish.adblock-mac             AdBlock.app                   <chr [1]>
##  5 Utilities       com.adguard.safari.AdGuard           AdGuard for Safari.app        <chr [1]>
##  6 Productivity    com.momenta.agenda.macos             Agenda.app                    <chr [2]>
##  7 Productivity    com.runningwithcrayons.Alfred        Alfred 4.app                  <chr [2]>
##  8 NA              com.google.android.mtpviewer         Android File Transfer.app     <chr [1]>
##  9 Developer Tools com.bridgetech.asset-catalog         Asset Catalog Creator Pro.app <chr [2]>
## 10 Developer Tools com.rapid7.awsaml                    Awsaml.app                    <chr [1]>
## # … with 186 more rows

Finally, we can expand the arch column and see how many apps support Apple Silicon:

app_info |> 
  unnest(arch) |> 
  spread(arch, arch) |> 
  mutate_at(
    vars(arm64, x86_64),
    ~!is.na(.x)
  ) |> 
  count(arm64)
## # A tibble: 2 x 2
##   arm64     n
##   <lgl> <int>
## 1 FALSE    33
## 2 TRUE    163

Alas, there are still some stragglers stuck in Rosetta 2.
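If all you need is the arm64 count, the list column can also be checked directly without reshaping. A minimal base R sketch on an invented arch list (three toy entries, not the real app data):

```r
# Invented stand-in for app_info$arch: one character vector per app
arch <- list(
  c("x86_64", "arm64"),
  "x86_64",
  c("arm64", "x86_64")
)

# TRUE when an app ships an arm64 slice
has_arm64 <- vapply(arch, function(a) "arm64" %in% a, logical(1))
arm64_counts <- table(has_arm64)
```

This swaps the unnest()/spread() approach for vapply(); it gives the same TRUE/FALSE split but does not produce the per-architecture columns.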

FIN

Drop comments if anything requires more blathering and have some fun with your macOS filesystem!

Guillaume Pressiat (@GuillaumePressiat) did a solid post & video on using Selenium to scrape a paginated table from understat[.]com/league/EPL/2020 (I just cannot bring myself to provide an active link to any SportsBall site). He does a great job walking folks through acquiring & orchestrating the heavy dependency that is Selenium.

A few weeks back I did a quick “look at browser Developer Tools” tweet that included the entire code for retrieving the Forbes billionaires list via the JSON file the Forbes site loads via an XHR request. It was a response to a similar, fine article by another R user on using Selenium to do the same thing.

If you find yourself thwarted by rvest::read_html() not returning “nodes that are clearly there” it is likely due to the page rendering nodes dynamically via javascript. Selenium orchestrates full or headless browsers and lets you scrape the dynamically rendered DOM. You can see this yourself if you first view the source of an HTML page (via the browser’s “view source” menu) and then use Developer Tools to inspect the browser session. The “view source” view (in Blink-based browsers, at least) will be the raw, unrendered source HTML from the site and the DevTools “Elements” tab will have the rendered DOM elements.
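The difference between raw source and rendered DOM is easy to reproduce offline. Here is a minimal sketch with a made-up HTML snippet (the league-table id and the teamsData variable are stand-ins I picked for illustration): the container a script would populate is empty in the raw source, while the data sits in a <script> tag.

```r
library(xml2)

# Invented page source: an empty container plus data embedded in a script
raw_source <- paste0(
  '<html><body>',
  '<div id="league-table"></div>',
  '<script>var teamsData = JSON.parse("[]");</script>',
  '</body></html>'
)

pg_static <- read_html(raw_source)

# Nothing has been rendered into the container, so there is nothing to scrape
n_rendered <- length(xml_find_all(pg_static, ".//div[@id='league-table']/*"))

# ...but the script that would fill it is right there in the source
script_txt <- xml_text(xml_find_first(pg_static, ".//script[contains(., 'JSON.parse')]"))
```

When a static parse shows an empty container like this, checking the <script> tags and the DevTools network requests is a cheap step before reaching for a browser driver.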

The “Network” tab of DevTools has an “XHR” tab of its own, but if you try to use it on this SportsBall site to see the JSON it loads, you’ll be bitterly disappointed. While the site does indeed render JSON into HTML DOM nodes dynamically, that JSON is embedded in the web page itself:

[Image: SportsBall DOM node with embedded JSON]

We can work in two different ways without the use of Selenium.

First, we’ll “cheat” and use the {V8} package, which is an R interface to a javascript virtual machine, the type of which browsers use to run javascript on web pages. I say “cheat” because we’re still depending on a chunk of a browser engine.

Let’s get some boilerplate out of the way:

library(V8)        # V8 engine
library(rvest)     # Scraping
library(stringi)   # String manipulation which we'll use later
library(tidyverse) # Duh

ctx <- v8() # create a new instance of the javascript VM

pg <- read_html("https://understat.com/league/EPL/2020") # read sportsball page

If you examine the SportsBall page you’ll see JSON.parse in a few different locations. Let’s target them all:

html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]")
## {xml_nodeset (4)}
## [1] <script>\n\tvar datesData \t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x2214086 ...
## [2] <script>\n\tvar teamsData = JSON.parse('\\x7B\\x2271\\x22\\x3A\\x7B\\x22id\\x22 ...
## [3] <script>\n\tvar playersData\t= JSON.parse('\\x5B\\x7B\\x22id\\x22\\x3A\\x22647\ ...
## [4] <script>\n\t\tWebFont.load({\n\t\t\tgoogle: {\n\t\t\t\tfamilies: ['Barlow:500', ...

We don’t need that last one, so the first three contain all the data we need.

Turning that into data is pretty straightforward work:

html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]") %>% 
  .[1:3] %>%         # only want the first three nodes
  html_text() %>%    # turn the nodes into text
  walk(ctx$eval)     # tell V8 to evaluate the javascript

The VM we created now has those three variables:

ctx$get(JS("Object.keys(global)"))
## [1] "print"       "console"     "global"      "datesData"   "_week"      
## [6] "_year"       "teamsData"   "playersData"

and, we can retrieve them like this:

as_tibble(ctx$get("datesData"))
## # A tibble: 380 x 8
##    id    isResult h$id  $title $short_title a$id  $title $short_title goals$h $a   
##    <chr> <lgl>    <chr> <chr>  <chr>        <chr> <chr>  <chr>        <chr>   <chr>
##  1 14086 TRUE     228   Fulham FLH          83    Arsen… ARS          0       3    
##  2 14087 TRUE     78    Cryst… CRY          74    South… SOU          1       0    
##  3 14090 TRUE     87    Liver… LIV          245   Leeds  LED          4       3    
##  4 14091 TRUE     81    West … WHU          86    Newca… NEW          0       2    
##  5 14092 TRUE     76    West … WBA          75    Leice… LEI          0       3    
##  6 14093 TRUE     82    Totte… TOT          72    Evert… EVE          0       1    
##  7 14094 TRUE     238   Sheff… SHE          229   Wolve… WOL          0       2    
##  8 14095 TRUE     220   Brigh… BRI          80    Chels… CHE          1       3    
##  9 14096 TRUE     72    Evert… EVE          76    West … WBA          5       2    
## 10 14097 TRUE     245   Leeds  LED          228   Fulham FLH          4       3    
## # … with 370 more rows, and 6 more variables: xG$h <chr>, $a <chr>, datetime <chr>,
## #   forecast$w <chr>, $d <chr>, $l <chr>

Note that we need to do some extra processing of the second one to make it a bit tidier:

ctx$get("teamsData") %>% 
  map_df(~{
    .x$history$id <- .x$id
    .x$history$title <- .x$title
    .x$history
  }) %>% 
  as_tibble()
## # A tibble: 616 x 21
##    h_a      xG   xGA  npxG  npxGA ppda$att  $def ppda_allowed$att  $def  deep
##    <chr> <dbl> <dbl> <dbl>  <dbl>    <int> <int>            <int> <int> <int>
##  1 h     0.805 0.850 0.805 0.0885       89    20              247    14    17
##  2 a     2.03  0.535 2.03  0.535       307    33              143    24    10
##  3 h     3.08  1.66  3.08  1.66        365    25              119    25     7
##  4 a     0.874 0.672 0.874 0.672       212    23              210    24     7
##  5 h     1.50  2.38  1.50  2.38        225    17              124    34     7
##  6 h     2.45  1.00  1.69  1.00        161    23              164    22     5
##  7 a     1.99  1.39  1.99  1.39        331    24              169    15    16
##  8 h     1.77  1.50  1.77  1.50        257    14              208    17     6
##  9 a     2.39  0.572 1.63  0.572       144    11              289    23     8
## 10 a     1.27  1.14  0.508 1.14        162    28              166    20     5
## # … with 606 more rows, and 13 more variables: deep_allowed <int>, scored <int>,
## #   missed <int>, xpts <dbl>, result <chr>, date <chr>, wins <int>, draws <int>,
## #   loses <int>, pts <int>, npxGD <dbl>, id <chr>, title <chr>
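The reshaping idea is simply "copy each team's id and title onto its history rows, then stack the rows". A base R sketch of that step on an invented two-team list (Team A and Team B and their numbers are made up, and rbind() is swapped in for map_df()):

```r
# Invented stand-in for ctx$get("teamsData"): one entry per team, each
# holding an id, a title, and a per-match history data frame
teams <- list(
  "1" = list(id = "1", title = "Team A",
             history = data.frame(h_a = c("h", "a"), xG = c(1.2, 0.8))),
  "2" = list(id = "2", title = "Team B",
             history = data.frame(h_a = "a", xG = 2.1))
)

# Tag every history row with its team, then stack all the rows
flat <- do.call(rbind, lapply(teams, function(team) {
  hist <- team$history
  hist$id <- team$id
  hist$title <- team$title
  hist
}))
rownames(flat) <- NULL
```

The result has one row per match with the team identifiers repeated, which is the same long shape the map_df() version produces.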

The last one does not need any extra help:

as_tibble(ctx$get("playersData"))
## # A tibble: 505 x 18
##    id    player_name games time  goals xG    assists xA    shots key_passes
##    <chr> <chr>       <chr> <chr> <chr> <chr> <chr>   <chr> <chr> <chr>     
##  1 647   Harry Kane  29    2557  19    17.6… 13      6.73… 113   39        
##  2 1250  Mohamed Sa… 30    2529  19    16.1… 3       4.55… 99    40        
##  3 1228  Bruno Fern… 31    2659  16    13.4… 11      10.8… 95    87        
##  4 453   Son Heung-… 30    2509  14    9.35… 9       8.03… 55    56        
##  5 822   Patrick Ba… 31    2572  14    14.8… 7       3.44… 93    24        
##  6 5555  Dominic Ca… 26    2248  14    15.7… 0       0.95… 65    13        
##  7 3277  Alexandre … 27    1818  13    11.8… 2       1.77… 43    21        
##  8 314   Ilkay Günd… 24    1776  12    8.64… 1       3.29… 45    35        
##  9 755   Jamie Vardy 27    2230  12    16.0… 7       4.26… 64    22        
## 10 8865  Ollie Watk… 30    2700  12    13.7… 3       4.15… 81    36        
## # … with 495 more rows, and 8 more variables: yellow_cards <chr>, red_cards <chr>,
## #   position <chr>, team_title <chr>, npg <chr>, npxG <chr>, xGChain <chr>,
## #   xGBuildup <chr>

We don’t really need {V8} for this, though, if we’re willing to use some regular expressions. We have to be a bit careful since some extra, non-JSON data comes along for the ride with that first <script> tag (see the embedded image above).

We perform the same initial setup (get the text of the first three <script> tags), then we erase everything that isn’t JSON data (so all the var and javascript punctuation). By using comments = TRUE in the call to stri_replace_all_regex we can provide documentation along with the (ugly) regex.
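Here is the commented-regex idea in isolation, as a minimal sketch on a single invented line of javascript. The pattern is a simplified variant of the one used in the full pipeline, so treat it as an illustration rather than a drop-in replacement.

```r
library(stringi)

# Invented one-line input in the same shape as the page's script text
js <- "var datesData = JSON.parse('\\x5B\\x5D');"

# With comments = TRUE, whitespace in the pattern is ignored and
# everything after a # is treated as documentation
payload <- stri_replace_all_regex(js, "
^[^\\(]+\\('            # drop everything up to and including the first ('
|                       # OR
'\\)[;,][[:space:]]*$   # drop the closing ') and any trailing punctuation
", "", comments = TRUE)
```

What is left in payload is just the hex-escaped string that was handed to JSON.parse().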

The creator of the SportsBall site did some encoding to make the string easier to shove into a <script> tag, so we need to undo that by converting the hex escapes to URL percent-escapes (replace \x with %) and then decoding them with curl::curl_unescape().
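The decode step can be seen on its own with a short escaped string. This sketch uses base R's utils::URLdecode() in place of curl::curl_unescape() so it has no dependencies, and the input is a hand-built fragment shaped like the start of the page's data rather than a copy of it.

```r
# Hand-built fragment: the hex-escaped form of {"id":"14086"}
escaped <- "\\x7B\\x22id\\x22\\x3A\\x2214086\\x22\\x7D"

# Turn \x7B into %7B so it looks like URL percent-encoding...
percent <- gsub("\\x", "%", escaped, fixed = TRUE)

# ...then let a URL decoder do the actual work
decoded <- utils::URLdecode(percent)
```

After this, decoded holds plain JSON text that jsonlite::fromJSON() can parse.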

We could have made the regular expression uglier to avoid the other javascript cruft in the first <script> tag, but it’s just as easy to split them all into lines and pull the first line out.

Then, it’s just a matter of running each one through jsonlite::fromJSON(). I kept it a list and just set the names as the names of the variables above.

html_nodes(pg, xpath = ".//script[contains(., 'JSON.parse')]") %>% 
  .[1:3] %>% 
  html_text() %>% 
  stri_replace_all_regex("
^[^\\(]+\\('            # remove everything from the beginning of the line to the first ('
|                       # OR
'\\)[;,][[:space:]]*    # remove the last ') and everything after it
$
", "", comments = TRUE, multiline = TRUE) %>% 
  stri_replace_all_fixed("\\x", "%") %>% 
  curl::curl_unescape() %>% 
  stri_split_lines() %>% 
  map_chr(1) %>% 
  map(jsonlite::fromJSON) %>%
  map(as_tibble) %>% 
  set_names(c("datesData", "teamsData", "playersData")) %>% 
  str(3)
## List of 3
##  $ datesData  : tibble [380 × 8] (S3: tbl_df/tbl/data.frame)
##   ..$ id      : chr [1:380] "14086" "14087" "14090" "14091" ...
##   ..$ isResult: logi [1:380] TRUE TRUE TRUE TRUE TRUE TRUE ...
##   ..$ h       :'data.frame':  380 obs. of  3 variables:
##   ..$ a       :'data.frame':  380 obs. of  3 variables:
##   ..$ goals   :'data.frame':  380 obs. of  2 variables:
##   ..$ xG      :'data.frame':  380 obs. of  2 variables:
##   ..$ datetime: chr [1:380] "2020-09-12 11:30:00" "2020-09-12 14:00:00" "2020-09-12 16:30:00" "2020-09-12 19:00:00" ...
##   ..$ forecast:'data.frame':  380 obs. of  3 variables:
##  $ teamsData  : tibble [3 × 20] (S3: tbl_df/tbl/data.frame)
##   ..$ 71 :List of 3
##   ..$ 72 :List of 3
##   ..$ 74 :List of 3
##   ..$ 75 :List of 3
##   ..$ 76 :List of 3
##   ..$ 78 :List of 3
##   ..$ 80 :List of 3
##   ..$ 81 :List of 3
##   ..$ 82 :List of 3
##   ..$ 83 :List of 3
##   ..$ 86 :List of 3
##   ..$ 87 :List of 3
##   ..$ 88 :List of 3
##   ..$ 89 :List of 3
##   ..$ 92 :List of 3
##   ..$ 220:List of 3
##   ..$ 228:List of 3
##   ..$ 229:List of 3
##   ..$ 238:List of 3
##   ..$ 245:List of 3
##  $ playersData: tibble [505 × 18] (S3: tbl_df/tbl/data.frame)
##   ..$ id          : chr [1:505] "647" "1250" "1228" "453" ...
##   ..$ player_name : chr [1:505] "Harry Kane" "Mohamed Salah" "Bruno Fernandes" "Son Heung-Min" ...
##   ..$ games       : chr [1:505] "29" "30" "31" "30" ...
##   ..$ time        : chr [1:505] "2557" "2529" "2659" "2509" ...
##   ..$ goals       : chr [1:505] "19" "19" "16" "14" ...
##   ..$ xG          : chr [1:505] "17.650331255048513" "16.19410896115005" "13.438796618022025" "9.352356541901827" ...
##   ..$ assists     : chr [1:505] "13" "3" "11" "9" ...
##   ..$ xA          : chr [1:505] "6.7384555246680975" "4.557050030678511" "10.812157344073057" "8.036493374034762" ...
##   ..$ shots       : chr [1:505] "113" "99" "95" "55" ...
##   ..$ key_passes  : chr [1:505] "39" "40" "87" "56" ...
##   ..$ yellow_cards: chr [1:505] "1" "0" "5" "0" ...
##   ..$ red_cards   : chr [1:505] "0" "0" "0" "0" ...
##   ..$ position    : chr [1:505] "F" "F S" "M S" "F M S" ...
##   ..$ team_title  : chr [1:505] "Tottenham" "Liverpool" "Manchester United" "Tottenham" ...
##   ..$ npg         : chr [1:505] "15" "13" "8" "14" ...
##   ..$ npxG        : chr [1:505] "14.605655785650015" "11.627095961943269" "6.5883138151839375" "9.352356541901827" ...
##   ..$ xGChain     : chr [1:505] "20.556765687651932" "21.694580920040607" "22.04182725213468" "17.928756553679705" ...
##   ..$ xGBuildup   : chr [1:505] "3.99019683804363" "8.287332298234105" "8.843060294166207" "5.881684513762593" ...

You can use the cleanup code from the {V8} example to reshape that second element, and readr::type_convert() can help you turn the character vectors into something more useful.
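
As a small illustration of that last step, here is a sketch that runs readr::type_convert() over a hand-typed toy data frame. The values are copied from the first two playersData rows shown above; the object names players and converted are just placeholders.

```r
library(readr)

# Toy stand-in for playersData: every column arrives as character
players <- data.frame(
  player_name = c("Harry Kane", "Mohamed Salah"),
  goals       = c("19", "19"),
  xG          = c("17.650331255048513", "16.19410896115005"),
  stringsAsFactors = FALSE
)

# type_convert() re-guesses the type of each character column
converted <- type_convert(players)

str(converted)
```

The name column stays character while the two statistic columns become numeric, which is usually what you want before summarising or plotting.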

FIN

It really always pays to take a look at the DevTools pane before introducing heavy dependencies. More sites are using very straightforward idioms that make the dynamically rendered page JSON source data readily available. Further, sites often add extra fields that you don’t see rendered, but may be useful to have around as you work with the resulting data.

Greynoise helps security teams focus on potential threats by reducing the noise from logs, alerts, and SIEMs. They constantly watch for badly behaving internet hosts, keep track of the benign ones, and use this research to classify IP addresses. Teams can use these classifications to only focus on things that (potentially) matter.

They also have a generous (10K calls/day), free community API which does not require credentialed access and returns a subset of information that the full API does. This is handy for folks who can’t afford the service or who only need to occasionally poke at IP addresses.

Andrew, GN’s CEO, tweeted out a super-hacky shell one-liner, the other day, that grabs the external IPs of all the ESTABLISHED IPv4 TCP connections and runs them through the community API via curl. Even though I made it a bit less-hacky:

sudo netstat -anp TCP \
  | rg ESTAB \
  | rg "(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" -o \
  | rg -v "(^127\.)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" \
  | rg -v "$(dig +short viz.greynoise.io @9.9.9.9 | rg '^\d' | tr '\n' '|' | sed -e 's/.$//g')" \
  | sort -u \
  | while read IP; do echo $(curl --silent https://api.greynoise.io/v3/community/$IP); done |
  Rscript -e 'tibble::as_tibble(jsonlite::stream_in(file("stdin"), verbose=FALSE))'

it’s still a “run-on-demand” process that you could put in a script and run via launchd, but then you’d still have to keep a terminal up or remember to watch some file. Plus, it relies on full executables.
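
For readers who would rather stay in R, the private-address filter and URL construction from that pipeline can be sketched in base R. This is only an approximation under stated assumptions: the helper names public_ips() and community_url() are made up for illustration, the sample addresses are arbitrary, and the step that drops the viz.greynoise.io addresses is left out. The endpoint is the one used in the curl call above.

```r
# Same private/loopback prefixes the shell pipeline drops with `rg -v`
private_rx <- "^(127\\.|10\\.|172\\.(1[6-9]|2[0-9]|3[01])\\.|192\\.168\\.)"

# Hypothetical helper: keep unique addresses that do not match a private prefix
public_ips <- function(ips) {
  sort(unique(ips[!grepl(private_rx, ips)]))
}

# Hypothetical helper: build the community API URL for each address
community_url <- function(ip) {
  paste0("https://api.greynoise.io/v3/community/", ip)
}

# Arbitrary sample input; real input would come from netstat or similar
ips <- c("10.0.1.5", "8.8.8.8", "192.168.1.20", "172.16.4.1",
         "1.1.1.1", "8.8.8.8", "172.32.0.1")

urls <- community_url(public_ips(ips))
print(urls)

# Fetching is left commented out since it needs network access:
# results <- lapply(urls, jsonlite::fromJSON)
```

Note that 172.32.0.1 survives the filter because only 172.16.0.0 through 172.31.255.255 is private space.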

I decided to make things a bit easier for folks on macOS Big Sur by cranking out a small SwiftUI app I’ve dubbed GreyWatch:

Each list entry shows an IP address your Mac previously connected to (since app launch) or currently has established TCP connections to. The three indicator dots show (in order) whether Greynoise has detected scanning behavior from the IP address within the last 30 days, whether it has a “Rule It OuT” (RIOT) classification, and what classification, if any, the IP address has. The app only shows an IP address once even if you continue to connect to it, and it puts new connections on top.

If an IP address has a classification, double-clicking it will open your default browser to the Greynoise visualizer, otherwise said double-click will take you to the IPInfo entry for the IP address.

Needless to say, if your Mac is talking to a host Greynoise has classified as horribad, your other 99 problems no longer take precedence. I’ll likely add a notification action if that condition occurs.

There’s an “Export…” item in the file menu that lets you save a copy of the current IP list (with metadata) to an ndjson-formatted file.

The app does not shell out to dig or netstat and has a light memory and energy footprint.

There are pre-built, notarized binaries in the releases section, and I’ll gradually be adding features (submit yours via new issues!). You can also submit bug reports or other questions via GH issues.

Many thanks to Andrew and team for their generous free tier, which enables semi-useful community hacks like this one!

Apple M1/Apple Silicon/arm64 macOS can run x86_64 programs via Rosetta and most M1 systems currently (~March 2021) very likely run a mix of x86_64 and arm64 processes.

Activity Monitor can show the architecture:

but command line tools such as ps and top do not due to Apple hiding the details of the proper sysctl() incantations necessary to get this info.

Patrick Wardle reverse engineered Activity Monitor — https://www.patreon.com/posts/45121749 — and I slapped that hack together with some code from Sydney San Martin — https://gist.github.com/s4y/1173880/9ea0ed9b8a55c23f10ecb67ce288e09f08d9d1e5 — into a nascent, bare-bones command line utility: archinfo.

It returns columnar output or JSON (via --json) — that will work nicely with jq — of running processes and their respective architectures.

Build from source or grab from the releases via my git (https://git.rud.is/hrbrmstr/archinfo) or GH (https://github.com/hrbrmstr/archinfo).

$ archinfo
...
   5949  arm64 /System/Library/Frameworks/AudioToolbox.framework/AudioComponentRegistrar
   5923  arm64 /System/Library/CoreServices/LocationMenu.app/Contents/MacOS/LocationMenu
   5901 x86_64 /Library/Application Support/Adobe/Adobe Desktop Common/IPCBox/AdobeIPCBroker.app/Contents/MacOS/AdobeIPCBroker
   5873  arm64 /Applications/Utilities/Adobe Creative Cloud Experience/CCXProcess/CCXProcess.app/Contents/MacOS/../libs/Adobe_CCXProcess.node
   5863  arm64 /bin/sleep
   5861 x86_64 /Applications/Tailscale.app/Contents/PlugIns/IPNExtension.appex/Contents/MacOS/IPNExtension
   5855 x86_64 /Applications/Elgato Control Center.app/Contents/MacOS/Elgato Control Center
   5852 x86_64 /Applications/Tailscale.app/Contents/MacOS/Tailscale
   5849  arm64 /System/Library/CoreServices/TextInputSwitcher.app/Contents/MacOS/TextInputSwitcher
...

The --json output also reads cleanly into R:

library(tidyverse)

arch <- jsonlite::stream_in(textConnection(system("/usr/local/bin/archinfo --json", intern=TRUE)))

arch %>% 
  as_tibble() %>% 
  mutate(
    name = basename(name)
  ) %>% 
  select(
    name, arch
  ) 
## # A tibble: 448 x 2
##    executable                                          arch
##    <chr>                                               <chr>
## ...
## 50 com.apple.WebKit.WebContent                         arm64
## 51 com.apple.WebKit.Networking                         arm64
## 52 com.apple.WebKit.WebContent                         arm64
## 53 RStudio — tycho                                     x86_64
## 54 QtWebEngineProcess                                  x86_64
## 55 VTEncoderXPCService                                 arm64
## 56 rsession-arm64                                      arm64
## 57 RStudio                                             x86_64
## 58 MTLCompilerService                                  arm64
## 59 MTLCompilerService                                  arm64
## 60 coreautha                                           arm64
## ...

table(arch[["arch"]])
##
##  arm64 x86_64
##    419     29

UPDATE 2021-03-14

My original goal was to use Swift for this, but it dawned on me that the vast majority of the codebase is in C, so I’ve removed the Xcode dependency and simplified the build process.

The updated code also now defaults to columnar output. Use --json to return ndjson output.
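
If you would rather work with the default columnar output, it can be split into fields with base R. The sketch below uses three lines copied from the sample output above; the regular expression and the column names pid, arch, and path are assumptions about the layout rather than anything archinfo documents.

```r
# Sample lines copied from the columnar output shown earlier
lines <- c(
  "   5949  arm64 /System/Library/Frameworks/AudioToolbox.framework/AudioComponentRegistrar",
  "   5901 x86_64 /Library/Application Support/Adobe/Adobe Desktop Common/IPCBox/AdobeIPCBroker.app/Contents/MacOS/AdobeIPCBroker",
  "   5863  arm64 /bin/sleep"
)

# Assumed layout: pid, whitespace, architecture, whitespace, then the full
# path (which may itself contain spaces, so it takes the rest of the line)
procs <- utils::strcapture(
  "^\\s*([0-9]+)\\s+(\\S+)\\s+(.*)$",
  lines,
  proto = data.frame(
    pid = integer(), arch = character(), path = character(),
    stringsAsFactors = FALSE
  ),
  perl = TRUE
)

procs$name <- basename(procs$path)

print(procs[, c("pid", "arch", "name")])
table(procs$arch)
```

Capturing the path with a greedy "rest of line" group is what keeps entries such as "Application Support" intact.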

Brim Security maintains a free, Electron-based desktop GUI for exploration of PCAPs and select cybersecurity logs:

along with a broad ecosystem of tools which can be used independently of the GUI.

The standalone or embedded zqd server, as well as the zq command line utility, let analysts run ZQL (a domain-specific query language) queries on cybersecurity data sources.

The Brim team maintains a Python module that works with the zqd HTTP API, and my nascent {brimr} R package provides a similar API structure for performing the same operations in R, along with a wrapper for the zq command line tool.

PCAPs! In! Spaaaaacce[s]!

Brim Desktop organizes input sources into something called “spaces”. We can check for available spaces with brim_spaces():

library(brimr)
library(tibble)

brim_spaces()
##                               id                                                            name
## 1 sp_1p6pwLgtsESYBTHU9PL9fcl2iBn 2021-02-17-Trickbot-gtag-rob13-infection-in-AD-environment.pcap
##                                                                                              data_path storage_kind
## 1 file:///Users/demo/Library/Application%20Support/Brim/data/spaces/sp_1p6pwLgtsESYBTHU9PL9fcl2iBn    filestore

The single space available is a sample capture of Trickbot.

Let’s profile the network connections in this capture:

# ZQL query to fetch Zeek connection data
zql1 <- '_path=conn | count() by id.orig_h, id.resp_h, id.resp_p | sort id.orig_h, id.resp_h, id.resp_p'

space <- "2021-02-17-Trickbot-gtag-rob13-infection-in-AD-environment.pcap"

r1 <- brim_search(space, zql1)

r1
## ZQL query took 0.0000 seconds; 384 records matched; 1,082 records read; 238,052 bytes read

(r1 <- as_tibble(tidy_brim(r1)))
## # A tibble: 74 x 4
##    orig_h      resp_h       resp_p count
##    <chr>       <chr>        <chr>  <int>
##  1 10.2.17.2   10.2.17.101  49787      1
##  2 10.2.17.101 3.222.126.94 80         1
##  3 10.2.17.101 10.2.17.1    445        1
##  4 10.2.17.101 10.2.17.2    53        97
##  5 10.2.17.101 10.2.17.2    88        27
##  6 10.2.17.101 10.2.17.2    123        5
##  7 10.2.17.101 10.2.17.2    135        8
##  8 10.2.17.101 10.2.17.2    137        2
##  9 10.2.17.101 10.2.17.2    138        2
## 10 10.2.17.101 10.2.17.2    389       37
## # … with 64 more rows

Brim auto-processed the PCAP into Zeek log format, and _path=conn in the query string indicates that’s where we’re going to perform further data operations (the queries are structured a bit like jq filters). We then ask Brim/zqd to summarize and sort source IP, destination IP, and port counts. {brimr} sends this query over to the server. The raw response is a custom data structure that we can turn into a tidy data frame via tidy_brim().
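
For readers new to ZQL, the count() by ... | sort ... part of that query is roughly a grouped tally followed by an ordering. A base R analogue on a few made-up connection records might look like the sketch below. The rows are illustrative only (they are not taken from the capture), and this is not how zqd evaluates the query.

```r
# Illustrative connection records: one row per observed connection
conn <- data.frame(
  orig_h = c("10.2.17.101", "10.2.17.101", "10.2.17.101", "10.2.17.2"),
  resp_h = c("10.2.17.2",   "10.2.17.2",   "10.2.17.1",   "10.2.17.101"),
  resp_p = c(53L, 53L, 445L, 49787L),
  stringsAsFactors = FALSE
)

# "count() by id.orig_h, id.resp_h, id.resp_p": tally rows within each group
conn$count <- 1L
summary_df <- aggregate(count ~ orig_h + resp_h + resp_p, data = conn, FUN = sum)

# "sort id.orig_h, id.resp_h, id.resp_p": order by the grouping keys
summary_df <- summary_df[order(summary_df$orig_h, summary_df$resp_h, summary_df$resp_p), ]
rownames(summary_df) <- NULL

print(summary_df)
```

The two port 53 rows collapse into a single record with a count of 2, which is the same shape as the tidy_brim() result above.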

We can do something similar with the Suricata data that Brim also auto-processes for us:

# Z query to fetch Suricata alerts including the count of alerts per source:destination 
zql2 <- "event_type=alert | count() by src_ip, dest_ip, dest_port, alert.severity, alert.signature | sort src_ip, dest_ip, dest_port, alert.severity, alert.signature"

r2 <- brim_search(space, zql2)

r2
## ZQL query took 0.0000 seconds; 47 records matched; 870 records read; 238,660 bytes read

(r2 <- (as_tibble(tidy_brim(r2))))
## # A tibble: 35 x 6
##    src_ip     dest_ip    dest_port severity signature                                                              count
##    <chr>      <chr>          <int>    <int> <chr>                                                                  <int>
##  1 10.2.17.2  10.2.17.1…     49674        3 SURICATA Applayer Detect protocol only one direction                       1
##  2 10.2.17.2  10.2.17.1…     49680        3 SURICATA Applayer Detect protocol only one direction                       1
##  3 10.2.17.2  10.2.17.1…     49687        3 SURICATA Applayer Detect protocol only one direction                       1
##  4 10.2.17.2  10.2.17.1…     49704        3 SURICATA Applayer Detect protocol only one direction                       1
##  5 10.2.17.2  10.2.17.1…     49709        3 SURICATA Applayer Detect protocol only one direction                       1
##  6 10.2.17.2  10.2.17.1…     49721        3 SURICATA Applayer Detect protocol only one direction                       1
##  7 10.2.17.2  10.2.17.1…     50126        3 SURICATA Applayer Detect protocol only one direction                       1
##  8 10.2.17.1… 3.222.126…        80        2 ET POLICY curl User-Agent Outbound                                         1
##  9 10.2.17.1… 36.95.27.…       443        1 ET HUNTING Suspicious POST with Common Windows Process Names - Possib…     1
## 10 10.2.17.1… 36.95.27.…       443        1 ET MALWARE Win32/Trickbot Data Exfiltration                                1
## # … with 25 more rows

Finally, for this toy example, we’ll also generate a visual overview of these connections:

library(igraph)
library(ggraph)
library(tidyverse)

gdf <- count(r1, orig_h, resp_h, wt=count)

count(gdf, node = resp_h, wt=n, name = "in_degree") %>% 
  full_join(
    count(gdf, node = orig_h, name = "out_degree")
  ) %>% 
  mutate_at(
    vars(in_degree, out_degree),
    replace_na, 1
  ) %>% 
  arrange(in_degree) -> vdf

g <- graph_from_data_frame(gdf, vertices = vdf)

ggraph(g, layout = "linear") +
  geom_node_point(
    aes(size = in_degree), shape = 21
  ) +
  geom_edge_arc(
    width = 0.125, 
    arrow = arrow(
      length = unit(5, "pt"),
      type = "closed"
    )
  )

We can also process log files directly (i.e. without any server) with zq_cmd():

zq_cmd(
  c(
    '"* | cut ts,id.orig_h,id.orig_p"', # note the quotes
    system.file("logs", "conn.log.gz", package = "brimr")
   )
 )
##           id.orig_h id.orig_p                          ts
##   1:  10.164.94.120     39681 2018-03-24T17:15:21.255387Z
##   2:    10.47.25.80     50817 2018-03-24T17:15:21.411148Z
##   3:    10.47.25.80     50817 2018-03-24T17:15:21.926018Z
##   4:    10.47.25.80     50813 2018-03-24T17:15:22.690601Z
##   5:    10.47.25.80     50813 2018-03-24T17:15:23.205187Z
##  ---                                                     
## 988: 10.174.251.215     33003 2018-03-24T17:15:21.429238Z
## 989: 10.174.251.215     33003 2018-03-24T17:15:21.429315Z
## 990: 10.174.251.215     33003 2018-03-24T17:15:21.429479Z
## 991:  10.164.94.120     38265 2018-03-24T17:15:21.427375Z
## 992: 10.174.251.215     33003 2018-03-24T17:15:21.433306Z

FIN

This package is less than 24 hrs old (as of the original blog post date) and there are still a few bits missing, which means y’all have the ability to guide the direction it heads in. So kick the tyres and interact where you’re most comfortable.

Horrible puns aside, hopefully everyone saw the news, earlier this week, from @thomasp85 on the evolution of modern typographic capabilities in the R ecosystem. Thomas (and some cohorts) has been working on {systemfonts}, {ragg}, and {textshaping} for quite a while now, and the — shall we say tidyglyphs ecosystem — is super-ready for prime time.

Thomas covered a seriously large amount of ground in his post, so please take some time to digest that before continuing.

Back? 👍🏽

While it is possible to manage typographic needs with the foundry tools provided via the font-rendering package-triad, one would be hard-pressed to say that the following is “fun”, or even truly manageable coding:

library(systemfonts)

register_variant(
  name = "some-unique-prefix Inter some-style-01",
  weight = "normal",
  features = font_feature(
    poss = 1, ibly = 1, many = 1, 
    four = 1, char = 1, open = 1,
    type = 1, code = 1, spec = 1
  )
)

# remember that name

register_variant(
  name = "some-unique-prefix Inter some-style-02",
  weight = "normal",
  features = font_feature(
    poss = 1, ibly = 1, many = 1, 
    four = 1, char = 1, open = 1,
    type = 1, code = 1, spec = 1
  )
)

# remember that name 

# add a dozen more lines ...

ggplot() +
   geom_text(family = "oops-i-just-misspelled-the-family-name-*again*", ...) 

We’ve been given the power to level up our chart typography, but it’s sort of where literal typesetters (the ones who put blocks of type into a press) were and we can totally make our lives easier and charts prettier with the help of a new package — {hrbragg} https://git.rud.is/hrbrmstr/hrbragg — which is somewhat of a bridge between {ragg}, {systemfonts}, {textshaping} and a surprisingly popular package of mine: {hrbrthemes}. {hrbragg} is separate from {hrbrthemes} since this new typographic ecosystem is fairly restricted to {ragg} graphics devices (for the moment, as Thomas alluded the other day), and the new themes provided in {hrbragg} are a bit of a level-up from those in its sibling package.

Feature Management

At the heart of {systemfonts} lies the ability to tweak font features and bend them to your will. This somewhat old post shows why these tweaks exist and delves (but not too deeply) into the details of them, down to the four-letter codes that are used to represent and work with a given feature. But, what does calt mean? And, what is this tnum fellow you’ll be seeing a great deal of in R-land over the coming months? While one could leave the comfort of RStudio, VS Code, or vim to visit one of the reference links in Thomas’ package or {hrbragg}, I’ve included the most recent copy of tag-code<->full-tag-name<->short-tag-description in {hrbragg} as a usable data frame so you can treat it like the data it is!

library(systemfonts) # access to and tweaking OTFs!
library(textshaping) # lets us treat type as data
library(ragg)        # because it'll be lonely w/o the other two
library(hrbragg)     # remotes::install_git("https://git.rud.is/hrbrmstr/hrbragg.git")
library(tidyverse)   # nice printing, {ggplot2}, and b/c we'll do some font data wrangling

data("feature_dict")

feature_dict
## # A tibble: 122 x 3
##    tag   long_name                   description                                                                                             
##    <chr> <chr>                       <chr>                                                                                                   
##  1 aalt  Access All Alternates       Special feature: used to present user with choice all alternate forms of the character                  
##  2 abvf  Above-base Forms            Replaces the above-base part of a vowel sign. For Khmer and similar scripts.                            
##  3 abvm  Above-base Mark Positioning Positions a mark glyph above a base glyph.                                                              
##  4 abvs  Above-base Substitutions    Ligates a consonant with an above-mark.                                                                 
##  5 afrc  Alternative Fractions       Converts figures separated by slash with alternative stacked fraction form                              
##  6 akhn  Akhand                      Hindi for unbreakable.  Ligates consonant+halant+consonant, usually only for k-ss and j-ny combinations.
##  7 blwf  Below-base Forms            Replaces halant+consonant combination with a subscript form.                                            
##  8 blwm  Below-base Mark Positioning Positions a mark glyph below a base glyph                                                               
##  9 blws  Below-base Substitutions    Ligates a consonant with a below-mark.                                                                  
## 10 c2pc  Capitals to Petite Caps     Substitutes capital letters with petite caps                                                            
## # … with 112 more rows

You can also use help("opentype_typographic_features") to see an R help page with the same information. That page also has links to external resources, one of which is a detailed manual of each feature with use-cases (in the event even the short description is not as helpful as it could be).

Before one can think about using the bare-metal register_variant(..., font_feature(...)) duo, one has to know what features a particular type family supports. We can retrieve the feature codes with textshaping::get_font_features() and look them up in this data frame to get an at-a-glance view:

# old school subsetting ftw!
feature_dict[feature_dict$tag %in% textshaping::get_font_features("Inter")[[1]],]
## # A tibble: 19 x 3
##    tag   long_name                   description                                                                                                                 
##    <chr> <chr>                       <chr>                                                                                                                       
##  1 aalt  Access All Alternates       Special feature: used to present user with choice all alternate forms of the character                                      
##  2 calt  Contextual Alternates       Applies a second substitution feature based on a match of a character pattern within a context of surrounding patterns      
##  3 case  Case Sensitive Forms        Replace characters, especially punctuation, with forms better suited for all-capital text, cf. titl                         
##  4 ccmp  Glyph Composition/Decompos… Either calls a ligature replacement on a sequence of characters or replaces a character with a sequence of glyphs. Provides…
##  5 cpsp  Capital Spacing             Adjusts spacing between letters in all-capitals text                                                                        
##  6 dlig  Discretionary Ligatures     Ligatures to be applied at the user's discretion                                                                            
##  7 dnom  Denominator                 Converts to appropriate fraction denominator form, invoked by frac                                                          
##  8 frac  Fractions                   Converts figures separated by slash with diagonal fraction                                                                  
##  9 kern  Kerning                     Fine horizontal positioning of one glyph to the next, based on the shapes of the glyphs                                     
## 10 locl  Localized Forms             Substitutes character with the preferred form based on script language                                                      
## 11 mark  Mark Positioning            Fine positioning of a mark glyph to a base character                                                                        
## 12 numr  Numerator                   Converts to appropriate fraction numerator form, invoked by frac                                                            
## 13 ordn  Ordinals                    Replaces characters with ordinal forms for use after numbers                                                                
## 14 pnum  Proportional Figures        Replaces numerals with glyphs of proportional width, often also onum                                                        
## 15 salt  Stylistic Alternates        Either replaces with, or displays list of, stylistic alternatives for a character                                           
## 16 subs  Subscript                   Replaces character with subscript version, cf. numr                                                                         
## 17 sups  Superscript                 Replaces character with superscript version, cf. dnom                                                                       
## 18 tnum  Tabular Figures             Replaces numerals with glyphs of uniform width, often also lnum                                                             
## 19 zero  Slashed Zero                Replaces 0 figure with slashed 0        

Most of those will not be super-useful (yet) but there are three key features that I believe one needs when picking a font for a chart:

  • One of the *ligs (because ligatures are so gosh darn cool, pretty, and useful)
  • tnum for tabular numbers (essential in axis value display, and more)
  • kern for sweet, sweet letterspacing, or kerning

Since I’ve just made up a rule, let’s see how many fonts I have that support said rule:

(fam <- unique(system_fonts()[["family"]])) %>% 
  get_font_features() %>% 
  set_names(fam) %>% 
  keep(~sum(c(
    any(grepl("kern", .)), 
    any(grepl("tnum", .)),
    any(grepl(".lig|liga", .)) 
    )) == 3
  ) %>% 
  names() %>% 
  sort()
##  [1] "Barlow"                 "Goldman Sans"           "Goldman Sans Condensed" "Grantha Sangam MN"     
##  [5] "Inter"                  "Kohinoor Devanagari"    "Mukta Mahee"            "Museo Slab"            
##  [9] "Neufile Grotesk"        "Roboto"                 "Roboto Black"           "Roboto Condensed"      
## [13] "Roboto Light"           "Roboto Medium"          "Roboto Thin"            "Tamil Sangam MN"       
## [17] "Trattatello"           

I do have more, but they’re on a different Mac 😎.

{hrbragg} comes with Inter, Goldman Sans, and Roboto Condensed, so let’s explore one of them — Inter — and see how we might be able to make it useful but not tedious. The supported features of Inter are above and here are the family members:

system_fonts() %>% 
  filter(family == "Inter") %>% 
  select(name, family, style, weight, width, italic, monospace)
## # A tibble: 18 x 7
##    name                   family style              weight     width  italic monospace
##    <chr>                  <chr>  <chr>              <ord>      <ord>  <lgl>  <lgl>    
##  1 Inter-ExtraLight       Inter  Extra Light        light      normal FALSE  FALSE    
##  2 Inter-MediumItalic     Inter  Medium Italic      medium     normal TRUE   FALSE    
##  3 Inter-ExtraLightItalic Inter  Extra Light Italic light      normal TRUE   FALSE    
##  4 Inter-Bold             Inter  Bold               bold       normal FALSE  FALSE    
##  5 Inter-ThinItalic       Inter  Thin Italic        ultralight normal TRUE   FALSE    
##  6 Inter-SemiBold         Inter  Semi Bold          semibold   normal FALSE  FALSE    
##  7 Inter-BoldItalic       Inter  Bold Italic        bold       normal TRUE   FALSE    
##  8 Inter-Italic           Inter  Italic             normal     normal TRUE   FALSE    
##  9 Inter-Medium           Inter  Medium             medium     normal FALSE  FALSE    
## 10 Inter-BlackItalic      Inter  Black Italic       heavy      normal TRUE   FALSE    
## 11 Inter-Light            Inter  Light              normal     normal FALSE  FALSE    
## 12 Inter-SemiBoldItalic   Inter  Semi Bold Italic   semibold   normal TRUE   FALSE    
## 13 Inter-Regular          Inter  Regular            normal     normal FALSE  FALSE    
## 14 Inter-ExtraBoldItalic  Inter  Extra Bold Italic  ultrabold  normal TRUE   FALSE    
## 15 Inter-LightItalic      Inter  Light Italic       normal     normal TRUE   FALSE    
## 16 Inter-Thin             Inter  Thin               ultralight normal FALSE  FALSE    
## 17 Inter-ExtraBold        Inter  Extra Bold         ultrabold  normal FALSE  FALSE    
## 18 Inter-Black            Inter  Black              heavy      normal FALSE  FALSE    

Nobody. I mean, nobody wants to type eighteen+ font variant registration statements, which is why {hrbragg} comes with reconfigure_font(). Just give it the family name, the features you want supported, and it will take care of the tedium for you:

reconfigure_font(
  prefix = "clever-prefix",
  family = "Inter",
  width = "normal",
  ligatures = "discretionary",
  calt = 1, tnum = 1, case = 1,
  dlig = 1, ss01 = 1, kern = 1,
  zero = 0, salt = 0
) -> customized_inter

# I'll have a proper print method for this soon

str(customized_inter, 1)
## List of 17
##  $ ultralight_italic: chr "clever-prefix Inter Thin Italic"
##  $ ultralight       : chr "clever-prefix Inter Thin"
##  $ light            : chr "clever-prefix Inter Extra Light"
##  $ light_italic     : chr "clever-prefix Inter Extra Light Italic"
##  $ normal_italic    : chr "clever-prefix Inter Light Italic"
##  $ normal_light     : chr "clever-prefix Inter Light"
##  $ normal           : chr "clever-prefix Inter Regular"
##  $ medium_italic    : chr "clever-prefix Inter Medium Italic"
##  $ medium           : chr "clever-prefix Inter Medium"
##  $ semibold         : chr "clever-prefix Inter Semi Bold"
##  $ semibold_italic  : chr "clever-prefix Inter Semi Bold Italic"
##  $ bold             : chr "clever-prefix Inter Bold"
##  $ bold_italic      : chr "clever-prefix Inter Bold Italic"
##  $ ultrabold_italic : chr "clever-prefix Inter Extra Bold Italic"
##  $ ultrabold        : chr "clever-prefix Inter Extra Bold"
##  $ heavy_italic     : chr "clever-prefix Inter Black Italic"
##  $ heavy            : chr "clever-prefix Inter Black"
##  - attr(*, "family")= chr "Inter"

The reconfigure_font() function applies the feature settings to all the family members, gives each a name with the stated prefix and provides a return value that supports autocompletion of the name in smart IDEs and practically negates the need to type out long, unique font names, like this:

ggplot() +
  geom_text(
    aes(1, 2, label = "Welcome to a <- customized -> Inter!"),
    size = 6, family = customized_inter$ultrabold
  ) +
  theme_void()

Note that we have a lovely emboldened font with clean ligatures without much work at all! (I should mention that if a prefix is not specified, a UUID is chosen instead since we don’t really care about the elongated names anymore).

While we’ve streamlined things a bit already, we can do even better.

Font-centric Themes

Just like {hrbrthemes}, {hrbragg} comes with some font/typographic-centric themes. We’ll focus on the one with Inter for the blog post. For the moment, you’ll need to install_inter() (you likely got prompted to do that if you already installed the package). This requirement will go away soon, but you’ll want to use Inter everywhere anyway, so I’d keep it installed.

Once that’s done, you’re ready to use theme_inter().

What’s that you say? Don’t we need to create a font variant first?

Would I do that to you? Never! {hrbragg} comes with a preconfigured inter_pkg font variant (which I’ll be tweaking a bit over the weekend for some edge cases) that pairs nicely with theme_inter(). Here it is in action with an old friend of ours:

ggplot() +
  geom_point(
    data = mtcars,
    aes(mpg, wt, color = factor(cyl))
  ) +
  geom_label(
    aes(
      x = 15, y = 5.48,
      label = "<- A fairly useless annotation\n       that uses the custom Inter\n          variant by default."
    ),
    label.size = 0, hjust = 0, vjust = 1
  ) +
  labs(
    x = "Fuel efficiency (mpg)", y = "Weight (tons)",
    title = "Seminal ggplot2 scatterplot example",
    subtitle = "A plot that is only useful for demonstration purposes",
    caption = "Brought to you by the letter 'g'"
  ) -> gg1

gg1 + theme_inter(grid = "XY", mode = "light") 

Wonderful kerning, a custom-built arrow due to fantastic, built-in ligatures, and spiffy tabular numbers. Gorgeous!

What was that you just asked? What’s up with that mode = "light"? Did I forget to mention that all the {hrbragg} themes come with dark-mode support built in? My sincerest apologies. Choosing mode = "dark" will use a (configurable) dark theme and using mode = "rstudio" (if you’re an RStudio user) will have the charts take on the IDE theme setting automagically. Here’s dark mode:

gg1 + theme_inter(grid = "XY", mode = "dark") 

The font+theme pairs automatically work and reconfigure all the ggplot2 aesthetic defaults accordingly. Since this makes heavy use of update_geom_defaults() I’ve included a (very necessary) reset_ggplot2_defaults() to get things back to normal when you need to.

Note that you can use adaptive_color() to help enable dark/light-mode color switching for your own pairings, and theme_background_color() or theme_foreground_color() to utilize the (reconfigurable) default fore- and background theme colors.

Try before you buy…into using a given font

One can’t know ahead of time whether a font is going to work well, and you might want to get a feel for how a given set of family variants work for you. To that end, I’ve made it possible to preview any font you’ve reconfigured with reconfigure_font() via preview_variant(). It uses some pre-set text that exercises the key features I’ve outlined, but you can sub your own for them if you want to look at something in particular. Let’s give inter_pkg a complete look:

preview_variant(inter_pkg)

We can look at another one that we’ll create now (I did not realize this font had tabular numbers until Thomas built all these wonderful toys to play with!):

reconfigure_font(
  family = "Trattatello",
  width = "normal",
  ligatures = "discretionary",
  calt = 1, tnum = 1, case = 1,
  dlig = 1, kern = 1,
  zero = 0, salt = 0
) -> trat

preview_variant(trat)

FIN

The {hrbragg} package is not even 24 hours old yet, so there are breaking changes and many new, heh, features still to come, but please — as usual — kick the tyres and post questions, feedback, contributions, or suggestions wherever you’re most comfortable (the package is on most of the popular social coding sites).

💙 EKG code
library(readr)        # read_csv()
library(dplyr)        # mutate(), n(), %>%
library(hrbrthemes)
library(elementalist) # remotes::install_github("teunbrand/elementalist")
library(ggplot2)

read_csv(
  file = "~/Data/apple_health_export/electrocardiograms/ecg_2020-09-24.csv", # this is extracted below
  skip = 12,
  col_names = "µV"
) %>% 
  mutate(
    idx = 1:n()
  ) -> ekg

ggplot() +
  geom_line_theme(
    data = ekg %>% tail(3000) %>% head(2500),
    aes(idx, µV),
    size = 0.125, color = "#cb181d"
  ) +
  labs(x = NULL, y = NULL) +
  theme_ipsum_inter(grid="") +
  theme(
    panel.background = element_rect(color = NA, fill = "#141414"),
    plot.background = element_rect(color = NA, fill = "#141414")
  ) +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    elementalist.geom_line= element_line_glow()
  )

Apple Watch owners have the ability to export their tracked data and do whatever they like with it. Since it’s Valentine’s Day, I thought it might be fun to show two ways to read heart rate data from these exports.

Why two ways? Well, I’ve owned an Apple Watch off-and-on ever since the first generation device, and when Apple says you can export all your data, they mean all. The apple_health_export.zip archive is generated by going to the “Health” iOS app, tapping your avatar in the upper left, then scrolling down and tapping the export button:

apple health data export screenshot

(NOTE: I suggest saving it to and then downloading it from iCloud vs using local AirDrop to your system.)

This compressed file is a deceptively small ~58 MB. Unpacking it results in a directory tree that consumes nearly 3 GB of drive space O_o. That tree has the following structure:

fs::dir_tree("~/Data/apple_health_export", recurse = 1)
## ~/Data/apple_health_export
## ├── electrocardiograms
## │   └── ecg_2020-09-24.csv             # 122 KB
## ├── export.xml                         # 882 MB
## ├── export_cda.xml                     # 950 MB
## └── workout-routes                     #  81 MB
##     ├── ...
##     ├── route_2021-01-28_5.21pm.gpx
##     ├── route_2021-01-31_4.28pm.gpx
##     ├── route_2021-02-02_1.26pm.gpx
##     ├── route_2021-02-04_3.52pm.gpx
##     ├── route_2021-02-06_2.24pm.gpx
##     └── route_2021-02-10_4.54pm.gpx

The heart rate data is in the just-under 1 GB export.xml and is mixed in with all the other data points Apple records. Heart rate records look like this:

<Record 
  type="HKQuantityTypeIdentifierHeartRate" 
  sourceName="Apple Watch" 
  sourceVersion="3.2" 
  device="<<HKDevice: 0x2812d8a00>, name:Apple Watch, manufacturer:Apple, model:Watch, hardware:Watch1,2, software:3.2>" 
  unit="count/min" 
  creationDate="2017-04-29 12:21:15 -0500" 
  startDate="2017-04-29 12:21:15 -0500" 
  endDate="2017-04-29 12:21:15 -0500" 
  value="102"
/>

Note that newer records of this type are not empty (self-closing) tags like the one above.

While dealing with gigabyte+ XML files is not nearly as untenable as it used to be in R, building a parsed XML tree in memory for all of those records will take up a not-insignificant amount of RAM (we’ll see how much below). Since I want to start playing with this data more often, I decided to try two approaches: one that processes the XML in streaming “chunks” and one that does it the way you’re likely used to (if you’re unfortunate enough to have to work with XML regularly).

Streaming 💙 Beats

We’ll start with the streaming approach, which means using the venerable {XML} package. Its xmlEventParse() function is an event-driven, or SAX (Simple API for XML) style, parser: it processes XML without building the tree, instead identifying tokens in the stream of characters and passing them to handlers which can make sense of them in context. Since we’re going old-school, we’ll also use {data.table} to get a tidy dataset to work with.

We’re going to be finding heart rate records and storing the data from them into a list, so we’ll need to make room for them and use index-based value assignments to avoid making thousands of copies with append(). To figure out how much room we’ll need, I’m going to “cheat” a bit and use ripgrep to count how many HKQuantityTypeIdentifierHeartRate records exist and use that result to reserve list space:

library(XML)
library(data.table)

nl <- system("rg -c 'type=\"HKQuantityTypeIdentifierHeartRate' ~/Data/apple_health_export/export.xml", intern = TRUE)
records <- vector(mode = "list", as.numeric(nl))
idx <- 1
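
If ripgrep isn’t installed, a rough base R stand-in (a sketch, not what was used for the numbers in this post) is to read the file in chunks and count matching lines, which mirrors what rg -c reports. The count_matching_lines() helper and the tiny demo file are made up for illustration:

```r
# Hypothetical helper: count lines containing a fixed string without
# loading the whole file into memory at once
count_matching_lines <- function(path, pattern, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break
    total <- total + sum(grepl(pattern, lines, fixed = TRUE))
  }
  total
}

# Tiny made-up file standing in for export.xml
demo <- tempfile(fileext = ".xml")
writeLines(c(
  '<HealthData>',
  ' <Record type="HKQuantityTypeIdentifierHeartRate" value="61"/>',
  ' <Record type="HKQuantityTypeIdentifierStepCount" value="12"/>',
  ' <Record type="HKQuantityTypeIdentifierHeartRate" value="72"/>',
  '</HealthData>'
), demo)

n_hr <- count_matching_lines(demo, 'type="HKQuantityTypeIdentifierHeartRate')

# Reserve one slot per match, as in the code above
records_demo <- vector(mode = "list", length = n_hr)
length(records_demo)
```

Like the rg pattern, this one has no closing quote, so it can over-count a little; that’s fine because the result is only used to reserve space. On the real export you’d point it at ~/Data/apple_health_export/export.xml instead of the demo file.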

There are just under 790K records buried in that file. The xmlEventParse() function has a handlers parameter which takes a list of named functions for various events. The event we care about is the one where we start processing an XML element, which is unsurprisingly called startElement. In it, we’ll only process HKQuantityTypeIdentifierHeartRate records and further only care about data since 2019:

invisible(xmlEventParse(
  file = "~/Data/apple_health_export/export.xml",
  handlers = list(

    # process at element start

    startElement = function(name, attrs) {

      # only care about the heart rate recs

      if ((name == "Record") && (attrs["type"] == "HKQuantityTypeIdentifierHeartRate")) {

        # only care about records >= the year 2019

        if (substr(attrs["endDate"], 1, 4) >= "2019") {

          # if we find them, add them to the list (note the <<-)
          records[idx] <<- list(as.list(unname(attrs[c("endDate", "value")]))) # not using names reduces memory
          idx <<- idx + 1

        }
      }
    }
  )
))

At this point we have a list of all those records and have taken the R session memory from 131 MiB to 629 MiB (so we’re eating roughly 500 MiB of RAM with that call), and it took around 34 painful seconds to process the XML file.
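
To poke at the handler logic without waiting on a giant file, here’s the same idea on a tiny, made-up document passed in as text (asText = TRUE). The toy_xml, found, and n_found names are illustrative only:

```r
library(XML)

# Three made-up records: only the last one is a heart rate reading from 2019+
toy_xml <- '<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate" endDate="2018-12-31 23:59:00 -0500" value="61"/>
  <Record type="HKQuantityTypeIdentifierStepCount" endDate="2020-01-01 08:00:00 -0500" value="12"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" endDate="2020-01-01 08:05:00 -0500" value="72"/>
</HealthData>'

found <- vector(mode = "list", length = 3) # pre-allocated, one slot per record
n_found <- 0L

invisible(xmlEventParse(
  toy_xml,
  asText = TRUE,
  handlers = list(
    # called once for every opening tag the parser encounters
    startElement = function(name, attrs) {
      if (name == "Record" &&
          attrs[["type"]] == "HKQuantityTypeIdentifierHeartRate" &&
          substr(attrs[["endDate"]], 1, 4) >= "2019") {
        n_found <<- n_found + 1L
        found[[n_found]] <<- list(attrs[["endDate"]], attrs[["value"]])
      }
    }
  )
))

found <- found[seq_len(n_found)] # drop the slots we never filled
str(found)
```

Only the third record survives both filters; the tree for the document is never built.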

Now, we’ll use {data.table} to tidy it up:

records <- records[lengths(records) != 0]         # get rid of any list elements we didn't use

records <- rbindlist(records, use.names = FALSE)  # make a data frame
setattr(records, 'names', c("ts", "rate"))

records[, c("ts", "rate") := list(
  as.POSIXct(ts, format = "%Y-%m-%d %H:%M:%S %z"),
  as.integer(rate)
)]  
##                          ts rate
##      1: 2019-02-12 15:19:54   69
##      2: 2019-02-12 15:26:11   90
##      3: 2019-02-12 15:31:33   92
##      4: 2019-02-12 15:34:24   89
##      5: 2019-02-12 15:57:33  120
##     ---                         
## 734526: 2021-02-13 10:17:08  118
## 734527: 2021-02-13 10:26:50  124
## 734528: 2021-02-13 10:22:56  110
## 734529: 2021-02-13 10:34:56   98
## 734530: 2021-02-13 10:39:34   99

That took around 4.5 seconds, and when the R garbage collector kicks in we’re now consuming ~695 MiB, so not much more than the previous step.
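
One thing worth knowing about that as.POSIXct() call: %z consumes the UTC offset in the string, and the result is then displayed in the session’s time zone unless you pass tz. A small sketch using the timestamp from the sample record shown earlier (pinning tz = "UTC" is my choice here, just to make the result predictable):

```r
ts_raw <- "2017-04-29 12:21:15 -0500"

# %z reads the "-0500" offset; tz only controls how the instant is displayed
ts_utc <- as.POSIXct(ts_raw, format = "%Y-%m-%d %H:%M:%S %z", tz = "UTC")

# 12:21:15 at five hours behind UTC is 17:21:15 UTC
format(ts_utc, "%Y-%m-%d %H:%M:%S %Z")
```

So if the timestamps in your output look shifted relative to the raw XML, the session time zone is the likely reason.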

So, ~38s for the ingestion & conversion, and a maximum of ~695 MiB in play at any time during the R session. Let’s see how the new/modern way (i.e. {xml2}) compares.

Modern 💙

Unless I missed something in the {xml2} index page, there is no equivalent streaming processor, so we have to read the entire document into active RAM:

library(xml2)
library(tidyverse)

records <- xml2::read_xml("~/Data/apple_health_export/export.xml")

This operation takes 15.7s and the R session now consumes ~5.8 GiB of RAM. That is a “G”, as in gigabyte.

Now, we’ll find all the records that we care about (as above). We’ll do this via a modest XPath selector:

xml_find_all(
  records,
  xpath = "
    .//Record[
         @type = 'HKQuantityTypeIdentifierHeartRate' and
         (starts-with(@endDate, '2019') or 
          starts-with(@endDate, '2020') or 
          starts-with(@endDate, '2021'))
      ]"
) -> records

That operation took around 6.5s and we’re still consuming around 6.23 GiB of RAM.
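
The same selector can be tried out on a tiny, made-up document first. This sketch also swaps the three starts-with() tests for a numeric comparison on the year (XPath 1.0 converts the substring to a number for >=), which is an alternative I’m suggesting rather than what the code above does; toy_doc and hits are illustrative names:

```r
library(xml2)

# Three made-up records: only the last one is a heart rate reading from 2019+
toy_doc <- read_xml('<HealthData>
  <Record type="HKQuantityTypeIdentifierHeartRate" endDate="2018-12-31 23:59:00 -0500" value="61"/>
  <Record type="HKQuantityTypeIdentifierStepCount" endDate="2020-01-01 08:00:00 -0500" value="12"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" endDate="2020-01-01 08:05:00 -0500" value="72"/>
</HealthData>')

hits <- xml_find_all(
  toy_doc,
  xpath = ".//Record[
    @type = 'HKQuantityTypeIdentifierHeartRate' and
    number(substring(@endDate, 1, 4)) >= 2019
  ]"
)

xml_attr(hits, "value")
```

The numeric form doesn’t need a new starts-with() clause every January, though I have not timed it against the original on the full export.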

Now, we’ll tidy that up:

tibble(
  ts = records %>% 
    xml_attr("endDate") %>% 
    as.POSIXct(format = "%Y-%m-%d %H:%M:%S %z"),  
  rate = records %>% 
    xml_attr("value") %>% 
    as.integer()
) -> records

records
## # A tibble: 734,530 x 2
##    ts                   rate
##    <dttm>              <int>
##  1 2019-02-12 15:19:54    69
##  2 2019-02-12 15:26:11    90
##  3 2019-02-12 15:31:33    92
##  4 2019-02-12 15:34:24    89
##  5 2019-02-12 15:57:33   120
##  6 2019-02-12 15:44:09    80
##  7 2019-02-12 16:03:24   110
##  8 2019-02-12 16:13:08   118
##  9 2019-02-12 16:08:10   100
## 10 2019-02-12 16:15:04    95
## # … with 734,520 more rows

That took around 10.4s and, after garbage collection happens, we’re back to a much more reasonable ~890 MiB of consumed RAM after a workflow maximum of over 6 GiB, taking a total of ~32.6 seconds.
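
If you want to take similar measurements yourself, one rough way (a sketch; the numbers in this post may well have been gathered differently) is to pair system.time() with gc(), whose last column reports the maximum memory used, in Mb, since the last gc(reset = TRUE). The junk workload below is a made-up stand-in for the real parsing step:

```r
# Reset the "max used" counters before the step being measured
invisible(gc(reset = TRUE))

timing <- system.time({
  # stand-in workload: a list of small two-element lists
  junk <- lapply(seq_len(1e5), function(i) list("2019-02-12 15:19:54", "69"))
})

mem <- gc()

elapsed <- timing[["elapsed"]]     # wall-clock seconds for the step
peak_mb <- sum(mem[, ncol(mem)])   # Ncells + Vcells "max used", in Mb

c(elapsed = elapsed, peak_mb = peak_mb)
```

Keep in mind this only covers memory managed by R itself, so it can under-report what a C library is holding on to behind the scenes.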

FIN 💙

If/when memory is tight, it’s nice to have some alternatives besides “get a bigger box”, and this is one approach (there are others) for performing this type of XML surgery in R.

Stay safe/strong, folks.