Search Results for: reticulate

I had to process a bunch of emails for a $DAYJOB task this week and my “default setting” is to use R for pretty much everything (this should come as no surprise). Treating mail as data is not an uncommon task and many R packages exist that can reach out and grab mail from servers or work directly with local mail archives.

Mbox’in off the rails on a crazy tm [1]

This particular mail corpus is in mbox format since it was saved via Apple Mail. It’s one big text file with each message appearing one after the other. The format has been around for decades, and R’s tm package (via the tm.plugin.mail plugin package) can process these mbox files.
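As a rough illustration of that "one message after the other" layout, here is a minimal base R sketch that splits mbox-style text into messages at the "From " separator lines. The two messages are made up for the example, and real archives need more care (body lines that begin with "From " are normally escaped, for instance), so treat it as a sketch under those assumptions and not as a replacement for a proper parser.

```r
# Hypothetical two-message mbox, inlined so the sketch runs without a file
mbox_lines <- c(
  "From alice@example.com Wed Aug 01 15:02:23 2018",
  "From: Alice <alice@example.com>",
  "Subject: first",
  "",
  "Hello there.",
  "",
  "From bob@example.com Wed Aug 01 16:10:00 2018",
  "From: Bob <bob@example.com>",
  "Subject: second",
  "",
  "Hi back."
)

# Each message starts at a "From " separator line (note: no colon after From)
starts <- grepl("^From ", mbox_lines)

# cumsum() over the logical vector assigns a message number to every line
messages <- split(mbox_lines, cumsum(starts))

length(messages) # 2
```

The "From:" header lines are not mistaken for separators because the pattern requires a space directly after "From".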

To demonstrate the tm route, we’ll use an Apple Mail archive excerpt from a set of R mailing list messages as they are not private/sensitive:

library(tm)
library(tm.plugin.mail)

# point the tm corpus machinery to the mbox file and let it know the timestamp format since it varies
VCorpus(
  MBoxSource("~/Data/test.mbox/mbox"),
  readerControl = list(
    reader = readMail(DateFormat = "%a, %e %b %Y %H:%M:%S %z")
  )
) -> mbox

str(unclass(mbox), 1)
## List of 3
##  $ content:List of 198
##  $ meta   : list()
##   ..- attr(*, "class")= chr "CorpusMeta"
##  $ dmeta  :'data.frame': 198 obs. of  0 variables

str(unclass(mbox[[1]]), 1)
## List of 2
##  $ content: chr [1:476] "Try this:" "" "> library(lubridate)" "> library(tidyverse)" ...
##  $ meta   :List of 9
##   ..- attr(*, "class")= chr "TextDocumentMeta"

str(unclass(mbox[[1]]$meta), 1)
## List of 9
##  $ author       : chr "jim holtman "
##  $ datetimestamp: POSIXlt[1:1], format: "2018-08-01 15:01:17"
##  $ description  : chr(0) 
##  $ heading      : chr "Re: [R] read txt file - date - no space"
##  $ id           : chr ""
##  $ language     : chr "en"
##  $ origin       : chr(0) 
##  $ header       : chr [1:145] "Delivered-To: bob@rud.is" "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id b1-v6csp950182imq;" "        Wed, 1 Aug 2018 08:02:23 -0700 (PDT)" "X-Google-Smtp-Source: AAOMgpcdgBD4sDApBiF2DpKRfFZ9zi/4Ao32Igz9n8vT7EgE6InRoa7VZelMIik7OVmrFCRPDBde" ...
##  $              : NULL

We’re using unclass() since the str() output gets a bit crowded with all of the tm class attributes stuck in the output display.

The tm suite is designed for text mining. My task had nothing to do with text mining and I really just needed some header fields and body content in a data frame. If you’ve been working with R for a while, some things in the str() output will no doubt cause a bit of angst. For instance:

  • datetimestamp: POSIXlt[1:1]: POSIXlt and data frames really don’t mix well
  • description : chr(0) / origin : chr(0): zero-length character vectors ☹️
  • $ : NULL: a blank element name with a NULL value… I Don’t Even [2]
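To make the first two complaints concrete, here is a small base R sketch with made-up values (not taken from the corpus). It shows a zero-length field breaking data.frame() and the structural difference between POSIXlt and POSIXct.

```r
# A zero-length field cannot be recycled against a one-row column
res <- try(
  data.frame(author = "jim holtman", origin = character(0)),
  silent = TRUE
)
inherits(res, "try-error") # TRUE

# Swapping the empty vector for a typed NA gives a proper one-row data frame
ok <- data.frame(
  author = "jim holtman",
  origin = NA_character_,
  stringsAsFactors = FALSE
)
nrow(ok) # 1

# POSIXlt is a list of date-time components underneath,
# while POSIXct is a single number of seconds
lt <- as.POSIXlt("2018-08-01 15:01:17", tz = "UTC")
is.list(unclass(lt))                # TRUE
is.numeric(unclass(as.POSIXct(lt))) # TRUE
```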

The tm suite is also super opinionated and “helpfully” left out a ton of headers (though it did keep the source for the complete headers around). Still, we can roll up our sleeves and turn that into a data frame:

# dplyr supplies %>% and glimpse(), both used below
library(dplyr)

# helper function for cleaner/shorter code
`%|0|%` <- function(x, y) { if (length(x) == 0) y else x }

# might as well stay old-school since we're using tm
do.call(
  rbind.data.frame,
  lapply(mbox, function(.x) {

    # we have a few choices, but this one is pretty explicit abt what it does
    # so we'll likely be able to decipher it quickly in 2 years when/if we come
    # back to it

    data.frame(
      author = .x$meta$author %|0|% NA_character_,
      datetimestamp = as.POSIXct(.x$meta$datetimestamp %|0|% NA),
      description = .x$meta$description %|0|% NA_character_,
      heading = .x$meta$heading %|0|% NA_character_,
      id = .x$meta$id %|0|% NA_character_,
      language = .x$meta$language %|0|% NA_character_,
      origin = .x$meta$origin %|0|% NA_character_,
      header = I(list(.x$meta$header %|0|% NA_character_)),
      body = I(list(.x$content %|0|% NA_character_)),
      stringsAsFactors = FALSE
    )

  })
) %>%
  glimpse()
## Observations: 198
## Variables: 9
## $ author         "jim holtman ", "PIKAL Petr ...
## $ datetimestamp  2018-08-01 15:01:17, 2018-08-01 13:09:18, 2018-...
## $ description    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ heading        "Re: [R] read txt file - date - no space", "Re: ...
## $ id             ...
## $ language       "en", "en", "en", "en", "en", "en", "en", "en", ...
## $ origin         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ header         Delivere...., Delivere...., Delivere...., De...
## $ body           Try this...., SGkNCg0K...., Dear Pik...., De... 

That wasn’t a huge effort, but we would now have to re-process the headers and/or write a custom version of tm.plugin.mail::readMail() (the function source is very readable and extendable) to get any extra data out. Here’s what that might look like:

# Custom msg reader
read_mail <- function(elem, language, id) {

  # extract header val
  hdr_val <- function(src, pat) {
    gsub(
      sprintf("^%s: ", pat), "",
      grep(sprintf("^%s:", pat), src, value = TRUE, useBytes = TRUE)
    ) %|0|% NA
  }

  mail <- elem$content

  index <- which(mail == "")[1]
  header <- mail[1:index]
  mid <- hdr_val(header, "Message-ID")

  PlainTextDocument(
    x = mail[(index + 1):length(mail)],
    author = hdr_val(header, "From"),

    spam_score = hdr_val(header, "X-Spam-Score"), ### <<==== an extra header!

    datetimestamp = as.POSIXct(hdr_val(header, "Date"), format = "%a, %e %b %Y %H:%M:%S %z", tz = "GMT"),
    description = NA_character_,
    header = header,
    heading = hdr_val(header, "Subject"),
    id = if (!is.na(mid[1])) mid[1] else id, # hdr_val() returns NA, never length 0
    language = language,
    origin = hdr_val(header, "Newsgroups"),
    class = "MailDocument"
  )

}

VCorpus(
  MBoxSource("~/Data/test.mbox/mbox"),
  readerControl = list(reader = read_mail)
) -> mbox

str(unclass(mbox[[1]]$meta), 1)
## List of 9
##  $ author       : chr "jim holtman "
##  $ datetimestamp: POSIXct[1:1], format: "2018-08-01 15:01:17"
##  $ description  : chr NA
##  $ heading      : chr "Re: [R] read txt file - date - no space"
##  $ id           : chr ""
##  $ language     : chr "en"
##  $ origin       : chr NA
##  $ spam_score   : chr "-3.631"
##  $ header       : chr [1:145] "Delivered-To: bob@rud.is" "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id b1-v6csp950182imq;" "        Wed, 1 Aug 2018 08:02:23 -0700 (PDT)" "X-Google-Smtp-Source: AAOMgpcdgBD4sDApBiF2DpKRfFZ9zi/4Ao32Igz9n8vT7EgE6InRoa7VZelMIik7OVmrFCRPDBde" ...

If we wanted all the headers, there are even more succinct ways to solve for that use case.
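One way this could go, sketched in base R and not necessarily the approach hinted at above: unfold the continuation lines, then split every field at its first colon. The header lines below are a shortened, made-up sample, and parse_headers() is a hypothetical helper name.

```r
# Shortened, made-up header block (the trailing "" is the blank separator line)
header <- c(
  "Delivered-To: bob@example.com",
  "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id abc;",
  "        Wed, 1 Aug 2018 08:02:23 -0700 (PDT)",
  "Subject: Re: [R] read txt file - date - no space",
  "X-Spam-Score: -3.631",
  ""
)

parse_headers <- function(header) {
  header <- header[nzchar(header)]   # drop the blank separator line
  is_cont <- grepl("^[ \t]", header) # folded continuation lines start with whitespace
  grp <- cumsum(!is_cont)            # a continuation shares the previous field's group
  unfolded <- vapply(
    split(header, grp),
    function(x) paste(trimws(x), collapse = " "),
    character(1)
  )
  data.frame(
    name = sub(":.*$", "", unfolded),           # text before the first colon
    value = sub("^[^:]*:[ \t]*", "", unfolded), # text after the first colon
    stringsAsFactors = FALSE
  )
}

hdrs <- parse_headers(header)
hdrs$value[hdrs$name == "X-Spam-Score"] # "-3.631"
```

Repeated fields such as Received simply become multiple rows, which is easy to filter or nest later.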

Packaging up emails with a reticulated message.mbox

Since the default functionality of tm.plugin.mail::readMail() forced us to work a bit to get what we needed, there’s some justification in seeking out an alternative path. I’ve written about reticulate before and am including it in this post because the Python standard library module mailbox can also make quick work of mbox files.

Two pieces of advice I generally reiterate when I talk about reticulate are that I highly recommend using Python 3 (remember, it’s a fragmented ecosystem) and that I prefer specifying the specific target Python to use via the RETICULATE_PYTHON environment variable that I have in ~/.Renviron as RETICULATE_PYTHON=/usr/local/bin/python3.

Let’s bring the mailbox module into R:

library(reticulate)
library(tidyverse)

mailbox <- import("mailbox")

If you're unfamiliar with a Python module or object, you can get help right in R via reticulate::py_help(). Et sequitur [3]: py_help(mailbox) will bring up the text help for that module and py_help(mailbox$mbox) (remember, we swap out dots for dollars when referencing Python object components in R) will do the same for the mailbox.mbox class.

Text help is great and all, but we can also render it to HTML with this helper function:

py_doc <- function(x) {
  require("htmltools")
  require("reticulate")
  pydoc <- reticulate::import("pydoc")
  htmltools::html_print(
    htmltools::HTML(
      pydoc$render_doc(x, renderer=pydoc$HTMLDoc())
    )
  )
}

Here's what the text and HTML help for mailbox.mbox look like side-by-side:

We can also use a helper function to view the online documentation:

readthedocs <- function(obj, py_ver=3, check_keywords = "yes") {
  require("glue")
  query <- obj$`__name__`
  browseURL(
    glue::glue(
      "https://docs.python.org/{py_ver}/search.html?q={query}&check_keywords={check_keywords}"
    )
  )
}

Et sequitur: readthedocs(mailbox$mbox) will open the docs.python.org search results for that class in the browser.

Going back to the task at hand, we need to cycle through the messages and make a data frame for the bits we (well, I) care about. The reticulate package does an amazing job making Python objects first-class citizens in R, but Python objects may feel "opaque" to R users since we have to use the $ syntax to get to methods and values and, very often, familiar helpers such as str() are less than helpful on these objects. Let's try to look at the first message (remember, Python is 0-indexed):

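# open the archive with Python's mailbox.mbox class
# (this rebinds `mbox`, which held the tm corpus earlier)
mbox <- mailbox$mbox(path.expand("~/Data/test.mbox/mbox"))

msg1 <- mbox$get(0)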

str(msg1)

msg1

The output for those last two calls is not shown because they both are just a large text dump of the message source. #unhelpful

We can get more details, and we'll wrap some punctuation-filled calls in two, small helper functions that have names that will sound familiar:

pstr <- function(obj, ...) { str(obj$`__dict__`, ...) } # like 'str()'

pnames <- function(obj) { import_builtins()$dir(obj) } # like 'names()' but more complete

Let's see them in action:

pstr(msg1, 1) # we can pass any params str() will take
## List of 10
##  $ _from        : chr "jholtman@gmail.com Wed Aug 01 15:02:23 2018"
##  $ policy       :Compat32()
##  $ _headers     :List of 56
##  $ _unixfrom    : NULL
##  $ _payload     : chr "Try this:\n\n> library(lubridate)\n> library(tidyverse)\n> input <- read.csv(text =3D \"date,str1,str2,str3\n+ "| __truncated__
##  $ _charset     : NULL
##  $ preamble     : NULL
##  $ epilogue     : NULL
##  $ defects      : list()
##  $ _default_type: chr "text/plain"

pnames(msg1)
##  [1] "__bytes__"                 "__class__"                
##  [3] "__contains__"              "__delattr__"              
##  [5] "__delitem__"               "__dict__"                 
##  [7] "__dir__"                   "__doc__"                  
##  [9] "__eq__"                    "__format__"               
## [11] "__ge__"                    "__getattribute__"         
## [13] "__getitem__"               "__gt__"                   
## [15] "__hash__"                  "__init__"                 
## [17] "__init_subclass__"         "__iter__"                 
## [19] "__le__"                    "__len__"                  
## [21] "__lt__"                    "__module__"               
## [23] "__ne__"                    "__new__"                  
## [25] "__reduce__"                "__reduce_ex__"            
## [27] "__repr__"                  "__setattr__"              
## [29] "__setitem__"               "__sizeof__"               
## [31] "__str__"                   "__subclasshook__"         
## [33] "__weakref__"               "_become_message"          
## [35] "_charset"                  "_default_type"            
## [37] "_explain_to"               "_from"                    
## [39] "_get_params_preserve"      "_headers"                 
## [41] "_payload"                  "_type_specific_attributes"
## [43] "_unixfrom"                 "add_flag"                 
## [45] "add_header"                "as_bytes"                 
## [47] "as_string"                 "attach"                   
## [49] "defects"                   "del_param"                
## [51] "epilogue"                  "get"                      
## [53] "get_all"                   "get_boundary"             
## [55] "get_charset"               "get_charsets"             
## [57] "get_content_charset"       "get_content_disposition"  
## [59] "get_content_maintype"      "get_content_subtype"      
## [61] "get_content_type"          "get_default_type"         
## [63] "get_filename"              "get_flags"                
## [65] "get_from"                  "get_param"                
## [67] "get_params"                "get_payload"              
## [69] "get_unixfrom"              "is_multipart"             
## [71] "items"                     "keys"                     
## [73] "policy"                    "preamble"                 
## [75] "raw_items"                 "remove_flag"              
## [77] "replace_header"            "set_boundary"             
## [79] "set_charset"               "set_default_type"         
## [81] "set_flags"                 "set_from"                 
## [83] "set_param"                 "set_payload"              
## [85] "set_raw"                   "set_type"                 
## [87] "set_unixfrom"              "values"                   
## [89] "walk"

names(msg1)
##  [1] "add_flag"                "add_header"             
##  [3] "as_bytes"                "as_string"              
##  [5] "attach"                  "defects"                
##  [7] "del_param"               "epilogue"               
##  [9] "get"                     "get_all"                
## [11] "get_boundary"            "get_charset"            
## [13] "get_charsets"            "get_content_charset"    
## [15] "get_content_disposition" "get_content_maintype"   
## [17] "get_content_subtype"     "get_content_type"       
## [19] "get_default_type"        "get_filename"           
## [21] "get_flags"               "get_from"               
## [23] "get_param"               "get_params"             
## [25] "get_payload"             "get_unixfrom"           
## [27] "is_multipart"            "items"                  
## [29] "keys"                    "policy"                 
## [31] "preamble"                "raw_items"              
## [33] "remove_flag"             "replace_header"         
## [35] "set_boundary"            "set_charset"            
## [37] "set_default_type"        "set_flags"              
## [39] "set_from"                "set_param"              
## [41] "set_payload"             "set_raw"                
## [43] "set_type"                "set_unixfrom"           
## [45] "values"                  "walk"

# See the difference between pnames() and names()

setdiff(pnames(msg1), names(msg1))
##  [1] "__bytes__"                 "__class__"                
##  [3] "__contains__"              "__delattr__"              
##  [5] "__delitem__"               "__dict__"                 
##  [7] "__dir__"                   "__doc__"                  
##  [9] "__eq__"                    "__format__"               
## [11] "__ge__"                    "__getattribute__"         
## [13] "__getitem__"               "__gt__"                   
## [15] "__hash__"                  "__init__"                 
## [17] "__init_subclass__"         "__iter__"                 
## [19] "__le__"                    "__len__"                  
## [21] "__lt__"                    "__module__"               
## [23] "__ne__"                    "__new__"                  
## [25] "__reduce__"                "__reduce_ex__"            
## [27] "__repr__"                  "__setattr__"              
## [29] "__setitem__"               "__sizeof__"               
## [31] "__str__"                   "__subclasshook__"         
## [33] "__weakref__"               "_become_message"          
## [35] "_charset"                  "_default_type"            
## [37] "_explain_to"               "_from"                    
## [39] "_get_params_preserve"      "_headers"                 
## [41] "_payload"                  "_type_specific_attributes"
## [43] "_unixfrom"

Using just names() excludes the "hidden" builtins for Python objects, but knowing they are there and what they are can be helpful, depending on the program context.
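Judging from the output above, every name that names() leaves out begins with an underscore. Assuming that is the whole rule (an inference from this output, not something checked against the reticulate source), the same split can be reproduced on any character vector of attribute names. The vector below is a hand-picked subset of the pnames() output.

```r
# Hand-picked subset of the attribute names shown above
all_names <- c(
  "__class__", "__dict__", "_headers", "_payload",
  "get", "get_all", "walk"
)

hidden <- grepl("^_", all_names) # dunder and single-underscore names

all_names[!hidden] # "get" "get_all" "walk"
all_names[hidden]  # "__class__" "__dict__" "_headers" "_payload"
```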

Let's continue on the path to our messaging goal and see what headers are available. We'll use some domain knowledge about the _headers component, though we won't end up going that route to build a data frame:

map_chr(msg1$`_headers`, ~.x[[1]])
##  [1] "Delivered-To"               "Received"                  
##  [3] "X-Google-Smtp-Source"       "X-Received"                
##  [5] "ARC-Seal"                   "ARC-Message-Signature"     
##  [7] "ARC-Authentication-Results" "Return-Path"               
##  [9] "Received"                   "Received-SPF"              
## [11] "Authentication-Results"     "Received"                  
## [13] "X-Virus-Scanned"            "Received"                  
## [15] "Received"                   "Received"                  
## [17] "X-Virus-Scanned"            "X-Spam-Flag"               
## [19] "X-Spam-Score"               "X-Spam-Level"              
## [21] "X-Spam-Status"              "Received"                  
## [23] "Received"                   "Received"                  
## [25] "Received"                   "DKIM-Signature"            
## [27] "X-Google-DKIM-Signature"    "X-Gm-Message-State"        
## [29] "X-Received"                 "MIME-Version"              
## [31] "References"                 "In-Reply-To"               
## [33] "From"                       "Date"                      
## [35] "Message-ID"                 "To"                        
## [37] "X-Tag-Only"                 "X-Filter-Node"             
## [39] "X-Spam-Level"               "X-Spam-Status"             
## [41] "X-Spam-Flag"                "Content-Disposition"       
## [43] "Subject"                    "X-BeenThere"               
## [45] "X-Mailman-Version"          "Precedence"                
## [47] "List-Id"                    "List-Unsubscribe"          
## [49] "List-Archive"               "List-Post"                 
## [51] "List-Help"                  "List-Subscribe"            
## [53] "Content-Type"               "Content-Transfer-Encoding" 
## [55] "Errors-To"                  "Sender"

The message object does provide a get() method to retrieve header values, so we'll go that route to build our data frame. We'll make yet another helper, though, since something like msg1$get("this header does not exist") will return NULL just like list(a=1)$b would. We'll actually make two new helpers since we want to be able to safely work with the payload content and that means ensuring it's in UTF-8 encoding (mail systems are horribly diverse beasts and the R community is international and, remember, we're using R mailing list messages):

# execute an object's get() method and return a character string or NA if no value was present for the key
get_chr <- function(.x, .y) { as.character(.x[["get"]](.y)) %|0|% NA_character_ }

# get the object's value as a valid UTF-8 string
utf8_decode <- function(.x) { .x[["decode"]]("utf-8", "ignore") %|0|% NA_character_ }

We're also doing this because I get really tired of using the $ syntax.
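To see why the NULL case needs handling without involving Python at all, here is a base R stand-in: a plain list plays the part of the message, and a missing key yields NULL just as the get() call does. The list contents are made up for illustration.

```r
`%|0|%` <- function(x, y) { if (length(x) == 0) y else x }

# Plain R list standing in for a message's headers (hypothetical values)
hdrs <- list(subject = "Re: [R] read txt file - date - no space")

# A missing key gives NULL, and as.character(NULL) is a zero-length vector
raw <- as.character(hdrs[["x-spam-score"]])
length(raw) # 0

# The helper swaps that empty vector for a typed NA that fits in a data frame
safe <- raw %|0|% NA_character_
is.na(safe) # TRUE

# Present keys pass through untouched
as.character(hdrs[["subject"]]) %|0|% NA_character_
```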

We also want the message content or payload. Modern mail messages can be really complex structures with many multipart entities. To put it a different way, there may be HTML, RTF and plaintext versions of a message all in the same envelope. We want the plaintext ones so we'll have to iterate through any multipart messages to (hopefully) get to a plaintext version. Since this post is already pretty long and we ignored errors in the tm portion, I'll refrain from including any error handling code here as well.

map_df(1:py_len(mbox), ~{

  m <- mbox$get(.x-1) # python uses 0-index lists

  list(
    date = as.POSIXct(get_chr(m, "date"), format = "%a, %e %b %Y %H:%M:%S %z"),
    from = get_chr(m, "from"),
    to = get_chr(m, "to"),
    subj = get_chr(m, "subject"),
    spam_score = get_chr(m, "X-Spam-Score")
  ) -> mdf

  content_type <- m$get_content_maintype() %|0|% NA_character_
  mdf$content_type <- content_type # keep it as a column (it shows up in the glimpse() output)

  if (content_type[1] == "text") { # we don't want images
    while (m$is_multipart()) m <- m$get_payload()[[1]] # cycle through until we get to something we can use
    mtmp <- m$get_payload(decode = TRUE) # get the message text
    mdf$body <- utf8_decode(mtmp) # make it safe to use
  }

  mdf

}) -> mbox_df

glimpse(mbox_df)
## Observations: 198
## Variables: 7
## $ date          2018-08-01 11:01:17, 2018-08-01 09:09:18, 20...
## $ from          "jim holtman ", "PIKAL Pe...
## $ to            "diego.avesani@gmail.com, R mailing list ...
## $ subj          "Re: [R] read txt file - date - no space", "R...
## $ spam_score    "-3.631", "-3.533", "-3.631", "-3.631", "-3.5...
## $ content_type  "text", "text", "text", "text", "text", "text...
## $ body          "Try this:\n\n library(lubridate)\n library...
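The multipart loop above always descends into the first child, which may or may not be the plaintext part. A depth-first search is one alternative. The sketch below uses nested R lists as a stand-in for a MIME tree (these are not reticulate objects, and the type/parts/body field names are invented for the example).

```r
# Stand-in for a parsed MIME tree: plain R lists, NOT Python message objects
msg <- list(
  type = "multipart/mixed",
  parts = list(
    list(
      type = "multipart/alternative",
      parts = list(
        list(type = "text/html",  body = "<p>Try this:</p>"),
        list(type = "text/plain", body = "Try this:")
      )
    ),
    list(type = "image/png", body = NA_character_)
  )
)

# Return the body of the first text/plain leaf, or NULL if there is none
find_plain <- function(part) {
  if (identical(part$type, "text/plain")) return(part$body)
  for (child in part$parts) { # leaves have no parts, so the loop is skipped
    hit <- find_plain(child)
    if (!is.null(hit)) return(hit)
  }
  NULL
}

find_plain(msg) # "Try this:"
```

In this made-up tree, always taking the first child would land on the HTML part, which is the case the search avoids.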

FIN

By now, you've likely figured out this post really had nothing to do with reading mbox files. I mean, it did (and this was a task I had to do this week), but the real goal was to use a fairly basic task to help R folks edge a bit closer to becoming more friendly with Python in R. There are hundreds of thousands of Python packages out there and, while I'm one to wax poetic about having R or C[++]-backed R-native packages (and am wont to point out Python's egregiously prolific flaws), sometimes you just need to get something done quickly and wish to avoid reinventing the wheel. The reticulate package makes that eminently possible.

I'll be wrapping up some of the reticulate helper functions into a small package soon, so keep your eyes on RSS.


: You might want to read this even if you're not interested in mbox files. FIN (right above this note) might have some clues as to why.
1: yes, the section title was a stretch
2: am I doing this right, Mara? ;-)
3: Make Latin Great Again

Lynn (of TITAA and general NLP wizardry fame) was gracious enough to lend me a Bluesky invite, so I could claim my handle on yet-another social media site. I’m still wary of it (as noted in one of this week’s Drops), but the AT protocol, whilst super (lacking a better word) “verbose”, is pretty usable, especially thanks to Ilya Siamionau’s atproto AT Protocol SDK for Python.

Longtime readers know I am most certainly not going to use Python directly, as such practice has been found to cause early onset dementia. But, that module is so well done that I’ll gladly use it from within R.

I whipped up a small R script CLI that will fetch my feed and display it via the terminal. While I also use the web app and the Raycast extension to read the feed, it’s a billion degrees outside, so I used the need to stay indoors as an excuse to add this third way of checking what’s new.

Store your handle and app-password in BSKY_USER and BSKY_KEY, respectively, adjust the shebang accordingly, add execute permissions to the file and 💥, you can do the same.

#!/usr/local/bin/Rscript

suppressPackageStartupMessages({
  library(reticulate, quietly = TRUE, warn.conflicts = FALSE)
  library(lubridate, include.only = c("as.period", "interval"), quietly = TRUE, warn.conflicts = FALSE)
  library(crayon, quietly = TRUE, warn.conflicts = FALSE)
})

# Get where {reticulate} thinks your python is via py_config()$python
# then use that full path to install the module:
#   /full/path/to/python3 -m pip install atproto

atproto <- import("atproto")

client <- atproto$Client()

profile <- client$login(Sys.getenv("BSKY_USER"), Sys.getenv("BSKY_KEY"))

res <- client$bsky$feed$get_timeline(list(algorithm = "reverse-chronological"))

for (item in rev(res$feed)) (
  cat(
    blue(item$post$author$displayName), " • ",
    silver(gsub("\\.[[:digit:]]+", "", tolower(as.character(as.period(interval(item$post$record$createdAt, Sys.time()))))), "ago\n"),
    italic(paste0(strwrap(item$post$record$text, 50), collapse="\n")), "\n",
    ifelse(
      hasName(item$post$record$embed, "images"), 
      sprintf(
        green("[%s IMAGE%s]\n"), 
        length(item$post$record$embed$images),
        ifelse(length(item$post$record$embed$images) > 1, "s", "")
      ),
      ""
    ),
    ifelse(
      hasName(item$post$record$embed, "external"),
      yellow(sprintf(
        "\n%s\n   │\n%s\n\n",
        bold(paste0(strwrap(item$post$embed$external$title, 47, prefix = "   │"), collapse = "\n")),
        italic(paste0(strwrap(item$post$embed$external$description, 47, prefix = "   │"), collapse = "\n"))
      )),
      ""
    ),
    "\n",
    sep = ""
  )
)

This is a sample of the output, showing how it handles embeds and images:

feed output

Code is on GitLab.
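Since the script reads BSKY_USER and BSKY_KEY from the environment, a small guard before the login call can give a friendlier failure than an authentication error. This is a hedged sketch and not part of the script above; missing_env() is a hypothetical helper name.

```r
# Return the names of any environment variables that are unset or empty
missing_env <- function(vars) {
  vals <- Sys.getenv(vars, unset = "")
  vars[!nzchar(vals)]
}

absent <- missing_env(c("BSKY_USER", "BSKY_KEY"))

if (length(absent) > 0) {
  message("Please set: ", paste(absent, collapse = ", "))
}
```

In an Rscript CLI, the message() call could be followed by quit(status = 1) so the shell sees a failure.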

FIN

There’s tons of room for improvement in this hastily-crafted bit of code, and I’ll get it up on GitLab once their servers come back to life.

If you want to experience Bluesky but have no account, the firehose — which Elon charges $40K/month for on the birdsite — is free and can be accessed sans authentication:

library(reticulate)

atproto <- import("atproto")

hose <- atproto$firehose$FirehoseSubscribeReposClient()

handler <- \(msg) {
  res <- atproto$firehose$parse_subscribe_repos_message(msg)
  print(res) # you need to do a bit more than this to get the actual commit type and contents
}

hose$start(handler)

You can find me over on bsky at @hrbrmstr.dev.

The past two posts have (lightly) introduced how to use compiled Swift code in R, but they’ve involved a bunch of “scary” command line machinations and incantations.

One feature of {Rcpp} I’ve always 💙 is the cppFunction() (“r-lib” zealots have a similar cpp11::cpp_function()) which lets one experiment with C[++] code in R with as little friction as possible. To make it easier to start experimenting with Swift, I’ve built an extremely fragile swift_function() in {swiftr} that intends to replicate this functionality. Explaining it will be easier with an example.

Reading Property Lists in R With Swift

macOS relies heavily on property lists for many, many things. These can be plain text (XML) or binary files and there are command-line tools and Python libraries (usable via {reticulate}) that can read them along with the good ol’ XML::readKeyValueDB(). We’re going to create a Swift function to read property lists and return JSON which we can use back in R via {jsonlite}.

This time around there’s no need to create extra files. Just install {swiftr}, open your favorite R IDE, and enter the following (the exposition comes after the code):

library(swiftr)

swift_function(
  code = '

func ignored() {
  print("""
this will be ignored by swift_function() but you could use private
functions as helpers for the main public Swift function which will be 
made available to R.
""")
}  

@_cdecl ("read_plist")
public func read_plist(path: SEXP) -> SEXP {

  var out: SEXP = R_NilValue

  do {
    // read in the raw plist
    let plistRaw = try Data(contentsOf: URL(fileURLWithPath: String(cString: R_CHAR(STRING_ELT(path, 0)))))

    // convert it to a PropertyList  
    let plist = try PropertyListSerialization.propertyList(from: plistRaw, options: [], format: nil) as! [String:Any]

    // serialize it to JSON
    let jsonData = try JSONSerialization.data(withJSONObject: plist , options: .prettyPrinted)

    // setup the JSON string return
    String(data: jsonData, encoding: .utf8)?.withCString { 
      cstr in out = Rf_mkString(cstr) 
    }

  } catch {
    debugPrint("\\(error)")
  }

  return(out)

}
')

This new swift_function() function — for the moment (the API is absolutely going to change) — is defined as:

swift_function(
  code,
  env = globalenv(),
  imports = c("Foundation"),
  cache_dir = tempdir()
)

where:

  • code is a length 1 character vector of Swift code
  • env is the environment to expose the function in (defaults to the global environment)
  • imports is a character vector of any extra Swift frameworks that need to be imported
  • cache_dir is where all the temporary files will be created and the compiled dylib will be stored. It defaults to a temporary directory, so specify your own directory (that exists) if you want to keep the files around after you close the R session

Folks familiar with cppFunction() will notice some (on-purpose) similarities.

The function expects you to expose only one public Swift function which also (for the moment) needs to have the @_cdecl decorator before it. You can have as many other valid Swift helper functions as you like, but are restricted to one function that will be turned into an R function automagically.

In this example, swift_function() will see public func read_plist(path: SEXP) -> SEXP { and be able to identify

  • the function name (read_plist)
  • the number of parameters (they all need to be SEXP, for now)
  • the names of the parameters

A complete source file with all the imports will be created and a pre-built bridging header (which comes along for the ride with {swiftr}) will be included in the compilation step and a dylib will be built and loaded into the R session. Finally, an R function that wraps a .Call() will be created and will have the function name of the Swift function as well as all the parameter names (if any).

In the case of our example, above, the built R function is:

function(path) {
  .Call("read_plist", path)
}
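The signature-to-wrapper step described above can be approximated in a few lines of base R. To be clear, this is a guess at the general idea and not {swiftr}'s actual implementation. parse_swift_sig() and make_wrapper() are hypothetical names, and the regular expression only copes with simple `name: SEXP` parameters (no argument labels or defaults).

```r
sig <- "public func read_plist(path: SEXP) -> SEXP {"

# Pull the function name and the parameter names out of the Swift signature
parse_swift_sig <- function(sig) {
  pat <- "public func ([A-Za-z_][A-Za-z0-9_]*)\\(([^)]*)\\)"
  m <- regmatches(sig, regexec(pat, sig))[[1]]
  params <- trimws(strsplit(m[3], ",", fixed = TRUE)[[1]])
  list(name = m[2], params = sub("[ \t]*:.*$", "", params))
}

# Build an R function whose body is a .Call() to the exported symbol
make_wrapper <- function(info) {
  call_args <- paste(c(sprintf('"%s"', info$name), info$params), collapse = ", ")
  src <- sprintf(
    "function(%s) .Call(%s)",
    paste(info$params, collapse = ", "),
    call_args
  )
  eval(parse(text = src))
}

info <- parse_swift_sig(sig)
read_plist_stub <- make_wrapper(info)

names(formals(read_plist_stub)) # "path"
deparse(body(read_plist_stub))  # ".Call(\"read_plist\", path)"
```

The stub is only inspected here, never called, since calling it would need the compiled symbol to be loaded into the session.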

There’s a good chance you’re using RStudio, so we can test this with its property list, or you can substitute any other application’s property list (or any .plist you have) to test this out:

read_plist("/Applications/RStudio.app/Contents/Info.plist") %>% 
  jsonlite::fromJSON() %>% 
  str(1)
## List of 32
##  $ NSPrincipalClass                     : chr "NSApplication"
##  $ NSCameraUsageDescription             : chr "R wants to access the camera."
##  $ CFBundleIdentifier                   : chr "org.rstudio.RStudio"
##  $ CFBundleShortVersionString           : chr "1.4.1093-1"
##  $ NSBluetoothPeripheralUsageDescription: chr "R wants to access bluetooth."
##  $ NSRemindersUsageDescription          : chr "R wants to access the reminders."
##  $ NSAppleEventsUsageDescription        : chr "R wants to run AppleScript."
##  $ NSHighResolutionCapable              : logi TRUE
##  $ LSRequiresCarbon                     : logi TRUE
##  $ NSPhotoLibraryUsageDescription       : chr "R wants to access the photo library."
##  $ CFBundleGetInfoString                : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ NSLocationWhenInUseUsageDescription  : chr "R wants to access location information."
##  $ CFBundleInfoDictionaryVersion        : chr "6.0"
##  $ NSSupportsAutomaticGraphicsSwitching : logi TRUE
##  $ CSResourcesFileMapped                : logi TRUE
##  $ CFBundleVersion                      : chr "1.4.1093-1"
##  $ OSAScriptingDefinition               : chr "RStudio.sdef"
##  $ CFBundleLongVersionString            : chr "1.4.1093-1"
##  $ CFBundlePackageType                  : chr "APPL"
##  $ NSContactsUsageDescription           : chr "R wants to access contacts."
##  $ NSCalendarsUsageDescription          : chr "R wants to access calendars."
##  $ NSMicrophoneUsageDescription         : chr "R wants to access the microphone."
##  $ CFBundleDocumentTypes                :'data.frame':  16 obs. of  8 variables:
##  $ NSPhotoLibraryAddUsageDescription    : chr "R wants write access to the photo library."
##  $ NSAppleScriptEnabled                 : logi TRUE
##  $ CFBundleExecutable                   : chr "RStudio"
##  $ CFBundleSignature                    : chr "Rstd"
##  $ NSHumanReadableCopyright             : chr "RStudio 1.4.1093-1, © 2009-2020 RStudio, PBC"
##  $ CFBundleName                         : chr "RStudio"
##  $ LSApplicationCategoryType            : chr "public.app-category.developer-tools"
##  $ CFBundleIconFile                     : chr "RStudio.icns"
##  $ CFBundleDevelopmentRegion            : chr "English"

FIN

A source_swift() function is on the horizon as is adding a ton of checks/validations to swift_function(). I’ll likely be adding some of the SEXP and R Swift utility functions I’ve demonstrated in the [unfinished] book to make it fairly painless to interface Swift and R code in this new and forthcoming function.

As usual, kick the tyres, submit feature requests and bugs in any forum that’s comfortable and stay strong, wear a 😷, and stay socially distanced when out and about.

Over Christmas break I teased some screencaps:

of some almost-natural “R” looking code (this is a snippet):

Button("Run") {
  do { // calls to R can fail so there are lots of "try"s; poking at less ugly alternatives

    // handling dots in named calls is a WIP
    _  = try R.evalParse("options(tidyverse.quiet = TRUE )")

    // in practice this wld be called once in a model
    try R.library("ggplot2")
    try R.library("hrbrthemes")
    try R.library("magick")

    // can mix initialization of an R list with Swift and R objects
    let mvals: RObject = [
      "month": [ "Jan", "Feb", "Mar", "Apr", "May", "Jun" ],
      "value": try R.sample(100, 6)
    ]

    // ggplot2! `mvals` is above, `col.hexValue` comes from the color picker
    // can't do R.as.data.frame b/c "dots" so this is a deliberately exposed alternate call
    let gg = try R.ggplot(R.as_data_frame(mvals)) +
      R.geom_col(R.aes_string("month", "value"), fill: col.hexValue) + // supports both [un]named
      R.scale_y_comma() +
      R.labs(
        x: rNULL, y: "# things",
        title: "Monthly Bars"
      ) +
      R.theme_ipsum_gs(grid: "Y")

    // an alternative to {magick} could be getting raw SVG from {svglite} device
    // we get Image view width/height and pass that to {magick}
    // either beats disk/ssd round-trip
    let fig = try R.image_graph(
      width: Double(imageRect.width), 
      height: Double(imageRect.height), 
      res: 144
    )

    try R.print(gg)
    _ = R.dev_off() // can't do R.dev.off b/c "dots" so this is a deliberately exposed alternate call

    let res = try R.image_write(fig, path: rNULL, format: "png")

    imgData = Data(res) // "imgData" is a reactive SwiftUI bound object; when it changes Image does too

  } catch {
  }

}

that works in Swift as part of a SwiftUI app that displays a ggplot2 plot inside of a macOS application.

It doesn’t shell out to R, but uses Swift 5’s native abilities to interface with R’s C interface.

I’m not ready to reveal that SwiftR code/library just yet (break’s over and the core bits still need some tweaking) but I can provide some interim resources with an online book about working with R’s C interface from Swift on macOS. It is uninspiringly called SwiftR — Using R from Swift.

There are, at present, six chapters that introduce the Swift+R concepts via command line apps. These aren’t terribly useful (shebanged R scripts work just fine, #tyvm) in and of themselves, but command line machinations are a much lower barrier to entry than starting right in with SwiftUI (that starts in chapter seven).

FIN

If you’ve wanted a reason to burn ~20GB of drive space with an Xcode installation and start to learn Swift (or learn more about Swift) then this is a resource for you.

The topics in the chapters are also a fairly decent (albeit incomplete) overview of R’s C interface and also how to work with C code from Swift in general.

So, take advantage of the remaining pandemic time and give it a 👀.

Feedback is welcome in the comments or the book code repo (book source repo is in progress).

Hope everyone has a safe and strong new year!

I caught this post on The Surprising Number Of Programmers Who Can’t Program from the Hacker News RSS feed. Said post links to another, classic post on the same subject and you should read both before continuing.

Back? Great! Let’s dig in.

Why does hrbrmstr care about this?

Offspring #3 completed his Freshman year at UMaine Orono last year but wanted to stay academically active over the summer (he’s majoring in astrophysics and knows he’ll need some programming skills to excel in his field) and took an introductory C++ course from UMaine that was held virtually, with 1 lecture per week (14 weeks IIRC) and 1 assignment due per week with no other grading.

After seeing what passes for a standard (UMaine is not exactly on the top list of institutions to attend if one wants to be a computer scientist) intro C++ course, I’m not really surprised “Johnny can’t code”. Thirteen weeks in, the class finally started covering OO concepts, and the course is ending with a scant intro to polymorphism. Prior to this, most of the assignments were just variations on each other (read from stdin, loop with conditionals, print output) with no program going over 100 LoC (that includes comments and spacing). This wasn’t a “compsci for non-compsci majors” course, either. Anyone majoring in an area of study that requires programming could have taken this course to fulfill one of the requirements, and they’d be set on a path of forever using StackOverflow copypasta to try to get their future work done.

I’m fairly certain most of #3’s classmates could not program fizzbuzz without googling and even more certain most have no idea they weren’t really “coding in C++” most of the course.

If this is how most other middling colleges are teaching the basics of computer programming, it’s no wonder employers are having a difficult time finding qualified talent.

You have an “R” tag — actually, a few language tags — on this post, so where’s the code?

After the article triggered the lament in the previous section, a crazy, @coolbutuseless-esque thought came into my head: “I wonder how many different language FizzBuzz solutions can be created from within R?”.

The criterion for that notion is/was that there needed to be some Rcpp::cppFunction(), reticulate::py_run_string(), V8 context eval()-type way to have the code in-R but then run through those far-superior-to-any-other-language’s polyglot extensibility constructs.

Before getting lost in the weeds, there were some other thoughts on language inclusion:

  • Should Java be included? I :heart: {rJava}, but cat()-ing Java code out and running system() to compile it first seemed like cheating (even though that’s kinda just what cppFunction() does). Toss a note into a comment if you think a Java example should be added (or add said Java example in a comment or link to it in one!).
  • I think Julia should be in this example list but do not care enough about it to load {JuliaCall} and craft an example (again, link or post one if you can crank it out quickly).
  • I think Lua could be in this example given the existence of {luar}. If you agree, give it a go!
  • Go & Rust compiled code can also be called in R (thanks to Romain & Jeroen) once they’re turned into C-compatible libraries. Should this polyglot example show this as well?
  • What other languages am I missing?

The aforementioned “weeds”

One criterion for each language fizzbuzz example is that it needs to be readable, not hacky-cool. That doesn’t mean the solutions can’t still be a bit creative. We’ll lightly go through each one I managed to code up. First we’ll need some helpers:

suppressPackageStartupMessages({
  library(purrr)
  library(dplyr)
  library(reticulate)
  library(V8)
  library(Rcpp)
})

The R, JavaScript, and Python implementations are all in the microbenchmark() call way down below. Up here are C and C++ versions. The C implementation is boring and straightforward, but we’re using Rprintf() so we can capture the output vs have any output buffering woes impact the timings.

cppFunction('
void cbuzz() {

  // super fast plain C

  for (unsigned int i=1; i<=100; i++) {
    if      (i % 15 == 0) Rprintf("FizzBuzz\\n");
    else if (i %  3 == 0) Rprintf("Fizz\\n");
    else if (i %  5 == 0) Rprintf("Buzz\\n");
    else Rprintf("%d\\n", i);
  }

}
')

The cbuzz() example is just fine even in C++ land, but we can take advantage of some C++11 vectorization features to stay formally in C++-land and play with some fun features like lambdas. This will be a bit slower than the C version plus consume more memory, but shows off some features some folks might not be familiar with:

cppFunction('
void cppbuzz() {

  std::vector<int> numbers(100); // will eventually be 1:100
  std::iota(numbers.begin(), numbers.end(), 1); // kinda sorta equivalent to our R 1:100 but not exactly

  std::vector<std::string> fb(100); // fizzbuzz strings holder

  // transform said 1..100 into fizzbuzz strings
  std::transform(
    numbers.begin(), numbers.end(), 
    fb.begin(),
    [](int i) -> std::string { // lambda expressions are cool like a fez
        if      (i % 15 == 0) return("FizzBuzz");
        else if (i %  3 == 0) return("Fizz");
        else if (i %  5 == 0) return("Buzz");
        else return(std::to_string(i));
    }
  );

  // round it out with use of for_each and another lambda
  // this turns out to be slightly faster than range-based for-loop
  // collection iteration syntax.
  std::for_each(
    fb.begin(), fb.end(), 
    [](std::string s) { Rcout << s << std::endl; }
  );

}
', 
plugins = c('cpp11'))

Both of those functions are now available to R.

Next, we need to prepare to run JavaScript and Python code, so we’ll initialize both of those environments:

ctx <- v8()

py_config() # not 100% necessary but I keep my needed {reticulate} options in env vars for reproducibility

Then, we tell R to capture all the output. Using sink() is a bit better than capture.output() in this use-case since it avoids nesting calls, and we need to handle Python stdout the same way py_capture_output() does to be fair in our measurements:

output_tools <- import("rpytools.output")
restore_stdout <- output_tools$start_stdout_capture()

cap <- rawConnection(raw(0), "r+")
sink(cap)

There are a few implementations below across the tidy and base R multiverse. Some use vectorization; some do not. This will let us compare overall “speed” of solution. If you have another suggestion for a readable solution in R, drop a note in the comments:

microbenchmark::microbenchmark(

  # tidy_vectors_case() is slowest but you get all sorts of type safety 
  # for free along with very readable idioms.

  tidy_vectors_case = map_chr(1:100, ~{ 
    case_when(
      (.x %% 15 == 0) ~ "FizzBuzz",
      (.x %%  3 == 0) ~ "Fizz",
      (.x %%  5 == 0) ~ "Buzz",
      TRUE ~ as.character(.x)
    )
  }) %>% 
    cat(sep="\n"),

  # tidy_vectors_if() has old-school if/else syntax but still
  # forces us to ensure type safety which is cool.

  tidy_vectors_if = map_chr(1:100, ~{ 
    if (.x %% 15 == 0) return("FizzBuzz")
    if (.x %%  3 == 0) return("Fizz")
    if (.x %%  5 == 0) return("Buzz")
    return(as.character(.x))
  }) %>% 
    cat(sep="\n"),

  # walk() just replaces `for` but stays in vector-land which is cool

  tidy_walk = walk(1:100, ~{
    if      (.x %% 15 == 0) cat("FizzBuzz\n")
    else if (.x %%  3 == 0) cat("Fizz\n")
    else if (.x %%  5 == 0) cat("Buzz\n")
    else cat(.x, "\n", sep="")
  }),

  # vapply() gets us some similar type assurance, albeit with arcane syntax

  base_proper = vapply(1:100, function(.x) {
    if (.x %% 15 == 0) return("FizzBuzz")
    if (.x %%  3 == 0) return("Fizz")
    if (.x %%  5 == 0) return("Buzz")
    return(as.character(.x))
  }, character(1), USE.NAMES = FALSE) %>% 
    cat(sep="\n"),

  # sapply() is def lazy but this can outperform vapply() in some
  # circumstances (like this one) and is a bit less arcane.

  base_lazy = sapply(1:100, function(.x) {
    if (.x %% 15 == 0)  return("FizzBuzz")
    if (.x %%  3 == 0) return("Fizz")
    if (.x %%  5 == 0) return("Buzz")
    return(.x)
  }, USE.NAMES = FALSE) %>% 
    cat(sep="\n"),

  # for loops...ugh. might as well just use C

  base_for = for(.x in 1:100) {
    if      (.x %% 15 == 0) cat("FizzBuzz\n")
    else if (.x %%  3 == 0) cat("Fizz\n")
    else if (.x %%  5 == 0) cat("Buzz\n")
    else cat(.x, "\n", sep="")
  },

  # ok, we'll just use C!

  c_buzz = cbuzz(),

  # we can go back to vector-land in C++

  cpp_buzz = cppbuzz(),

  # some <3 for javascript

  js_readable = ctx$eval('
for (var i=1; i <101; i++){
  if      (i % 15 == 0) console.log("FizzBuzz")
  else if (i %  3 == 0) console.log("Fizz")
  else if (i %  5 == 0) console.log("Buzz")
  else console.log(i)
}
'),

  # icky readable, non-vectorized python

  python = reticulate::py_run_string('
for x in range(1, 101):
  if (x % 15 == 0):
    print("FizzBuzz")
  elif (x % 5 == 0):
    print("Buzz")
  elif (x % 3 == 0):
    print("Fizz")
  else:
    print(x)
')

) -> res

Turn off output capturing:

sink()
if (!is.null(restore_stdout)) invisible(output_tools$end_stdout_capture(restore_stdout))
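
The benchmark above never reads back what it captured. As a minimal sketch of the same sink()-to-raw-connection idiom on a fresh connection (cap_demo and captured_lines are names made up for this illustration, not part of the original code), the captured text could be recovered like so:

```r
# Hypothetical stand-alone demo; `cap_demo` is not the `cap` used in the benchmark
cap_demo <- rawConnection(raw(0), "r+")

sink(cap_demo)   # divert R output into the raw connection
cat("Fizz\n")
cat("Buzz\n")
sink()           # restore normal output

# Pull the bytes out *before* closing the connection, then split into lines
captured <- rawToChar(rawConnectionValue(cap_demo))
close(cap_demo)

captured_lines <- strsplit(captured, "\n", fixed = TRUE)[[1]]
print(captured_lines)
```

Reading the value before close() matters, since the buffer goes away with the connection.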

We used microbenchmark(), so here are the results:

res
## Unit: microseconds
##               expr       min         lq        mean     median         uq       max neval   cld
##  tidy_vectors_case 20290.749 21266.3680 22717.80292 22231.5960 23044.5690 33005.960   100     e
##    tidy_vectors_if   457.426   493.6270   540.68182   518.8785   577.1195   797.869   100  b   
##          tidy_walk   970.455  1026.2725  1150.77797  1065.4805  1109.9705  8392.916   100   c  
##        base_proper   357.385   375.3910   554.13973   406.8050   450.7490 13907.581   100  b   
##          base_lazy   365.553   395.5790   422.93719   418.1790   444.8225   587.718   100 ab   
##           base_for   521.674   545.9155   576.79214   559.0185   584.5250   968.814   100  b   
##             c_buzz    13.538    16.3335    18.18795    17.6010    19.4340    33.134   100 a    
##           cpp_buzz    39.405    45.1505    63.29352    49.1280    52.9605  1265.359   100 a    
##        js_readable   107.015   123.7015   162.32442   174.7860   187.1215   270.012   100 ab   
##             python  1581.661  1743.4490  2072.04777  1884.1585  1985.8100 12092.325   100    d 

Said results are 🤷🏻‍♀️ since this is a toy example, but I wanted to show that Jeroen’s {V8} can be super fast, especially when there’s no value marshaling to be done and that some things you may have thought should be faster, aren’t.
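
To make the gaps easier to eyeball, here is a small sketch that divides each median by the plain C median. The numbers are copied by hand from the median column of the table above; the variable names (medians, relative) are assumptions for illustration only:

```r
# Median timings (microseconds) copied from the `median` column of the results above
medians <- c(
  tidy_vectors_case = 22231.5960,
  tidy_vectors_if   =   518.8785,
  tidy_walk         =  1065.4805,
  base_proper       =   406.8050,
  base_lazy         =   418.1790,
  base_for          =   559.0185,
  c_buzz            =    17.6010,
  cpp_buzz          =    49.1280,
  js_readable       =   174.7860,
  python            =  1884.1585
)

# How many times slower than the plain C version each approach was on this run
relative <- sort(round(medians / medians[["c_buzz"]], 1))
print(relative)
```

By that measure the {V8} version sits at roughly ten times the C median, ahead of every pure-R variant in this particular run.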

FIN

Definitely add links or code for changes or additions (especially the aforementioned other languages). Hopefully my lament about the computer science program at UMaine is not universally true for all the programming courses there.
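
In that spirit, here is one possible addition: a fully vectorized base R variant that builds the whole character vector first and prints once. It was not part of the benchmark above, so no timing is claimed for it, and fizzbuzz_vec() is a name invented here:

```r
# Hypothetical helper, not from the original benchmark
fizzbuzz_vec <- function(n = 100L) {
  x   <- seq_len(n)
  out <- as.character(x)          # default: the number itself
  out[x %% 3  == 0] <- "Fizz"
  out[x %% 5  == 0] <- "Buzz"
  out[x %% 15 == 0] <- "FizzBuzz" # assigned last so it wins for multiples of both
  out
}

cat(fizzbuzz_vec(15), sep = "\n")
```

The assignment order does the work the if/else chain does in the other versions: multiples of 15 are overwritten last.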

There are two fledgling rJava-based R packages that enable working with the AWS SDK for Athena:

They’re both needed to conform with the way CRAN likes rJava-based packages that also have large JAR dependencies to be submitted. The goal is to eventually have wrappers for anything R folks need under the AWS Java SDK menu.

All package pairs will eventually cohabitate under the Cloudy R Project once each gets to 90% API coverage, passes CRAN checks and has passing Travis checks.

One thing I did get working right up front was the asynchronous dplyr chain query execution collect_async(), so if you need that and would rather not use reticulated wrappers, now’s your chance.

You would be correct in assuming this is an offshoot of the recent work on updating metis. My primary impetus for this is to remove the reticulate dependency from our Dockerized production setups but I also have discovered I like the Java libraries more than the boto3-based ones (not really a shocker there if you know my views on Python). As a result I should be able to quickly wrap most any library you may need (see below).

FIN

The next major wrapper coming is S3 (there are bits of it implemented in awsathena now but that’s temporary) and — for now — you can toss a comment here or file an issue in any of the social coding sites you like for priority wrapping of other AWS Java SDK libraries. Also, if you want some experience working with rJava packages in a judgement-free zone, drop a note into these or any new AWS rJava-based package repos and I’ll gladly walk you through your first PR.

A soon-to-be organized list of R packages for use in cybersecurity research, DFIR, risk analysis, metadata collection, document/data processing and more (not just by me, but the current list is made up of ones I’ve created or resurrected). If you want your packages to appear here, add the r-cyber topic to GitLab or GitHub repos and this list will be automagically periodically updated.

  • AnomalyDetection : ⏰ Anomaly Detection with R (separately maintained fork of Twitter’s AnomalyDetection) (r, rstats, anomaly-detection, anomalydetection, r-cyber)
  • aquarium : Validate ‘Phishing’ ‘URLs’ with the ‘PhishTank’ Service (r, rstats, phishing, phishtank, r-cyber)
  • astools : ⚒ Tools to Work With Autonomous System (‘AS’) Network and Organization Data (r, rstats, autonomous-systems, routeviews, r-cyber)
  • blackmagic : Automagically Convert XML to JSON and JSON to XML (r, rstats, xmltojson, xml-to-json, xml-js, json-to-xml-converter, json-to-xml, jsontoxml, r-cyber)
  • burrp : Tools to Import and Process ‘PortSwigger’ ‘Burp’ Proxy Data (rstats, r, burpsuite, proxy, har, r-cyber)
  • carbondater : Estimate the Age of Web Resources (r, rstats, r-cyber)
  • cc : ⛏ Extract metadata of a specific target based on the results of “commoncrawl.org” (r, rstats, common-crawl, domains, urls, reconnaissance, recon, r-cyber)
  • cdx : Query Web Archive Crawl Indexes (‘CDX’) (r, rstats, cdx, web-archives, r-cyber)
  • censys : R interface to the Censys “cyber”/scans search engine • https://www.censys.io/tutorial (censys-data, censys-api, r, rstats, r-cyber)
  • clandnstine : ㊙️ Perform ‘DNS’ over ‘TLS’ Queries (r, rstats, dns-over-tls, getdnsapi, getdns, dns, r-cyber)
  • crafter : An R package to work with PCAPs (r, rstats, pcap, pcap-files, pcap-analyzer, packet-capture, r-cyber)
  • curlconverter : ➰➡️➖ Translate cURL command lines into parameters for use with httr or actual httr calls (R) (curl, httr, r, rstats, r-cyber)
  • curlparse : Parse ‘URLs’ with ‘libcurl’ (r, rstats, libcurl, url-parse, r-cyber)
  • cymruservices : package that provides interfaces to various Team Cymru Services (r, rstats, team-cymru-webservice, malware-hash-registry, bogons, r-cyber)
  • czdaptools : R tools for downloading zone data from ICANN’s CZDS application (r, rstats, r-cyber, czdap)
  • decapitated : Headless ‘Chrome’ Orchestration in R (r, rstats, headless-chrome, web-scraping, javascript, r-cyber)
  • devd : Install, Start and Stop ‘devd’ Instances from R (r, rstats, devd, r-cyber)
  • dnsflare : ❓ Query ‘Cloudflare’ Domain Name System (‘DNS’) Servers over ‘HTTPS’ (r, dns, dns-over-https, cloudflare, cloudflare-1-dot-1-dot-1-dot-1, 1-dot-1-dot-1-dot-1, rstats, r-cyber)
  • dnshelpers : ℹ Tools to Process ‘DNS’ Response Data (r, rstats, dns, dns-parser, r-cyber)
  • domaintools : R API interface to the DomainTools API (r, rstats, domaintools, domaintools-api, r-cyber)
  • dshield : Query ‘SANS’ ‘DShield’ ‘API’ (r, rstats, dshield, isc, r-cyber)
  • exiv : Read and Write ‘Exif’ Image/Media Tags with R (r, rstats, exiv2-library, exiv2, exif, r-cyber)
  • gdns : Tools to work with the Google DNS over HTTPS API in R (spf-record, google-dns, dns, rstats, r, r-cyber)
  • gepetto : ScrapingHub Splash-like REST API for Headless Chrome (headless-chrome, nodejs, node-js, npm, hapi, hapijs, splash, r-cyber)
  • greynoise : Query ‘GreyNoise Intelligence’ ‘API’ in R (r, rstats, r-cyber, greynoise-intelligence)
  • hgr : Tools to Work with the ‘Postlight’ ‘Mercury’ ‘API’ — https://mercury.postlight.com/web-parser/ — in R (r, rstats, postlight-mercury-api, postlight, r-cyber)
  • hormel : ⚙️ Retrieve and Process ‘Spamhaus’ Zone/Host Metadata (r, rstats, spamhaus, spam, block-list, r-cyber)
  • htmltidy : Tidy Up and Test XPath Queries on HTML and XML Content in R (r, rstats, html, xml, r-cyber)
  • htmlunit : ☕️ Tools to Scrape Dynamic Web Content via the ‘HtmlUnit’ Java Library (r, rstats, htmlunit, web-scraping, javascript, r-cyber)
  • htmlunitjars : ☕️ Java Archive Wrapper Supporting the ‘htmlunit’ Package (r, rstats, rjava, htmlunit, web-scraping, r-cyber)
  • ipapi : An R package to geolocate IPv4/6 addresses and/or domain names using ip-api.com’s API (r, rstats, r-cyber, ipapi)
  • ipinfo : ℹ Collect Metadata on ‘IP’ Addresses and Autonomous Systems (r, rstats, ipv4, ipv6, asn, ipinfo, r-cyber)
  • ipstack : ⛏ Tools to Query ‘IP’ Address Information from the ‘ipstack’ ‘API’ (r, rstats, ipstack, ip-reputation, ip-geolocation, r-cyber)
  • iptools : A toolkit for manipulating, validating and testing IP addresses and ranges, along with datasets relating to IP add… (iptools, rstats, r, ipv4-address, r-cyber)
  • iptrie : Efficiently Store and Query ‘IPv4’ Internet Addresses with Associated Data (r, rstats, ip-address, cidr, trie, r-cyber, ipv4-trie, ipv4-address, internet-address, ip-trie)
  • jerichojars : Java Archive Wrapper Supporting the ‘jericho’ R Package (r, rstats, r-cyber, jeric)
  • jwatr : Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit in R (r, rstats, java, warc, r-cyber)
  • longurl : ℹ️ Small R package for no-API-required URL expansion (r, rstats, url-shortener, url, r-cyber)
  • mactheknife : Various ‘macOS’-oriented Tools and Utilities in R (r, rstats, reticulate, python, ds-store, macos, r-cyber)
  • MACtools : ⬢ Tools to Work with Media Access Control (‘MAC’) Addresses (r, rstats, mac-address, r-cyber, mac-age-database)
  • mhn : Analyze and Visualize Data from Modern Honey Network Servers with R (r, rstats, mhn, r-cyber, honeypot)
  • middlechild : R interface to MITM (r, rstats, r-cyber, mitm, mitmproxy)
  • mqtt : Interoperate with ‘MQTT’ Message Brokers with R (r, rstats, mqtt, mosquitto, r-cyber)
  • mrt : Tools to Retrieve and Process ‘BGP’ Files in R (r, rstats, mrt, rib, router, bgp, r-cyber)
  • msgxtractr : Extract contents from Outlook ‘.msg’ files in R (r, rstats, outlook, msg, attachment, r-cyber)
  • myip : Tools to Determine Your Public ‘IP’ Address in R (r, rstats, ip-address, ip-info, httpbin, icanhazip, ip-echo, amazon-checkip, akamai-whatismyp, opendns-checkip, r-cyber)
  • ndjson : ♨️ Wicked-Fast Streaming ‘JSON’ (‘ndjson’) Reader in R (r, ndjson, rstats, json, r-cyber)
  • newsflash : Tools to Work with the Internet Archive and GDELT Television Explorer in R (internet-archive, gdelt-television-explorer, r, rstats, r-cyber)
  • nmapr : Perform Network Discovery and Security Auditing with ‘nmap’ in R (r, rstats, nmap, r-cyber)
  • ooni : Tools to Access the Open Observatory of Network Interference (‘OONI’) (r, rstats, ooni, censorship, internet-measurements, r-cyber)
  • opengraph : Tools to Mine ‘Open Graph’-like Tags From ‘HTML’ Content (r, rstats, opengraph, r-cyber)
  • osqueryr : ⁇ ‘osquery’ ‘DBI’ and ‘dbplyr’ Interface for R (r, rstats, osquery, dplyr, tidyverse, dbi, r-cyber)
  • passivetotal : Useful tools for working with the PassiveTotal API in R (r, rstats, passivetotal, passive-dns-data, r-cyber)
  • passwordrandom : Access the PasswordRandom.com API in R (r, rstats, r-cyber)
  • pastebin : Tools to work with the pastebin API in R (rstats, r, pastebin, pastebin-client, r-cyber)
  • pdfbox : ◻️ Create, Manipulate and Extract Data from PDF Files (R Apache PDFBox wrapper) (r, rstats, pdfbox, pdf-document, pdf-files, pdfbox-wrapper, r-cyber)
  • pdfboxjars : Java ‘.jar’ Files for ‘pdfbox’ (r, rstats, java, r-cyber)
  • porc : Tools to Work with ‘Snort’ Rules, Logs and Data (r, rstats, snort, snort-rules, cybersecurity, cyber, r-cyber)
  • publicwww : Query the ‘PublicWWW’ Source Code Search Engine in R (r, rstats, publicwww, r-cyber)
  • radb : Tools to Query the ‘Merit’ ‘RADb’ Network Route Server (r, rstats, merit, radb, r-cyber)
  • rappalyzer : :: WIP :: R port of Wappalyzer (r, rstats, wappalyzer, r-cyber)
  • reapr : Reap Information from Websites (r, rstats, web-scraping, r-cyber, rvest, html, xpath)
  • rgeocodio : Tools to Work with the https://geocod.io/ API (r, rstats, geocodio, geocoding, reverse-geocoding, reverse-geocode, r-cyber)
  • ripestat : Query and Retrieve Data from the ‘RIPEstat’ ‘API’ (r, rstats, ripe, ripestat, r-cyber)
  • robotify : Browser extension to check for and preview a site’s robots.txt in a new tab (if it exists) (browser-extension, robots-txt, r-cyber)
  • rpdns : R port of CIRCL.LU’s PyPDNS Python module https://github.com/CIRCL/PyPDNS (r, rstats, passive-dns, circl-lu, dns, r-cyber)
  • rrecog : Pattern Recognition for Hosts, Services and Content (r, rstats, recognizer, rapid7, r-cyber)
  • scamtracker : R package interface to the BBB ScamTracker : https://www.bbb.org/scamtracker/us (r, rstats, r-cyber, scamtracker)
  • securitytrails : Tools to Query the ‘SecurityTrails’ ‘API’ (r, rstats, securitytrails, cybersecurity, ipv4, ipv6, threat-intelligence, domain-name, whois-lookup, r-cyber)
  • securitytxt : Identify and Parse Web Security Policies Files in R (r, rstats, securitytxt, r-cyber)
  • sergeant : Tools to Transform and Query Data with ‘Apache’ ‘Drill’ (drill, parquet-files, sql, dplyr, r, rstats, apache-drill, r-cyber)
  • shodan : R package to work with the Shodan API (r, rstats, shodan, shodan-api, r-cyber)
  • simplemagic : Lightweight File ‘MIME’ Type Detection Based On Contents or Extension (r, rstats, magic, mime, mime-types, file-types, r-cyber)
  • speedtest : Measure upload/download speed/bandwidth for your network with R (r, rstats, bandwidth-test, bandwidth, bandwidth-monitor, r-cyber)
  • spiderbar : Lightweight R wrapper around rep-cpp for robot.txt (Robots Exclusion Protocol) parsing and path testing in R (r, rstats, robots-exclusion-protocol, robots-txt, r-cyber)
  • splashr : Tools to Work with the ‘Splash’ JavaScript Rendering Service in R (r, rstats, web-scraping, splash, selenium, phantomjs, har, r-cyber)
  • ssllabs : Tools to Work with the SSL Labs API in R (r, rstats, r-cyber, ssllabs, ssl-labs)
  • threatcrowd : R tools to work with the ThreatCrowd API (r, rstats, threatcrowd, r-cyber)
  • tidyweb : Easily Install and Load Modern Web-Scraping Packages (r, rstats, r-cyber)
  • tlsh : #️⃣ Local Sensitivity Hashing Using the ‘Trend Micro’ ‘TLSH’ Implementation (based on https://github.com/trendmicro/… (r, rstats, tlsh, lsh, lsh-implmentation, r-cyber)
  • tlsobs : Tools to Work with the ‘Mozilla’ ‘TLS’ Observatory ‘API’ in R (r, rstats, mozilla-observatory, r-cyber)
  • udpprobe : Send User Datagram Protocol (‘UDP’) Probes and Receive Responses in R (r, rstats, udp-client, udp, ubiquiti, r-cyber)
  • ulid : ⚙️ Universally Unique Lexicographically Sortable Identifiers in R (r, rstats, ulid, uuid, r-cyber)
  • urldiversity : Quantify ‘URL’ Diversity and Apply Popular Biodiversity Indices to a ‘URL’ Collection (r, rstats, species-diversity, url, urls, uri, r-cyber)
  • urlscan : Analyze Websites and Resources They Request (r, rstats, urlscan, analyze-websites, scanning, urlscan-io, r-cyber)
  • vershist : Collect Version Histories For Vendor Products (rstats, r, semantic-versions, version-check, version-checker, r-cyber, release-history)
  • wand : R interface to libmagic – returns file mime type (r, rstats, magic-bytes, file, r-cyber)
  • warc : Tools to Work with the Web Archive Ecosystem in R (r, rstats, warc, r-cyber, warc-ecosystem, warc-files)
  • wayback : ⏪ Tools to Work with the Various Internet Archive Wayback Machine APIs (r, rstats, wayback-machine, internet-archive, web-scraping, r-cyber, memento, wayback)
  • webhose : Tools to Work with the ‘webhose.io’ ‘API’ in R (r, rstats, r-cyber, webhose)
  • whoisxmlapi : ❔ R package to interface with the WhoisXMLAPI.com service (r, rstats, whoisxmlapi, r-cyber)
  • xattrs : Work With Filesystem Object Extended Attributes — https://hrbrmstr.github.io/xattrs/index.html (r, rstats, xattr, xattr-support, r-cyber)
  • xforce : Tools to Gather Threat Intelligence from ‘IBM’ ‘X-Force’ (r, rstats, ibm-xforce, threat-intel, threat-intelligence, r-cyber)
  • zdnsr : Perform Bulk ‘DNS’ Queries Using ‘zdns’ (r, rstats, zdns, bulk-dns, r-cyber)

The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days).

The major change from version 0.5.x (which never made it to CRAN) is a swap out of the reticulated docker package with the pure-R stevedore package which will make it loads more compatible across the landscape of R installs as it removes a somewhat heavy dependency on a working Python environment (something quite challenging to consistently achieve in that fragmented language ecosystem).

Another addition is a set of new user agents for Android, Kindle, Apple TV & Chromecast as an increasing number of sites are changing what type of HTML (et al.) they send to those and other alternative glowing rectangles. A more efficient/sane user agent system will also be introduced prior to the CRAN release. Now’s the time to vote on existing issues or file new ones if there is a burning desire for new or modified functionality.

Since the Travis tests now work (they were failing miserably because of the Python dependency) I’ve integrated the changes from the 0.6.0 branch into the master branch but you can follow the machinations of the 0.6.0 branch up until CRAN release.