
There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format, which helped me throw together a small R package, msgxtractr, that can read in these ‘.msg’ files and produce a list as a result.

I had previously created a quick version of this by wrapping a Python module, but that’s a path fraught with peril and did not work for one of the requestors (yay, not-so-cross-platform UTF woes). So, I cobbled together some bits and pieces from the C implementation to provide a singular function read_msg() that smashes open bottled up msgs, grabs sane/useful fields and produces a list() with them all wrapped up in a bow (an example is at the end and in the GH README).

Thanks to rhub, WinBuilder and Travis the code works on macOS, Linux and Windows and even has pretty decent code coverage for a quick project. That’s a resounding testimony to the work of many members of the R community who’ve gone to great lengths to make testing virtually painless for package developers.

Now, I literally have a singular ‘.msg’ file to test with, so if folks can kick the tyres, file issues (with errors or feature suggestions) and provide some more ‘.msg’ files for testing, it would be most appreciated.

devtools::install_github("hrbrmstr/msgxtractr")

library(msgxtractr)

print(str(read_msg(system.file("extdata/unicode.msg", package="msgxtractr"))))

## List of 7
##  $ headers         :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  18 variables:
##   ..$ Return-path               : chr "<brizhou@gmail.com>"
##   ..$ Received                  :List of 1
##   .. ..$ : chr [1:4] "from st11p00mm-smtpin007.mac.com ([17.172.84.240])\nby ms06561.mac.com (Oracle Communications Messaging Server "| __truncated__ "from mail-vc0-f182.google.com ([209.85.220.182])\nby st11p00mm-smtpin007.mac.com\n(Oracle Communications Messag"| __truncated__ "by mail-vc0-f182.google.com with SMTP id ie18so3484487vcb.13 for\n<brianzhou@me.com>; Mon, 18 Nov 2013 00:26:25 -0800 (PST)" "by 10.58.207.196 with HTTP; Mon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Original-recipient        : chr "rfc822;brianzhou@me.com"
##   ..$ Received-SPF              : chr "pass (st11p00mm-smtpin006.mac.com: domain of brizhou@gmail.com\ndesignates 209.85.220.182 as permitted sender)\"| __truncated__
##   ..$ DKIM-Signature            : chr "v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com;\ns=20120113; h=mime-version:date:message-id:subject:f"| __truncated__
##   ..$ MIME-version              : chr "1.0"
##   ..$ X-Received                : chr "by 10.221.47.193 with SMTP id ut1mr14470624vcb.8.1384763184960;\nMon, 18 Nov 2013 00:26:24 -0800 (PST)"
##   ..$ Date                      : chr "Mon, 18 Nov 2013 10:26:24 +0200"
##   ..$ Message-id                : chr "<CADtJ4eNjQSkGcBtVteCiTF+YFG89+AcHxK3QZ=-Mt48xygkvdQ@mail.gmail.com>"
##   ..$ Subject                   : chr "Test for TIF files"
##   ..$ From                      : chr "Brian Zhou <brizhou@gmail.com>"
##   ..$ To                        : chr "brianzhou@me.com"
##   ..$ Cc                        : chr "Brian Zhou <brizhou@gmail.com>"
##   ..$ Content-type              : chr "multipart/mixed; boundary=001a113392ecbd7a5404eb6f4d6a"
##   ..$ Authentication-results    : chr "st11p00mm-smtpin007.mac.com; dkim=pass\nreason=\"2048-bit key\" header.d=gmail.com header.i=@gmail.com\nheader."| __truncated__
##   ..$ x-icloud-spam-score       : chr "33322\nf=gmail.com;e=gmail.com;pp=ham;spf=pass;dkim=pass;wl=absent;pwl=absent"
##   ..$ X-Proofpoint-Virus-Version: chr "vendor=fsecure\nengine=2.50.10432:5.10.8794,1.0.14,0.0.0000\ndefinitions=2013-11-18_02:2013-11-18,2013-11-17,19"| __truncated__
##   ..$ X-Proofpoint-Spam-Details : chr "rule=notspam policy=default score=0 spamscore=0\nsuspectscore=0 phishscore=0 bulkscore=0 adultscore=0 classifie"| __truncated__
##  $ sender          :List of 2
##   ..$ sender_email: chr "brizhou@gmail.com"
##   ..$ sender_name : chr "Brian Zhou"
##  $ recipients      :List of 2
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brianzhou@me.com"
##   ..$ :List of 3
##   .. ..$ display_name : NULL
##   .. ..$ address_type : chr "SMTP"
##   .. ..$ email_address: chr "brizhou@gmail.com"
##  $ subject         : chr "Test for TIF files"
##  $ body            : chr "This is a test email to experiment with the MS Outlook MSG Extractor\r\n\r\n\r\n-- \r\n\r\n\r\nKind regards\r\n"| __truncated__
##  $ attachments     :List of 2
##   ..$ :List of 4
##   .. ..$ filename     : chr "importOl.tif"
##   .. ..$ long_filename: chr "import OleFileIO.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:969674] 49 49 2a 00 ...
##   ..$ :List of 4
##   .. ..$ filename     : chr "raisedva.tif"
##   .. ..$ long_filename: chr "raised value error.tif"
##   .. ..$ mime         : chr "image/tiff"
##   .. ..$ content      : raw [1:1033142] 49 49 2a 00 ...
##  $ display_envelope:List of 2
##   ..$ display_cc: chr "Brian Zhou"
##   ..$ display_to: chr "brianzhou@me.com"
## NULL

NOTE: Don’t try to read those TIFF images with magick or even the tiff package. The content seems to have some strange tags/fields. But, saving it (use writeBin()) and opening with Preview (or your favorite image viewer) should work (it did for me and produces the following image that I’ve converted to png):
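Since the attachment content comes back as a raw vector, writing it to disk is a one-liner. Here’s a minimal sketch (not from the original post) using the sample message that ships with the package; the output filename just reuses the attachment’s own long_filename:

library(msgxtractr)

msg <- read_msg(system.file("extdata/unicode.msg", package = "msgxtractr"))

# each attachment is a list with (among other things) a raw `content`
# vector and a `long_filename`; dump the bytes out for an external viewer
att <- msg$attachments[[1]]
writeBin(att$content, att$long_filename)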

I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document and it usually does a good job but there are some pages that it fails miserably on since it’s more of a brute-force method than one that uses any real “intelligence” when performing the text node targeting.

Most modern browsers have inherent or plugin-able “readability” capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that use a similar methodology, but I’m not aware of any R ports.

What do I mean by “clean text”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:

library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
## [1] "\n \n<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <meta charset=\"utf-8\">\n <meta name=\"apple-mobile-web-app-capable\" content=\"yes\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" />\n <meta name=\"apple-mobile-web-app-status-bar-style\" content=\"black\" />\n <link rel=\"shortcut icon\" href=\"/assets/flat-ui/images/favicon.ico\">\n\n <link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://ropensci.org/feed.xml\" />\n\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/bootstrap/css/bootstrap.css\">\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/css/flat-ui.css\">\n\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/icon-font.css\">\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/animations.css\">\n <link rel=\"stylesheet\" href=\"/static/css/style.css\">\n <link href=\"/assets/css/ss-social/webfonts/ss-social.css\" rel=\"stylesheet\" />\n <link href=\"/assets/css/ss-standard/webfonts/ss-standard.css\" rel=\"stylesheet\"/>\n <link rel=\"stylesheet\" href=\"/static/css/github.css\">\n <script type=\"text/javascript\" src=\"//use.typekit.net/djn7rbd.js\"></script>\n <script type=\"text/javascript\">try{Typekit.load();}catch(e){}</script>\n <script src=\"/static/highlight.pack.js\"></script>\n <script>hljs.initHighlightingOnLoad();</script>\n\n <title>Onboarding visdat, a tool for preliminary visualisation of whole dataframes</title>\n <meta name=\"keywords\" content=\"R, software, package, review, community, visdat, data-visualisation\" />\n <meta name=\"description\" content=\"\" />\n <meta name=\"resource_type\" content=\"website\"/>\n <!– RDFa Metadata (in DublinCore) –>\n <meta property=\"dc:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"dc:creator\" content=\"\" />\n <meta property=\"dc:date\" content=\"\" />\n <meta property=\"dc:format\" content=\"text/html\" />\n <meta property=\"dc:language\" content=\"en\" />\n <meta property=\"dc:identifier\" content=\"/blog/blog/2017/08/22/visdat\" />\n <meta property=\"dc:rights\" content=\"CC0\" />\n <meta property=\"dc:source\" content=\"\" />\n <meta property=\"dc:subject\" content=\"Ecology\" />\n <meta property=\"dc:type\" content=\"website\" />\n <!– RDFa Metadata (in OpenGraph) –>\n <meta property=\"og:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"og:author\" content=\"/index.html#me\" /> <!– Should be Liquid? URI? –>\n <meta property=\"http://ogp.me/ns/profile#first_name\" content=\"\"/>\n <meta property=\"http://ogp.me/ns/profile#last_name\" content=\"\"/>\n

(it goes on for a bit, best to run the code locally)

We can use the reticulate package to load the Python readability module to just get the clean, article text:

readability <- import("readability") # pip install readability-lxml

doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))

doc$summary() %>%
  read_xml() %>%
  xml_text()
# [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.

(again, it goes on for a bit, best to run the code locally)

That text is now in good enough shape to tidy.

Here’s the same version with clean_text():

# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n  \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.

(lastly, it goes on for a bit, best to run the code locally)

As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.

The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).

One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are:

  • jwatjars — holds the core JWAT JAR files
  • jwatr — the package that does the actual work

I’ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.

Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.

The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).

To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.

First, we’ll need some helpers:

library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr) # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that but I wanted to show some of the work R does behind the scenes for you.

Now, we’ll grab some neighbourhood and crime info:

wf <- warc_file("~/Data/wrap-test")

res <- warc_GET(wf, "https://data.police.uk/api/leicestershire/neighbourhoods")

str(jsonlite::fromJSON(content(res, as="text")), 2)
## 'data.frame':	67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...

res <- warc_GET(wf, "https://data.police.uk/api/crimes-street/all-crime",
                query = list(lat=52.629729, lng=-1.131592, date="2017-01"))

res <- warc_GET(wf, "https://data.police.uk/api/crimes-at-location",
                query = list(location_id="884227", date="2017-02"))

close_warc_file(wf)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.

file.info("~/Data/wrap-test.warc.gz")$size
## [1] 76020

We can use these results later and, pretty easily, since the WARC file will be read in as a tidy R tibble (fancy data frame):

xdf <- read_warc("~/Data/wrap-test.warc.gz", include_payload = TRUE)

glimpse(xdf)
## Observations: 3
## Variables: 14
## $ target_uri                 <chr> "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                 <chr> "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type          <chr> "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                  <chr> "response", "response", "response"
## $ content_length             <dbl> 2984, 511564, 688
## $ payload_type               <chr> "application/json", "application/json", "application/json"
## $ profile                    <chr> NA, NA, NA
## $ date                       <dttm> 2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code           <dbl> 200, 200, 200
## $ http_protocol_content_type <chr> "application/json", "application/json", "application/json"
## $ http_version               <chr> "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers           <list> [<48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...
## $ warc_record_id             <chr> "<urn:uuid:2ae3e851-a1cf-466a-8f73-9681aab25d0c>", "<urn:uuid:77b30905-37f7-4c78-a120-2a008e194f94>",...
## $ payload                    <list> [<5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 74...

xdf$target_uri
## [1] "https://data.police.uk/api/leicestershire/neighbourhoods"                                   
## [2] "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01"
## [3] "https://data.police.uk/api/crimes-at-location?location_id=884227&date=2017-02" 

The URLs are all there, so it will be easier to map the original calls to them.

Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:

for (i in 1:nrow(xdf)) {
  res <- jsonlite::fromJSON(readBin(xdf$payload[[i]], "character"))
  print(str(res, 2))
}
## 'data.frame': 67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...
## NULL
## 'data.frame': 1318 obs. of  9 variables:
##  $ category        : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ location_type   : chr  "Force" "Force" "Force" "Force" ...
##  $ location        :'data.frame': 1318 obs. of  3 variables:
##   ..$ latitude : chr  "52.616961" "52.629963" "52.641646" "52.635184" ...
##   ..$ street   :'data.frame': 1318 obs. of  2 variables:
##   ..$ longitude: chr  "-1.120719" "-1.122291" "-1.131486" "-1.135455" ...
##  $ context         : chr  "" "" "" "" ...
##  $ outcome_status  :'data.frame': 1318 obs. of  2 variables:
##   ..$ category: chr  NA NA NA NA ...
##   ..$ date    : chr  NA NA NA NA ...
##  $ persistent_id   : chr  "" "" "" "" ...
##  $ id              : int  54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...
##  $ location_subtype: chr  "" "" "" "" ...
##  $ month           : chr  "2017-01" "2017-01" "2017-01" "2017-01" ...
## NULL
## 'data.frame': 1 obs. of  9 variables:
##  $ category        : chr "violent-crime"
##  $ location_type   : chr "Force"
##  $ location        :'data.frame': 1 obs. of  3 variables:
##   ..$ latitude : chr "52.643950"
##   ..$ street   :'data.frame': 1 obs. of  2 variables:
##   ..$ longitude: chr "-1.143042"
##  $ context         : chr ""
##  $ outcome_status  :'data.frame': 1 obs. of  2 variables:
##   ..$ category: chr "Unable to prosecute suspect"
##   ..$ date    : chr "2017-02"
##  $ persistent_id   : chr "4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735"
##  $ id              : int 54726925
##  $ location_subtype: chr ""
##  $ month           : chr "2017-02"
## NULL

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:

for (i in 1:nrow(xdf)) {
  
  payload_content(
    xdf$target_uri[i], 
    xdf$http_protocol_content_type[i], 
    xdf$http_raw_headers[[i]], 
    xdf$payload[[i]], as = "text"
  ) %>% 
    jsonlite::fromJSON() -> res
  
  print(str(res, 2))
  
}

The same output is printed, so I’m saving some blog content space by not including it.

Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC request record as well as a response record. But, I wanted to toss this out there to get feedback on the idiom and what possible desired functionality should be added.

So, please kick the tyres and file as many issues as you have time or interest to. I’m still designing the full package API and making refinements to existing functions, so there’s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.

(General reminder abt “R⁶” posts in that they are heavy on code-examples, minimal on expository. I try to design them with 2-3 “nuggets” embedded for those who take the time to walk through the code examples on their systems. I’ll always provide further expository if requested in a comment, so don’t hesitate to ask if something is confusing.)

I had to check something on the macOS systems across the abode today and — on a lark — decided to do all the “shell” scripting in R vs bash for a change. After performing the tasks, it occurred to me that not all R users on macOS realize there are hidden gems of information spread across the “boring” parts of the filesystem in SQLite databases. So, I put together a small example that:

  • identifies all the top-level apps in /Applications
  • extracts code signing information from them into a tibble
  • grabs the Gatekeeper whitelist database
  • uses an internal SQLite transformation function inside a dplyr chain
  • merges the info and gives you a basis to explore your apps
  • shows off the wicked cool processx package by @gaborcsardi that you should be using in place of system() / system2()

A quick note about this Gatekeeper database: If an app is not already recognized by macOS and you allow it to run your local system is updated to “whitelist” this app so you don’t get the notification in the future. Apple maintains a large signature list that gets updated throughout the year on your system and your own list is merged with it.

A second quick note (I guess I’m doing more expository than promised :-) about QUOTE(): In certain dplyr / dbplyr contexts, what looks like local function calls are actually passed over to the SQL side vs used locally. Even though quote() is an internal R function, the use of the lowercase version of it would still go over to the SQL side vs be eval’d locally. However, to avoid confusing others, I try to uppercase conflicts like this when they occur.
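If you want to see that translation happen without touching a database, dbplyr can show it directly. This is just an illustration (not part of the original workflow) using translate_sql() with a simulated SQLite connection:

library(dbplyr)

# unknown functions are passed through verbatim to the SQL engine, which
# is why QUOTE() gets run by SQLite instead of being eval'd in R
translate_sql(QUOTE(current), con = simulate_sqlite())
## <SQL> QUOTE(`current`)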

library(processx)
library(stringi)
library(docxtractr) # install_github("hrbrmstr/docxtractr")
library(tidyverse)

list.files("/Applications", pattern = "app$", full.names = TRUE) %>% 
  map_df(~{
    message(.x)
    res <- run("codesign", args = c("-dvvv", .x), error_on_status = FALSE)
    if (any(grepl("not signed at all", res$stderr))) {
      list(Executable = .x)
    } else {
      stri_split_lines(res$stderr)[[1]] %>% 
        keep(~grepl("=", .)) %>% 
        stri_split_fixed("=", 2, simplify = TRUE) -> res
      as.list(set_names(res[,2], res[,1]))
    }
  }) %>% 
  mcga() %>% 
  mutate(short_name = stri_replace_last_fixed(basename(executable), ".app", "")) -> my_apps

glimpse(my_apps)
## Observations: 102
## Variables: 20
## $ executable                  <chr> "/Applications/1Password 6.app/Con...
## $ identifier                  <chr> "com.agilebits.onepassword4", "com...
## $ format                      <chr> "app bundle with Mach-O thin (x86_...
## $ codedirectory_v             <chr> "20200 size=29594 flags=0x0(none) ...
## $ hash_type                   <chr> "sha256 size=32", "sha256 size=32"...
## $ candidatecdhash_sha1        <chr> "e21ac2a66473feec6276b448fc518678d...
## $ candidatecdhash_sha256      <chr> "8cf4cb4bdbea3b4d4f9e293e1aee1edb7...
## $ hash_choices                <chr> "sha1,sha256", "sha1,sha256", "sha...
## $ cdhash                      <chr> "8cf4cb4bdbea3b4d4f9e293e1aee1edb7...
## $ signature_size              <chr> "8915", "8936", "4610", "8528", "4...
## $ authority                   <chr> "Apple Root CA", "Apple Root CA", ...
## $ timestamp                   <chr> "Jul 17, 2017, 10:18:00 AM", "Jul ...
## $ info_plist_entries          <chr> "32", "31", "30", "17", "30", "18"...
## $ teamidentifier              <chr> "2BUA8C4S2C", "XZZXE9SED4", "94KV3...
## $ sealed_resources_version    <chr> "2 rules=12 files=2440", "2 rules=...
## $ internal_requirements_count <chr> "1 size=220", "1 size=192", "1 siz...
## $ signed_time                 <chr> NA, NA, "Jul 13, 2017, 6:40:13 PM"...
## $ library_validation_warning  <chr> NA, NA, NA, "OS X SDK version befo...
## $ platform_identifier         <chr> NA, NA, NA, NA, "2", NA, NA, "2", ...
## $ short_name                  <chr> "1Password 6", "Alfred 3", "Amazon...

# this is the macOS Gatekeeper whitelist db
db <- src_sqlite("/private/var/db/gkopaque.bundle/Contents/Resources/gkopaque.db")

db
## src:  sqlite 3.19.3 [/private/var/db/gkopaque.bundle/Contents/Resources/gkopaque.db]
## tbls: conditions, merged, whitelist
whitelist <- tbl(db, "whitelist")

# they are binary blobs, but we need them as text, so use SQLite's "QUOTE()" function
mutate(whitelist, current = QUOTE(current)) %>% 
  select(current) %>% 
  collect() %>% 
  mutate(
    cdhash = stri_replace_first_fixed(current, "X'", "") %>% 
      stri_replace_last_fixed("'", "") %>% 
      stri_trans_tolower()
  ) -> gatekeeper_whitelist_hashes

my_apps <- left_join(my_apps, gatekeeper_whitelist_hashes)

# Unsigned apps
filter(my_apps, is.na(cdhash)) %>% 
  select(short_name)

# you can see your own output here ;-)

# App organization
select(my_apps, identifier) %>% 
  mutate(
    identifier = stri_split_fixed(identifier, ".", simplify = TRUE)[,2],
    identifier = ifelse(identifier == "", "UNSPECIFIED", identifier)
  ) %>% 
  count(identifier, sort=TRUE)
## # A tibble: 47 x 2
##        identifier     n
##             <chr> <int>
##  1          apple    37
##  2    UNSPECIFIED    10
##  3      microsoft     6
##  4  eclecticlight     3
##  5         google     3
##  6         amazon     2
##  7      agilebits     1
##  8 antonycourtney     1
##  9       appscape     1
## 10   audacityteam     1
## # ... with 37 more rows


# Which apps have code-signing validation warnings
filter(my_apps, !is.na(library_validation_warning)) %>% 
  select(short_name, library_validation_warning)
## # A tibble: 7 x 2
##        short_name
##             <chr>
## 1    Amazon Music
## 2        Audacity
## 3         firefox
## 4         RStudio
## 5             VLC
## 6 Webcam Settings
## 7     WordService
## # ... with 1 more variables: library_validation_warning <chr>

NOTE: One could easily extend this example to look for apps across the filesystem.

The reticulate package provides a very clean & concise interface bridge between R and Python which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to use reticulate to create parquet files directly from R using reticulate as a bridge to the pyarrow module, which has the ability to natively create parquet files.

Now, you can create parquet files through R with Apache Drill — and, I’ll provide another example for that here — but, you may have need to generate such files and not have the ability to run Drill.

The Python parquet process is pretty simple since you can convert a pandas DataFrame directly to a pyarrow Table which can be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:

library(reticulate)

pd <- import("pandas", "pd")
pa <- import("pyarrow", "pa")
pq <- import("pyarrow.parquet", "pq")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)
mtcars_tab <- pa$Table$from_pandas(mtcars_df)

pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))

I wouldn’t want to do that for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms for creation of larger parquet structures). Here’s how we’d do that with Drill via the sergeant package:

readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")
dc <- drill_connection("localhost")
drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")

Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files and also has an index column since they’re needed in Python DataFrames (ugh), but small-ish data frames will end up in a single file whereas the Drill created ones will be in a directory with an additional CRC file (and, much smaller by default). NOTE: You can use preserve_index=False on the call to Table.from_pandas to get rid of that icky index.
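Here’s the same conversion from above with that tweak applied; a minimal variation (the output path is just illustrative):

# preserve_index = FALSE keeps the pandas index column out of the file
mtcars_tab <- pa$Table$from_pandas(mtcars_df, preserve_index = FALSE)
pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python_noidx.parquet"))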

It’s fairly efficient even for something like nycflights13::flights which has ~330K rows and 19 columns:

system.time(
  r_to_py(nycflights13::flights) %>% 
  pd$DataFrame$from_dict() %>% 
  pa$Table$from_pandas() %>% 
  pq$write_table(where = "/tmp/flights.parquet")
)
##    user  system elapsed 
##   1.285   0.108   1.398 

If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.

UPDATE (2018-01-25)

APIs change and while the above still works, there’s a slightly simpler way, now:

library(reticulate)

pd <- import("pandas", "pd")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)

mtcars_df$to_parquet(path.expand("~/Data/mtcars_python.parquet"), "pyarrow")

One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest:

If you load up that tweet and follow the thread, you’ll see a really good question by @kennethrose82 regarding what an appropriate setting should be for a delay between crawls.

The answer is a bit nuanced as there are some written and unwritten “rules” for those who would seek to scrape web site content. For the sake of brevity in this post, we’ll only focus on “best practices” (ugh) for being kind to web site resources when it comes to timing requests, after a quick mention that “Step 0” must be to validate that the site’s terms & conditions or terms of service allow you to scrape & use data from said site.

Robot Roll Call

The absolute first thing you should do before scraping a site should be to check out their robots.txt file. What’s that? Well, I’ll let you read about it first from the vignette of the package we’re going to use to work with it.

Now that you know what such a file is, you also likely know how to peruse it since the vignette has excellent examples. But, we’ll toss one up here for good measure, focusing on one field that we’re going to talk about next:

library(tidyverse)
library(rvest)

robotstxt::robotstxt("seobook.com")$crawl_delay %>% 
  tbl_df()
## # A tibble: 114 x 3
##          field       useragent value
##          <chr>           <chr> <chr>
##  1 Crawl-delay               *    10
##  2 Crawl-delay        asterias    10
##  3 Crawl-delay BackDoorBot/1.0    10
##  4 Crawl-delay       BlackHole    10
##  5 Crawl-delay    BlowFish/1.0    10
##  6 Crawl-delay         BotALot    10
##  7 Crawl-delay   BuiltBotTough    10
##  8 Crawl-delay    Bullseye/1.0    10
##  9 Crawl-delay   BunnySlippers    10
## 10 Crawl-delay       Cegbfeieh    10
## # ... with 104 more rows

I chose that site since it has many entries for the Crawl-delay field, which defines the number of seconds a given site would like your crawler to wait between scrapes. For the continued sake of brevity, we’ll assume you’re going to be looking at the * entry when you perform your own scraping tasks (even though you should be setting your own User-Agent string). Let’s make a helper function for retrieving this value from a site, adding in some logic to provide a default value if no Crawl-Delay entry is found and to optimize the experience a bit (note that I keep varying the case of crawl-delay when I mention it to show that the field key is case-insensitive; be thankful robotstxt normalizes it for us!):

.get_delay <- function(domain) {
  
  message(sprintf("Refreshing robots.txt data for %s...", domain))
  
  cd_tmp <- robotstxt::robotstxt(domain)$crawl_delay
  
  if (length(cd_tmp) > 0) {
    star <- dplyr::filter(cd_tmp, useragent=="*")
    if (nrow(star) == 0) star <- cd_tmp[1,]
    as.numeric(star$value[1])
  } else {
    10L
  }

}

get_delay <- memoise::memoise(.get_delay)

The .get_delay() function could be made a bit more bulletproof, but I have to leave some work for y’all to do on your own. So, why both .get_delay() and the get_delay() functions, and what is this memoise? Well, even though robotstxt::robotstxt() will ultimately cache (in-memory, so only in the active R session) the robots.txt file it retrieved (if it retrieved one) we don’t want to do the filter/check/default/return all the time since it just wastes CPU clock cycles. The memoise() operation will check which parameter was sent and return the value that was computed vs going through that logic again. We can validate that on the seobook.com domain:

get_delay("seobook.com")
## Refreshing robots.txt data for seobook.com...
## [1] 10

get_delay("seobook.com")
## [1] 10

You can now use get_delay() in a Sys.sleep() call before your httr::GET() or rvest::read_html() operations.
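For instance, a polite scraping loop could look something like this sketch (the URLs here are purely hypothetical stand-ins):

library(rvest)

urls <- c(
  "https://example.com/page/1",
  "https://example.com/page/2"
)

# look up (and memoise) the site's requested delay once, then honor it
# before every request
delay <- get_delay("example.com")

pages <- lapply(urls, function(u) {
  Sys.sleep(delay)
  read_html(u)
})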

Not So Fast…

Because you’re a savvy R coder and not some snake charmer, gem hoarder or go-getter, you likely caught the default 10L return value in .get_delay() and thought “Hrm… Where’d that come from?”.

I’m glad you asked!

I grabbed the first 400 robots.txt WARC files from the June 2017 Common Crawl, which ends up being ~1,000,000 sites. That sample ended up having ~80,000 sites with one or more CRAWL-DELAY entries. Some of those sites had entries that were not valid (in an attempt to either break, subvert or pwn a crawler) or set to a ridiculous value. I crunched through the data and made saner bins for the values to produce the following:

Sites seem to either want you to wait 10 seconds (or less) or about an hour between scraping actions. I went with the lower number purely for convenience, but would caution that this decision was based on the idea that your intention is to not do a ton of scraping (i.e. less than ~50-100 HTML pages). If you’re really going to do more than that, I strongly suggest you reach out to the site owner. Many folks are glad for the contact and could even arrange a better method for obtaining the data you seek.

FIN

So, remember:

  • check site ToS/T&C before scraping
  • check robots.txt before scraping (in general and for Crawl-Delay)
  • contact the site owner if you plan on doing a large amount of scraping
  • introduce some delay between page scrapes, even if the site does not have a specific crawl-delay entry, using the insights gained from the Common Crawl analysis to inform your decision

I’ll likely go through all the Common Crawl robots.txt WARC archives to get a fuller picture of the distribution of values and either update this post at a later date or do a quick new post on it.

(You also might want to run robotstxt::get_robotstxt("rud.is") #justsayin :-)

It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. However, it further means that when support for a new file type is added, I get that same functionality without any extra effort.

Why am I calling this out?

Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0 and one feature they’ve added is the ability to query individual and entire directories full of PCAP files from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR on them is that it’s an optimized binary file format for recording network activity. If you’re on macOS or a linux-ish system go do something like this:

sudo tcpdump -ni en0 -s0 -w capture01.pcap

And, wait a bit.

NOTE: Some of you may have to change the en0 to your main network interface name (a quick google for that for your platform should get you to the right one to use).

That command will passively record all network activity on your system until you ctrl-c it. The longer it goes the larger it gets.

When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump or Wireshark and there are many C[++] libraries and other utilities that can read them. You can even convert them to CSV or XML, but the PCAP itself requires custom tools to work with them effectively. I had started creating crafter to work with these files but my use case/project dried up and I haven’t gone back to it.

Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:

library(sergeant)
library(iptools)
library(tidyverse)
library(cymruservices)

db <- src_drill("localhost")

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip          <chr> "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port        <int> 60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session     <dbl> -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length   <int> 129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data            <chr> "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address <chr> "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port        <int> 443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type            <chr> "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip          <chr> "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address <chr> "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp       <dttm> 2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>% 
  collect() %>% 
  summarise(max - min)
## # A tibble: 1 x 1
##     `max - min`
##          <time>
## 1 1.924583 mins

count(my_pcaps, type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type     n
##   <chr> <int>
## 1   TCP  4974
## 2   UDP   774

filter(my_pcaps, type=="TCP") %>% 
  count(dst_port, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   DrillConnection
## # Ordered by: desc(n)
##    dst_port     n
##       <int> <int>
##  1      443  2580
##  2    56202   476
##  3    56229   226
##  4    56147   169
##  5    56215   103
##  6    56143    94
##  7    56085    89
##  8    56203    56
##  9    56205    39
## 10    56209    39
## # ... with more rows

filter(my_pcaps, type=="TCP") %>% 
  count(dst_ip, sort=TRUE) %>% 
  collect() -> dst_ips

filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip"="ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all=TRUE), c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
##            dst_ip     n                              as_name
##             <chr> <int>                                <chr>
##  1   104.244.42.2   862           TWITTER - Twitter Inc., US
##  2 104.244.46.103   556           TWITTER - Twitter Inc., US
##  3  104.20.60.241   183 CLOUDFLARENET - CloudFlare, Inc., US
##  4     31.13.80.8   160        FACEBOOK - Facebook, Inc., US
##  5  52.218.160.76   100     AMAZON-02 - Amazon.com, Inc., US
##  6  104.20.59.241    79 CLOUDFLARENET - CloudFlare, Inc., US
##  7  52.218.160.92    66     AMAZON-02 - Amazon.com, Inc., US
##  8  199.16.156.81    58           TWITTER - Twitter Inc., US
##  9 104.244.42.193    47           TWITTER - Twitter Inc., US
## 10  52.86.113.212    42    AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows

No custom R code. No modification to the sergeant package. Just query it like any other data source.

One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.

I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out) and I set up a directory that defaults to the pcap type. But after that, it just works.

Well, kinda.

The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And, my sergeant package (for now) can’t do much with the data component (neither can Drill-proper, either). But, it’s a great start and I can use it to do bulk parquet file creation of basic protocols & connection information or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.
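For the bulk parquet creation bit, the same CTAS pattern from the earlier Drill example can be reused. This is just a sketch (the table and capture paths are illustrative) that keeps the basic connection fields and skips the problematic data column:

library(sergeant)

dc <- drill_connection("localhost")

# persist the basic protocol/connection fields from a capture as parquet
drill_query(dc, "
CREATE TABLE dfs.tmp.`/capture02_parquet` AS
SELECT src_ip, src_port, dst_ip, dst_port, type, packet_length, `timestamp`
FROM dfs.caps.`/capture02.pcap`
")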

Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And, much work needs to be done on the UDF-side and sergeant side to help make the data element more useful.

Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool and much thanks goes out to the Drill dev team for sneaking this in to 1.11.0.

If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away in a javascript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there vs re-scrape Wikipedia (like Matt did) thanks to the V8 package by @opencpu.

The following code:

  • grabs the vis iframe
  • extracts and evaluates the target javascript to get a nice data frame
  • performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
  • plots the distributions using the beeswarm quasirandom algorithm

library(V8)
library(rvest)
library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

pg <- read_html("http://mattstiles.org/dailygraphics/graphics/attorney-general-tenure-20172517/child.html?initialWidth=840&childId=pym_0&parentTitle=Chart%3A%20If%20Ousted%2C%20Jeff%20Sessions%20Would%20Have%20a%20Historically%20Short%20Tenure%20%7C%20The%20Daily%20Viz&parentUrl=http%3A%2F%2Fthedailyviz.com%2F2017%2F07%2F25%2Fchart-if-ousted-jeff-sessions-would-have-a-historically-short-tenure%2F")

ctx <- v8()
ctx$eval(html_nodes(pg, xpath=".//script[contains(., 'DATA')]") %>% html_text())

ctx$get("DATA") %>% 
  as_tibble() %>% 
  readr::type_convert() %>% 
  mutate(party = ifelse(is.na(party), "Other", party)) %>% 
  mutate(party = fct_lump(party)) %>% 
  mutate(color1 = case_when(
    party == "Democratic" ~ "#313695",
    party == "Republican" ~ "#a50026",
    party == "Other" ~ "#4d4d4d")
  ) %>% 
  mutate(color2 = ifelse(grepl("Sessions", label), "#2b2b2b", "#00000000")) -> ags

ggplot() + 
  geom_quasirandom(data = ags, aes(party, amt, color = color1)) +
  geom_quasirandom(data = ags, aes(party, amt, color = color2), 
                   fill = "#ffffff00", size = 4, stroke = 0.25, shape = 21) +
  geom_text(data = data_frame(), aes(x = "Republican", y = 100, label = "Jeff Sessions"), 
            nudge_x = -0.15, family = font_rc, size = 3, hjust = 1) +
  scale_color_identity() +
  scale_y_comma(limits = c(0, 4200)) +
  labs(x = "Party", y = "Tenure (days)", 
       title = "U.S. Attorneys General",
       subtitle = "Distribution of tenure in office, by days & party: 1789-2017",
       caption = "Source data/idea: Matt Stiles <bit.ly/2vXAHTM>") +
  theme_ipsum_rc(grid = "XY")

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.