
Category Archives: R

Nowadays (I’ve seen that word used so much in journal articles lately that I could not resist using it) I’m using world tile grids more frequently as the need arises to convey the state of exposure of various services at a global (country) scale. Given that necessity fosters invention, it seemed that having a ggplot2 geom for world tile grids would make my life easier and my work more efficient.

To that end, there’s a nascent ggplot2 extension package for making world tile grids called (uncreatively) worldtilegrid. It’s also on GitHub if you’re more comfortable working with the code from there.

The README has examples but we’ll walk through another one here and work with life expectancy data curated by Our World in Data. Rather than draw this post out to an untenable length with an explanation of how to pull the XHR JSON data into R, just grab the CSV linked from the visualization on that page and substitute your local path in the code below.

We do need to clean up this data a bit since it has some issues. Let’s do that and carve out some slices into two new data frames so we can work with the most recent curated year and some historical data:

library(hrbrthemes) # gitlab/github :: hrbrmstr/hrbrthemes (use devtools installers)
library(worldtilegrid) # gitlab/github :: hrbrmstr/worldtilegrid (use devtools installers)
library(tidyverse)

cols(
  Entity = col_character(),
  Code = col_character(),
  Year = col_integer(),
  `Life expectancy (Clio-Infra up to 1949; UN Population Division for 1950 to 2015)` = col_double()
) -> le_cols

read_csv("~/Downloads/life-expectancy.csv", col_types = le_cols) %>%
  set_names(c("country", "iso3c", "year", "life_expectancy")) -> lifexp

# clean it up a bit since it's not great data and bust it into groups
# that match the vis on the site vs use continuous raw data, esp since
# the data is really just estimates
filter(lifexp, !is.na(iso3c)) %>%
  filter(nchar(iso3c) == 3) %>%
  mutate(
    grp = cut(
      x = life_expectancy,
      breaks = c(10, 20, 30, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85),
      include.lowest = TRUE
    )
  ) -> lifexp

last_year <- filter(lifexp, year == last(year))
a_few_years <- filter(lifexp, year %in% c(seq(1900, 2015, 20), 2015))

Now we can use it to make some cartograms.

Using the World Tile Grid geom

You can reference a previous post for what it would normally take to create these tile grids. The idea was to simplify the API down to the caller just needing to specify some country names/ISO[23]Cs (to a country aesthetic) and a value (to a fill aesthetic) and let the package do the rest. To that end, here's how to display the life expectancy data for 2015:

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") + # yep, you can sum up the entire blog post with this one line
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 2015",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number\nof years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to\nstay the same throughout its life",
    caption = "Data source: "

  ) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

You can add labels via geom_text() and use the wtg stat for it as it provides a number of computed variables you can work with:

  • x,y: the X,Y position of the tile
  • name: Country name (e.g. Afghanistan)
  • country.code: ISO2C country code abbreviation (e.g. AF)
  • iso_3166.2: Full ISO 3166 2-letter abbreviation code (e.g. ISO 3166-2:AF)
  • region: Region name (e.g. Asia)
  • sub.region: Sub-region name (e.g. Southern Asia)
  • region.code: Region code (e.g. 142)
  • sub.region.code: Sub-region code (e.g. 034)
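
For instance, here’s a minimal labeling sketch. Treat it as hedged: it assumes the stat is registered as "wtg" and that the computed variables above are reachable via ggplot2’s stat() helper; adjust size/colour to suit your theme:

ggplot(last_year) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  geom_text(
    aes(country = iso3c, label = stat(country.code)), # ISO2C code from the stat
    stat = "wtg", size = 2, color = "white"
  ) +
  coord_equal()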

Labeling should be a deliberate decision and used sparingly/with care.

Easier World Tile Grid Facets

Making faceted charts is also straightforward. Just use ggplot2's faceting subsystem:

ggplot(a_few_years) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  viridis::scale_fill_viridis(
    name = "Life Expectancy", discrete = TRUE, na.value=alpha(ft_cols$gray, 1/10),
    drop = FALSE
  ) +
  coord_equal() +
  labs(
    x=NULL, y=NULL,
    title = "Life expectancy, 1900-2015 (selected years)",
    subtitle = "Shown is period life expectancy at birth. This corresponds to an estimate of the average number of years a newborn infant\nwould live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life",
    caption = "Data source: "
  ) +
  facet_wrap(~year) +
  theme_ft_rc(grid="") +
  theme(axis.text=element_blank()) +
  theme(legend.position = "bottom")

(You may want to open that one up in a separate tab/window.)

Animation might have been a better choice than facets (an exercise left to the reader).
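
If you want to give that a go, here’s an untested, minimal sketch (hedged: it leans on gganimate’s transition_states() API and its {closest_state} glue label, and is not part of the worldtilegrid examples):

library(gganimate)

ggplot(a_few_years) +
  geom_wtg(aes(country = iso3c, fill = grp), border_col = "#2b2b2b") +
  coord_equal() +
  transition_states(year) +
  labs(title = "Life expectancy, {closest_state}")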

FIN

The package API is still in "rapidly changing" mode, but feel free to kick the tyres and file issues or PRs on your community coding platform of choice as needed.

As noted in the package, the world tile grid concept credit goes to Jon Schwabish and the CSV data used to create the package-exposed wtg data frame is the work of Maarten Lambrechts.

Apple has sounded the death knell for 32-bit macOS apps and, if you’re running a recent macOS version on your Mac (which you should, so you can get security updates), you likely see this alert from time-to-time:

If you’re like me, you click through that and keep working but later ponder just how many of those apps you have. They are definitely going away, so knowing if your favourite app is on the chopping block is likely a good idea.

You can get this information via the “About This Mac” → “System Report” app and sorting via one of the table fields.

R folks are data folks and we know we can do better than that. But, first we need to get the data. Thankfully, we can get this via the system_profiler command-line utility since it can both display user-friendly information in the terminal and also generate an XML version of the information to work with. We won’t need to head to the terminal for this work, though, since there are many ways to execute the command from R and read the generated output.

Executing System Calls from R

Base R provides two core methods for issuing a system call:

  • system()
  • system2()

Note that there are other functions provided with a base R installation that can also issue system commands and process the “piped” output, but we’ll focus on these deliberate invocation ones.

Functions in two other packages can also assist with this task and we’ll include a look at:

  • processx::run()
  • sys::exec_internal()

as well.

Why leave base R for this task? Truthfully, we really don’t need to, but both sys and processx have other tasks which do make them handy tools in your package toolbox. Having said that, keep reading since we’re going to end up not choosing the built-in functions for this task.

This is the command line we need to execute:

system_profiler -xml -detailLevel full SPApplicationsDataType

Let’s load up all the packages we’ll be needing and execute this command-line all four ways, then briefly discuss the differences:

library(sys)
library(processx)
library(microbenchmark)
library(xml2)
library(tidyverse)

system(
  command = "system_profiler -xml -detailLevel full SPApplicationsDataType",
  intern = TRUE
) -> apps_system

str(apps_system)
##  chr [1:10665] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ...

system2(
  command = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType"),
  stdout = TRUE
) -> apps_system2

str(apps_system2)
##  chr [1:10665] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ...

processx::run(
  command = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType"),
  spinner = TRUE
) -> apps_processx_run

str(apps_processx_run)
## List of 4
##  $ status : int 0
##  $ stdout : chr "XML STRING THAT prism.js won't let me show"
##  $ stderr : chr ""
##  $ timeout: logi FALSE

sys::exec_internal(
  cmd = "system_profiler",
  args = c("-xml", "-detailLevel", "full", "SPApplicationsDataType")
) -> apps_sys_exec_internal

str(apps_sys_exec_internal)
## List of 3
##  $ status: int 0
##  $ stdout: raw [1:331133] 3c 3f 78 6d ...
##  $ stderr: raw(0) 

The core difference between system() and the rest is that you need to shQuote() arguments for system() whereas that’s taken care of for you by the others (so they’re a bit safer by default since you’re more than likely going to forget to shQuote()).

You can definitely notice the main differences in return objects. The built-in functions just give us the character data from the standard output stream (stdout) and the last two return a more structured object that provides more explicit information about the job we just executed. The base ones can provide this detail, but it’s a twisty maze of remembering which options do what vs the more (IMO) straightforward approach both processx and sys take.

You’ll also notice a difference in stdout between processx and sys with the latter giving us a raw vector vs a character vector. This gives us a great deal of power and flexibility. It also turns out to be a great choice for processing command-line-generated XML data. Here’s why:

microbenchmark(
  sys = xml2::read_xml(apps_sys_exec_internal$stdout),
  processx = xml2::read_xml(apps_processx_run$stdout)
)
## Unit: milliseconds
##      expr      min       lq      mean    median        uq      max neval
##       sys 4.086492  4.60078  9.085143  5.508814  5.906942 207.6495   100
##  processx 9.510356 10.98282 14.275499 12.054810 13.292234 163.9870   100

It turns out xml2::read_xml() makes much quicker work of the raw vector data (though, I mean, really—are we really going to care about those ~5ms IRL?).

We’ll move on to the real reason for the post, but definitely explore both sys and processx since they are both super-handy packages.

“Can we please just find the 32-bit apps already?”

No problem. Well, actually, there is a minor annoyance. These are property list XML files and I’ll confess that I truly hate this format. There are “dictionary arrays” of key and value nodes, but those nodes are siblings vs directly associated pairs. So, we have to use the sibling relationship to work with them. It’s not hard, per se, just (again, IMO) suboptimal.

Let’s take a look at it:

apps <- read_xml(apps_sys_exec_internal$stdout)

xml_find_all(apps, "//array/dict")
## {xml_nodeset (476)}
##  [1] <dict>\n  <key>_SPCommandLineArguments</key>\n  <array>\n     ...
##  [2] <dict>\n  <key>_name</key>\n  <string>Sublime Text</string>\n  <key>h ...
##  [3] <dict>\n  <key>_name</key>\n  <string>System Preferences</string>\n   ...
##  [4] <dict>\n  <key>_name</key>\n  <string>Google Chrome Canary</string>\n ...
##  [5] <dict>\n  <key>_name</key>\n  <string>Google Chrome</string>\n   ...
##  [6] <dict>\n  <key>_name</key>\n  <string>Dropbox</string>\n  <key>has64B ...
##  [7] <dict>\n  <key>_name</key>\n  <string>Keypad</string>\n  <key>has64Bi ...
##  [8] <dict>\n  <key>_name</key>\n  <string>Garmin WebUpdater</string>\n  < ...
##  [9] <dict>\n  <key>_name</key>\n  <string>LaTeXiT</string>\n  <key>has64B ...
## [10] <dict>\n  <key>_name</key>\n  <string>CocoaPacketAnalyzer</string>\n  ...
## [11] <dict>\n  <key>_name</key>\n  <string>Janetter</string>\n  <key>has64 ...
## [12] <dict>\n  <key>_name</key>\n  <string>VMware Fusion</string>\n   ...
## [13] <dict>\n  <key>_name</key>\n  <string>Photo Library Migration Utility ...
## [14] <dict>\n  <key>_name</key>\n  <string>Setup Assistant</string>\n  <ke ...
## [15] <dict>\n  <key>_name</key>\n  <string>Siri</string>\n  <key>has64BitI ...
## [16] <dict>\n  <key>_name</key>\n  <string>Software Update</string>\n  <ke ...
## [17] <dict>\n  <key>_name</key>\n  <string>Spotlight</string>\n  <key>has6 ...
## [18] <dict>\n  <key>_name</key>\n  <string>Stocks</string>\n  <key>has64Bi ...
## [19] <dict>\n  <key>_name</key>\n  <string>SystemUIServer</string>\n  <ke ...
## [20] <dict>\n  <key>_name</key>\n  <string>UniversalAccessControl</string> ...
## ...

Let’s look at a sample record using xml_view() from the htmltidy package:

xml_find_all(apps, "//array/dict[key='_name']")[1] %>% 
  htmltidy::xml_view()

Be wary of using xml_view() on giant XML structures: it’ll freeze up RStudio for a bit and can even slow down Chrome, since the resultant composed DOM object can get ginormous.

Now we know we can use has64BitIntelCode for filtering once we get to the data. Let’s read in all the apps, cherry-picking the fields and then just look at the 32-bit apps:

xml_find_all(apps, "//array/dict[key='_name']") %>% 
  map_df(~{
    list(
      name = xml_find_first(.x, ".//string") %>% xml_text(),
      path = xml_find_first(.x, ".//key[.='path']/following-sibling::string") %>% xml_text(),
      is_64bit = xml_find_first(.x, ".//key[.='has64BitIntelCode']/following-sibling::string") %>% xml_text() 
    )
  }) %>% 
  filter(is_64bit == "no") %>% 
  arrange(name) %>% 
  select(-is_64bit)
## # A tibble: 30 x 2
##    name                      path                                           
##    <chr>                     <chr>
##  1 AAM Registration Notifier /Applications/Utilities/Adobe Application Mana…
##  2 AAM Registration Notifier /Applications/Utilities/Adobe Application Mana…
##  3 AAM Updates Notifier      /Applications/Utilities/Adobe Application Mana…
##  4 AAMLauncherUtil           /Applications/Utilities/Adobe Application Mana…
##  5 ACR_9_10                  /Library/Application Support/Adobe/Uninstall/A…
##  6 Adobe Application Manager /Applications/Utilities/Adobe Application Mana…
##  7 adobe_licutil             /Applications/Utilities/Adobe Application Mana…
##  8 Audacity                  /Applications/Audacity.app                     
##  9 COCM_1_0_32               /Library/Application Support/Adobe/Uninstall/C…
## 10 COPS_1_0_32               /Library/Application Support/Adobe/Uninstall/C…
## # ... with 20 more rows

The Adobe helper apps are longstanding 32-bit “offenders”. Many of these death-row apps fall into the “helper” category and will hopefully get some attention from their developers. I do find it amusing that Apple kinda wants us to prod the developers to get their collective acts together.

FIN

This exhaustive search finds all of the 32-bit apps residing on your system. If you just want to see the ones you’ve executed (and that macOS has kept track of), you can drop to a command-line and do:

sudo Rscript -e 'knitr::kable((dplyr::select(dplyr::tbl(dplyr::src_sqlite("/var/db/SystemPolicyConfiguration/ExecPolicy"), "legacy_exec_history_v3"), responsible_path)))'

You need elevated privileges to access those files, so please read that whole line to make sure I’m not having you rm -rf /.
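
If that one-liner is too dense, here’s a hedged, unrolled equivalent (same database path and table name as above; you still need to run R with elevated privileges for it to read the file):

library(dplyr)

src_sqlite("/var/db/SystemPolicyConfiguration/ExecPolicy") %>%
  tbl("legacy_exec_history_v3") %>%
  select(responsible_path) %>%
  collect()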

Remember to take some time to explore the sys and processx packages and perhaps bundle up the salient bits of this post into a script so you can occasionally check to see the 32-bit eradication progress.

It’s been over a year since Headless Chrome was introduced and it has matured greatly over that time and has acquired a pretty large user base. The TLDR on it is that you can now use Chrome as you would any command-line interface (CLI) program and generate PDFs, images or render javascript-interpreted HTML by supplying some simple parameters. It has a REPL mode for interactive work and can be instrumented through a custom websockets protocol.

R folks have had the decapitated package available almost since the launch day of Headless Chrome. It provides a basic wrapper to the CLI. The package has been updated more recently to enable the downloading of a custom Chromium binary to use instead of the system Chrome installation (which is a highly recommended practice).
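
If you haven’t used it, core decapitated operations look something like this (a hedged sketch from memory of the package README; check the current docs for exact signatures):

library(decapitated)

chrome_read_html("https://rud.is/b/") # render a javascript-heavy page, get back HTML
chrome_dump_pdf("https://rud.is/b/")  # "print" a page to PDF
chrome_shot("https://rud.is/b/")      # capture a screenshot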

However, that nigh-mundane addition is not the only new feature in decapitated.

Introducing gepetto

While it would have been possible to create an R wrapper for the Headless Chrome websockets API, the reality is (and this is just my opinion) that it is better to integrate with a more robust and community supported interface to Headless Chrome instrumentation dubbed puppeteer. Puppeteer is a javascript module that adds high level functions on top of the lower-level API and has a massive amount of functionality that can be easily tapped into.

Now, Selenium works really well with Headless Chrome and there’s little point in trying to reinvent that wheel. Rather, I wanted a way to interact with Headless Chrome the way one can with ScrapingHub’s Splash service. That is, a simple REST API. To that end, I’ve started a project called gepetto which aims to do just that.

Gepetto is a Node.js application which uses puppeteer for all the hard work. After seeing that such a REST API interface was possible via the puppetron proof of concept I set out to build a framework which will (eventually) provide the same feature set that Splash has, substituting puppeteer-fueled javascript for the Lua interface.

A REST API has a number of advantages over repeated CLI calls. First, each CLI call means one more system() call to start up a new process. You also need to manage Chrome binaries in that mode and are fairly limited in what you can do. With a REST API, Chrome loads once and then pages can be created at-will with no process startup overhead. Plus (once the API is finished) you’ll have far more control over what you can do. Again, this is not going to cover the same ground as Selenium, but should be of sufficient utility to add to your web-scraping toolbox.

Installing gepetto

There are instructions over at the repo on installing gepetto but R users can try a shortcut by grabbing the latest version of decapitated from Git[La|Hu]b and running decapitated::install_gepetto(), which should (hopefully) go smoothly provided you have a fairly recent version of Node.js installed along with npm.

The installer provides some guidance should things go awry. You’ll notice gepetto installs a current version of Chromium for your platform along with it, which helps ensure smoother sailing than using the version of Chrome you may use for browsing.

Working with gepetto

Before showing off the R interface, it’s worth a look at the (still minimal) web interface. Bring up a terminal/command prompt and enter gepetto. You should see something like this:

$ gepetto
Launch browser!
gepetto running on: http://localhost:3000

NOTE: You can use a different host/port by setting the HOST and PORT environment variables accordingly before startup.

You can then go to http://localhost:3000 in your browser and should see this:

Enter a URL into the input field and press the buttons! You can do quite a bit just from the web interface.

If you select “API Docs” (http://localhost:3000/documentation) you’ll get the Swagger-gen’d API documentation for all the API endpoints:

The Swagger definition JSON is also at http://localhost:3000/swagger.json.

The API documentation will be a bit more robust as the module’s corners are rounded out.

“But, this is supposed to be an R post…”

Yes. Yes it is.

If you followed along in the previous section and started gepetto from a command-line interface, kill the running service and fire up your favourite R environment and let’s scrape some content!

library(rvest)
library(decapitated)
library(tidyverse)

gpid <- start_gepetto()

gpid
## PROCESS 'gepetto', running, pid 60827.

gepetto() %>% 
  gep_active()
## [1] TRUE

Anything other than a “running” response means there’s something wrong and you can use the various processx methods on that gpid object to inspect the error log. If you were able to run gepetto from the command line then it should be fine in R, too. The gepetto() function builds a connection object and gep_active() tests an API endpoint to ensure you can communicate with the server.
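
Since gpid is a processx process object, the usual process methods apply if you need to poke at a sick service (a hedged sketch; read_error_lines() assumes start_gepetto() wired up the stderr pipe):

gpid$is_alive()         # should be TRUE
gpid$read_error_lines() # peek at the error log if things look wrong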

Now, let’s try hitting a website that requires javascript. I’ll borrow an example from Brooke Watson. The data for http://therapboard.com/ loads via javascript and will not work with xml2::read_html().

gepetto() %>% 
  gep_render_html("http://therapboard.com/") -> doc

html_nodes(doc, xpath=".//source[contains(@src, 'mp3')]") %>%  
  html_attr("src") %>% 
  head(10)
## [1] "audio/2chainz_4.mp3"        "audio/2chainz_yeah2.mp3"   
## [3] "audio/2chainz_tellem.mp3"   "audio/2chainz_tru.mp3"     
## [5] "audio/2chainz_unh3.mp3"     "audio/2chainz_watchout.mp3"
## [7] "audio/2chainz_whistle.mp3"  "audio/2pac_4.mp3"          
## [9] "audio/2pac_5.mp3"           "audio/2pac_6.mp3"

Even with a Node.js and npm dependency, I think that’s a bit friendlier than interacting with phantomjs.

We can render a screenshot of a site as well. Since we’re not stealing content this way, I’m going to cheat a bit and grab the New York Times front page:

gepetto() %>% 
  gep_render_magick("https://nytimes.com/")
##   format width height colorspace matte filesize density
## 1    PNG  1440   6828       sRGB  TRUE        0   72x72

Astute readers will notice it returns a magick object so you can work with it immediately.
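
That means you can pipe the result straight into magick operations (illustrative values; the scale geometry and output filename are arbitrary):

gepetto() %>%
  gep_render_magick("https://nytimes.com/") %>%
  magick::image_scale("720") %>%
  magick::image_write("nyt-front-page.png")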

I’m still working out the interface for image capture and will also be supporting capturing the image of a CSS selector target. I mention that since the gep_render_magick() actually captured the entire page which you can see for yourself (the thumbnail doesn’t do it justice).

Testing gep_render_pdf() is an exercise left to the reader.

FIN

The gepetto REST API is at version 0.1.0 meaning it’s new, raw and likely to change (quickly, too). Jump on board in whatever repo you’re more comfortable with and kick the tyres + file issues or PRs (on either or both projects) as you are wont to do.

Apache Drill 1.14.0 was recently released, bringing with it many new features and a temporary incompatibility with the current rev of the MapR ODBC drivers. The Drill community expects new ODBC drivers to arrive shortly. The sergeant package is an alternative to ODBC for R users, as it provides a dplyr interface to the REST API along with a JDBC interface and functions to work directly with the REST API in a more programmatic fashion.

First-class dplyr-citizen support for the sergeant JDBC interface

I’ve been primarily using the ODBC interface for a while now since it’s dead simple to use with dplyr (as has been noted in my still-unfinished, short cookbook on wiring up Drill and R). The ODBC incompatibility is pretty severe since it’s at the protobuf-level, but sergeant::src_drill() is an easy swap-out and does not have any issues since it works against the REST API. Unfortunately, the query endpoint of the REST API mangles the field order when it returns query results. This really isn’t too painful since it’s easy to add a select() call after gathering query results to reorder things. However, it’s painful enough that it facilitated rounding out some of the corners of the JDBC interface.
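
In other words, the REST-side workaround amounts to something like this (a hedged sketch using Drill’s bundled cp.`employee.json` sample table; swap in your own host and columns):

src_drill("localhost") %>%
  tbl("cp.`employee.json`") %>%
  collect() %>%
  select(employee_id, full_name, everything()) # re-impose the order you wanted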

sergeant::drill_jdbc() now returns a <DrillJDBCConnection> object which was necessary to add dplyr classes for just enough bits to enable smooth operation with the tbl() function (without breaking all your other RJDBC usage in the same session). The next blog section will use the new JDBC interface with dplyr as it introduces one of Drill’s new features.

Query Image Metadata with Apache Drill 1.14.0

There are quite a few R packages for reading image/media metadata. Since that seems to be en vogue, R folks might be interested in Drill’s new image metadata format plugin. Just point Drill to a directory of files and you can use a familiar dplyr interface to get the deets on your family photo inventory.

You first need to follow the directions at the aforelinked resource and add the following format to the formats: section.

formats: {
     "image": {
       "type": "image",
       "extensions": [
         "jpg", "jpeg", "jpe", "tif", "tiff", "dng", "psd", "png", "bmp", "gif",
         "ico", "pcx", "wav", "wave", "avi", "webp", "mov", "mp4", "m4a", "m4p",
         "m4b", "m4r", "m4v", "3gp", "3g2", "eps", "epsf", "epsi", "ai", "arw",
         "crw", "cr2", "nef", "orf", "raf", "rw2", "rwl", "srw", "x3f"
       ],
       "fileSystemMetadata": true,
       "descriptive": true,
       "timeZone": null
     }  
   }

Note that the configuration snippet on Drill’s site (as of the date-stamp on this post) did not have a , after the ] for the extensions array, so copy this one instead.

I created a media workspace and set the defaultInputFormat to image. Here’s a naive first look at what you can get back from a simple query to a jpg directory under it (using the new JDBC interface and dplyr):

library(sergeant)
library(tidyverse)

(con <- drill_jdbc("bigd:2181"))
## <DrillJDBCConnection>

tbl(con, "dfs.media.`/jpg/*`") %>%
  glimpse()
## Observations: ??
## Variables: 28
## $ FileSize         <chr> "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     <chr> "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           <chr> "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       <chr> "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight      <chr> "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel     <chr> "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight        <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion       <chr> "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        <chr> "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         <chr> "false", "false", "false", "false", "false", "...
## $ Duration         <chr> "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG             <chr> "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF             <chr> "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0         <chr> "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       <chr> "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote   <chr> "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS              <chr> "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP              <chr> "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop        <chr> "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC             <chr> "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman          <chr> "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType         <chr> "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...

That’s quite a bit of metadata, but the Drill format plugin page kinda fibs a bit about column types since we see many chrs there. You may be quick to question the sergeant package, but this isn’t using the REST interface and we can use DBI calls to ask Drill what it’s sending us:

dbSendQuery(con, "SELECT * FROM dfs.media.`/jpg/*`") %>%
  dbColumnInfo()
##         field.name        field.type data.type            name
## 1         FileSize CHARACTER VARYING character        FileSize
## 2     FileDateTime CHARACTER VARYING character    FileDateTime
## 3           Format CHARACTER VARYING character          Format
## 4       PixelWidth CHARACTER VARYING character      PixelWidth
## 5      PixelHeight CHARACTER VARYING character     PixelHeight
## 6     BitsPerPixel CHARACTER VARYING character    BitsPerPixel
## 7         DPIWidth CHARACTER VARYING character        DPIWidth
## 8        DPIHeight CHARACTER VARYING character       DPIHeight
## 9       Orientaion CHARACTER VARYING character      Orientaion
## 10       ColorMode CHARACTER VARYING character       ColorMode
## 11        HasAlpha CHARACTER VARYING character        HasAlpha
## 12        Duration CHARACTER VARYING character        Duration
## 13      VideoCodec CHARACTER VARYING character      VideoCodec
## 14       FrameRate CHARACTER VARYING character       FrameRate
## 15      AudioCodec CHARACTER VARYING character      AudioCodec
## 16 AudioSampleSize CHARACTER VARYING character AudioSampleSize
## 17 AudioSampleRate CHARACTER VARYING character AudioSampleRate
## 18            JPEG               MAP character            JPEG
## 19            JFIF               MAP character            JFIF
## 20        ExifIFD0               MAP character        ExifIFD0
## 21      ExifSubIFD               MAP character      ExifSubIFD
## 22  AppleMakernote               MAP character  AppleMakernote
## 23             GPS               MAP character             GPS
## 24             XMP               MAP character             XMP
## 25       Photoshop               MAP character       Photoshop
## 26            IPTC               MAP character            IPTC
## 27         Huffman               MAP character         Huffman
## 28        FileType               MAP character        FileType

We can still work with the results, but there’s also a pretty key element missing: the media filename. The reason it’s not in the listing is that filename is an implicit column that we have to ask for. So, we need to modify our query to be something like this:

tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
  glimpse()
## Observations: ??
## Variables: 29
## $ fname            <chr> "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize         <chr> "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     <chr> "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           <chr> "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       <chr> "4032", "4032", "4032", "4032", "4032", "4032"...
## $ PixelHeight      <chr> "3024", "3024", "3024", "3024", "3024", "3024"...
## $ BitsPerPixel     <chr> "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ DPIHeight        <chr> "72", "72", "72", "72", "72", "72", "72", "72"...
## $ Orientaion       <chr> "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        <chr> "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         <chr> "false", "false", "false", "false", "false", "...
## $ Duration         <chr> "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ AudioSampleRate  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ JPEG             <chr> "{\"CompressionType\":\"Baseline\",\"DataPreci...
## $ JFIF             <chr> "{\"Version\":\"1.1\",\"ResolutionUnits\":\"no...
## $ ExifIFD0         <chr> "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       <chr> "{\"ExposureTime\":\"1/2227 sec\",\"FNumber\":...
## $ AppleMakernote   <chr> "{\"UnknownTag(0x0001)\":\"5\",\"UnknownTag(0x...
## $ GPS              <chr> "{\"GPSLatitudeRef\":\"N\",\"GPSLatitude\":\"4...
## $ XMP              <chr> "{\"XMPValueCount\":\"4\",\"Photoshop\":{\"Dat...
## $ Photoshop        <chr> "{\"CaptionDigest\":\"48 89 11 77 33 105 192 3...
## $ IPTC             <chr> "{\"CodedCharacterSet\":\"UTF-8\",\"Applicatio...
## $ Huffman          <chr> "{\"NumberOfTables\":\"4 Huffman tables\"}", "...
## $ FileType         <chr> "{\"DetectedFileTypeName\":\"JPEG\",\"Detected...

We could work with the “map” columns using Drill’s SQL (there’s a short sketch of that after the next code block), but this is just metadata and, even if there are many files, most R folks have sufficient system memory these days to collect it all and work with it locally. There’s nothing stopping you from staying on the SQL side, though, and that may be a better choice if you’ll be using this to process huge archives. But we’ll do this in R and convert a bunch of field types along the way:

from_map <- function(x) { map(x, jsonlite::fromJSON) }

tbl(con, sql("SELECT filename AS fname, * FROM dfs.media.`/jpg/*`")) %>%
  collect() %>%
  mutate_at(
    .vars = vars(
      JPEG, JFIF, ExifSubIFD, AppleMakernote, GPS, XMP, Photoshop, IPTC, Huffman, FileType
    ),
    .funs=funs(from_map)
  ) %>%
  mutate_at(
    .vars = vars(
      PixelWidth, PixelHeight, DPIWidth, DPIHeight, FrameRate, AudioSampleSize, AudioSampleRate
    ),
    .funs=funs(as.numeric)
  ) %>%
  glimpse() -> media_df
## Observations: 11
## Variables: 29
## $ fname            <chr> "IMG_0778.jpg", "IMG_0802.jpg", "IMG_0793.jpg"...
## $ FileSize         <chr> "4412686 bytes", "4737696 bytes", "4253912 byt...
## $ FileDateTime     <chr> "Thu Aug 16 03:04:16 -04:00 2018", "Thu Aug 16...
## $ Format           <chr> "JPEG", "JPEG", "JPEG", "JPEG", "JPEG", "JPEG"...
## $ PixelWidth       <dbl> 4032, 4032, 4032, 4032, 4032, 4032, 3024, 4032...
## $ PixelHeight      <dbl> 3024, 3024, 3024, 3024, 3024, 3024, 4032, 3024...
## $ BitsPerPixel     <chr> "24", "24", "24", "24", "24", "24", "24", "24"...
## $ DPIWidth         <dbl> 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ DPIHeight        <dbl> 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72
## $ Orientaion       <chr> "Unknown (0)", "Unknown (0)", "Unknown (0)", "...
## $ ColorMode        <chr> "RGB", "RGB", "RGB", "RGB", "RGB", "RGB", "RGB...
## $ HasAlpha         <chr> "false", "false", "false", "false", "false", "...
## $ Duration         <chr> "00:00:00", "00:00:00", "00:00:00", "00:00:00"...
## $ VideoCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ FrameRate        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioCodec       <chr> "Unknown", "Unknown", "Unknown", "Unknown", "U...
## $ AudioSampleSize  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ AudioSampleRate  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ JPEG             <list> [["Baseline", "8 bits", "3024 pixels", "4032 ...
## $ JFIF             <list> [["1.1", "none", "72 dots", "72 dots", "0", "...
## $ ExifIFD0         <chr> "{\"Make\":\"Apple\",\"Model\":\"iPhone 7 Plus...
## $ ExifSubIFD       <list> [["1/2227 sec", "f/1.8", "Program normal", "2...
## $ AppleMakernote   <list> [["5", "[558 values]", "[104 values]", "1", "...
## $ GPS              <list> [["N", "44° 19' 6.3\"", "W", "-68° 11' 22.39\...
## $ XMP              <list> [["4", ["2017-06-22T14:28:04"], ["2017-06-22T...
## $ Photoshop        <list> [["48 89 11 77 33 105 192 33 170 252 63 34 43...
## $ IPTC             <list> [["UTF-8", "2", "14:28:04", "2017:06:22", "20...
## $ Huffman          <list> [["4 Huffman tables"], ["4 Huffman tables"], ...
## $ FileType         <list> [["JPEG", "Joint Photographic Experts Group",...
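
As promised, if you do want to stay on the SQL side, Drill can reach into map columns with dotted paths. A hedged sketch (t.ExifIFD0.Make follows Drill’s documented map-access syntax, but I’ve only lightly exercised it):

tbl(con, sql("
SELECT filename AS fname, t.ExifIFD0.Make AS make, t.ExifIFD0.Model AS model
FROM dfs.media.`/jpg/*` t
")) %>%
  collect()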

Now, we can do anything with the data, including getting the average file size:

mutate(media_df, FileSize = str_replace(FileSize, " bytes", "") %>% as.numeric()) %>%
  summarise(mean(FileSize))
## # A tibble: 1 x 1
##   `mean(FileSize)`
##              <dbl>
## 1         3878963.

FIN

The enhancements to the JDBC interface have only been given a light workout but seem to be doing well so far. Kick the tyres and file an issue if you have any problems. ODBC users should not have to wait long for new drivers and src_drill() aficionados can keep chugging along as before without issue.

For those new to Apache Drill, there’s now an official Docker image for it so you can get up and running without adding too much cruft to your local systems. I may add support for spinning up and managing a Drill container to the sergeant package, so keep your eyes on pushes to the repo.

Also keep an eye on the mini-cookbook as I’ll be modifying it to account for the new package changes and to introduce additional new Drill features.

The development version of splashr now supports authenticated connections to Splash API instances. Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and/or Splash: the latter is a lightweight alternative to tools like Selenium and the former is an R interface to it. Unlike xml2::read_html(), splashr renders a URL exactly as a browser does (because it uses a virtual browser) and can return far more than just the HTML from a web page. Splash does need to be running and it’s best to use it in a Docker container.

If you have a large number of sites to scrape, working with splashr and Splash “as-is” can be a bit frustrating since there’s a limit to what a single instance can handle. Sure, it’s possible to setup your own highly available, multi-instance Splash cluster and use it, but that’s work. Thankfully, the folks behind TeamHG-Memex created Aquarium which uses docker and docker-compose to stand up a multi-Splash instance behind a pre-configured HAProxy instance so you can take advantage of parallel requests to the Splash API. As long as you have docker and docker-compose handy (and Python) following the steps on the aforelinked GitHub page should have you up and running with Aquarium in minutes. You use the same default port (8050) to access the Splash API and you get a bonus port of 8036 to watch in your browser (the HAProxy stats page).

This works well when combined with furrr, an R package that makes parallel operations very tidy.

One way to use furrr, splashr and Aquarium together might look like this:

library(splashr)
library(HARtools)
library(urltools)
library(furrr)
library(tidyverse)

sites <- c("http://...", "http://...", ...) # a bunch of sites with unique URLs

make_a_splash <- function(org_url) {
  splash(
    host = "ip/name of system you started aquarium on", 
    user = "your splash api username", 
    pass = "your splash api password"
  ) %>% 
    splash_response_body(TRUE) %>% # we want to get all the content 
    splash_user_agent(ua_win10_ie11) %>% # splashr has many pre-configured user agents to choose from 
    splash_go(org_url) %>% 
    splash_wait(5) %>% # pick a reasonable timeout; modern web sites with javascript are bloated
    splash_har()
}

safe_splash <- safely(make_a_splash) # splashr/Splash work well but can throw errors. Let's be safe

plan(multiprocess, workers=5) # don't overwhelm the default setup or your internet connection

future_map(sites, ~{
  
  org <- safe_splash(.x) # go get it!
  
  if (is.null(org$result)) {
    sprintf("Error retrieving %s (%s)", .x, org$error$message) # this gives us good error messages
  } else {
    
    HARtools::writeHAR( # HAR format saves *everything*. the files are YUGE
      har = org$result, 
      file = file.path("/place/to/store/stuff", sprintf("%s.har", domain(.x))) # saved with the base domain; you may want to use a UUID via uuid::UUIDgenerate()
    )
    
    sprintf("Successfully retrieved %s", .x)
    
  }
  
}) -> results

(Those with a keen eye will grok why splashr supports Splash API basic authentication, now)

The parallel iterator will return a list we can flatten to a character vector (I don’t do that by default since it’s safer to get a list back as it can hold anything and map_chr() likes to check for proper objects) to check for errors with something like:

flatten_chr(results) %>% 
  keep(str_detect, "Error")
## [1] "Error retrieving www.1.example.com (Service Unavailable (HTTP 503).)"
## [2] "Error retrieving www.100.example.com (Gateway Timeout (HTTP 504).)"
## [3] "Error retrieving www.3000.example.com (Bad Gateway (HTTP 502).)"
## [4] "Error retrieving www.a.example.com (Bad Gateway (HTTP 502).)"
## [5] "Error retrieving www.z.examples.com (Gateway Timeout (HTTP 504).)"

Timeouts would suggest you may need to up the timeout parameter in your Splash call. Service unavailable or bad gateway errors may suggest you need to tweak the Aquarium configuration to add more workers or reduce your plan(…). It’s not unusual to have to create a scraping process that accounts for errors and retries a certain number of times.
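
A hedged sketch of what that retry loop might look like, reusing safe_splash() from above (max_tries and the backoff are arbitrary choices):

splash_with_retry <- function(org_url, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    res <- safe_splash(org_url) # from the earlier block
    if (!is.null(res$result)) return(res) # success: hand back the HAR-bearing result
    Sys.sleep(2 * i) # naive linear backoff between attempts
  }
  res # give back the last, error-laden result
}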

If you were stuck in the splashr/Splash slow-lane before, give this a try to help save you some time and frustration.

I had to process a bunch of emails for a $DAYJOB task this week and my “default setting” is to use R for pretty much everything (this should come as no surprise). Treating mail as data is not an uncommon task and many R packages exist that can reach out and grab mail from servers or work directly with local mail archives.

Mbox’in off the rails on a crazy tm

This particular mail corpus is in mbox format since it was saved via Apple Mail. It’s one big text file with each message appearing one after the other. The format has been around for decades, and R’s tm package — via the tm.plugin.mail plugin package — can process these mbox files.

To demonstrate, we’ll use an Apple Mail archive excerpt from a set of R mailing list messages as they are not private/sensitive:

library(tm)
library(tm.plugin.mail)

# point the tm corpus machinery to the mbox file and let it know the timestamp format since it varies
VCorpus(
  MBoxSource("~/Data/test.mbox/mbox"),
  readerControl = list(
    reader = readMail(DateFormat = "%a, %e %b %Y %H:%M:%S %z")
  )
) -> mbox

str(unclass(mbox), 1)
## List of 3
##  $ content:List of 198
##  $ meta   : list()
##   ..- attr(*, "class")= chr "CorpusMeta"
##  $ dmeta  :'data.frame': 198 obs. of  0 variables

str(unclass(mbox[[1]]), 1)
## List of 2
##  $ content: chr [1:476] "Try this:" "" "> library(lubridate)" "> library(tidyverse)" ...
##  $ meta   :List of 9
##   ..- attr(*, "class")= chr "TextDocumentMeta"

str(unclass(mbox[[1]]$meta), 1)
## List of 9
##  $ author       : chr "jim holtman <jholtman@gmail.com>"
##  $ datetimestamp: POSIXlt[1:1], format: "2018-08-01 15:01:17"
##  $ description  : chr(0) 
##  $ heading      : chr "Re: [R] read txt file - date - no space"
##  $ id           : chr ""
##  $ language     : chr "en"
##  $ origin       : chr(0) 
##  $ header       : chr [1:145] "Delivered-To: bob@rud.is" "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id b1-v6csp950182imq;" "        Wed, 1 Aug 2018 08:02:23 -0700 (PDT)" "X-Google-Smtp-Source: AAOMgpcdgBD4sDApBiF2DpKRfFZ9zi/4Ao32Igz9n8vT7EgE6InRoa7VZelMIik7OVmrFCRPDBde" ...
##  $              : NULL

We’re using unclass() since the str() output gets a bit crowded with all of the tm class attributes stuck in the output display.

The tm suite is designed for text mining. My task had nothing to do with text mining and I really just needed some header fields and body content in a data frame. If you’ve been working with R for a while, some things in the str() output will no doubt cause a bit of angst. For instance:

  • datetimestamp: POSIXlt[1:1]: POSIXlt and data frames really don’t mix well
  • description : chr(0) / origin : chr(0): zero-length character vectors ☹️
  • $ : NULL: a blank element name with a NULL value…I Don’t Even 🤷‍♀️

The tm suite is also super opinionated and “helpfully” left out a ton of headers (though it did keep the source for the complete headers around). Still, we can roll up our sleeves and turn that into a data frame:

# helper function for cleaner/shorter code
`%|0|%` <- function(x, y) { if (length(x) == 0) y else x }

# might as well stay old-school since we're using tm
do.call(
  rbind.data.frame,
  lapply(mbox, function(.x) {

    # we have a few choices, but this one is pretty explicit abt what it does
    # so we'll likely be able to decipher it quickly in 2 years when/if we come
    # back to it

    data.frame(
      author = .x$meta$author %|0|% NA_character_,
      datetimestamp = as.POSIXct(.x$meta$datetimestamp %|0|% NA),
      description = .x$meta$description %|0|% NA_character_,
      heading = .x$meta$heading %|0|% NA_character_,
      id = .x$meta$id %|0|% NA_character_,
      language = .x$meta$language %|0|% NA_character_,
      origin = .x$meta$origin %|0|% NA_character_,
      header = I(list(.x$meta$header %|0|% NA_character_)),
      body = I(list(.x$content %|0|% NA_character_)),
      stringsAsFactors = FALSE
    )

  })
) %>%
  glimpse()
## Observations: 198
## Variables: 9
## $ author         <chr> "jim holtman <jholtman@gmail.com>", "PIKAL Petr ...
## $ datetimestamp  <dttm> 2018-08-01 15:01:17, 2018-08-01 13:09:18, 2018-...
## $ description    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ heading        <chr> "Re: [R] read txt file - date - no space", "Re: ...
## $ id             <chr> ...
## $ language       <chr> "en", "en", "en", "en", "en", "en", "en", "en", ...
## $ origin         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ header         <list> [Delivere...., Delivere...., Delivere...., De...
## $ body           <list> [Try this...., SGkNCg0K...., Dear Pik...., De...

That wasn’t a huge effort, but we would now have to re-process the headers and/or write a custom version of tm.plugin.mail::readMail() (the function source is very readable and extendable) to get any extra data out. Here’s what that might look like:

# Custom msg reader
read_mail <- function(elem, language, id) {

  # extract header val
  hdr_val <- function(src, pat) {
    gsub(
      sprintf("%s: ", pat), "",
      grep(sprintf("^%s:", pat), src, value = TRUE, useBytes = TRUE)
    ) %|0|% NA
  }

  mail <- elem$content

  index <- which(mail == "")[1]
  header <- mail[1:index]
  mid <- hdr_val(header, "Message-ID")

  PlainTextDocument(
    x = mail[(index + 1):length(mail)],
    author = hdr_val(header, "From"),

    spam_score = hdr_val(header, "X-Spam-Score"), ### <<==== an extra header!

    datetimestamp = as.POSIXct(hdr_val(header, "Date"), format = "%a, %e %b %Y %H:%M:%S %z", tz = "GMT"),
    description = NA_character_,
    header = header,
    heading = hdr_val(header, "Subject"),
    id = if (length(mid)) mid[1] else id,
    language = language,
    origin = hdr_val(header, "Newsgroups"),
    class = "MailDocument"
  )

}

VCorpus(
  MBoxSource("~/Data/test.mbox/mbox"),
  readerControl = list(reader = read_mail)
) -> mbox

str(unclass(mbox[[1]]$meta), 1)
## List of 9
##  $ author       : chr "jim holtman <jholtman@gmail.com>"
##  $ datetimestamp: POSIXct[1:1], format: "2018-08-01 15:01:17"
##  $ description  : chr NA
##  $ heading      : chr "Re: [R] read txt file - date - no space"
##  $ id           : chr ""
##  $ language     : chr "en"
##  $ origin       : chr NA
##  $ spam_score   : chr "-3.631"
##  $ header       : chr [1:145] "Delivered-To: bob@rud.is" "Received: by 2002:ac0:e681:0:0:0:0:0 with SMTP id b1-v6csp950182imq;" "        Wed, 1 Aug 2018 08:02:23 -0700 (PDT)" "X-Google-Smtp-Source: AAOMgpcdgBD4sDApBiF2DpKRfFZ9zi/4Ao32Igz9n8vT7EgE6InRoa7VZelMIik7OVmrFCRPDBde" ...

If we wanted all the headers, there are even more succinct ways to solve for that use case.
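
One hedged possibility: skip per-field extraction entirely, keep each message’s complete header block as a list column, and parse fields lazily later with a hdr_val()-style helper:

map_df(seq_along(mbox), ~{
  tibble(
    header = list(mbox[[.x]]$meta$header), # the complete raw headers
    body = list(mbox[[.x]]$content)
  )
})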

Packaging up emails with a reticulated message.mbox

Since the default functionality of tm.plugin.mail::readMail() forced us to work a bit to get what we needed there’s some justification in seeking out an alternative path. I’ve written about reticulate before and am including it in this post as the Python standard library module mailbox can also make quick work of mbox files.

Two pieces of advice I generally reiterate when I talk about reticulate are that I highly recommend using Python 3 (remember, it’s a fragmented ecosystem) and that I prefer specifying the specific target Python to use via the RETICULATE_PYTHON environment variable that I have in ~/.Renviron as RETICULATE_PYTHON=/usr/local/bin/python3.

Let’s bring the mailbox module into R:

library(reticulate)
library(tidyverse)

mailbox <- import("mailbox")

If you're unfamiliar with a Python module or object, you can get help right in R via reticulate::py_help(). Et sequitur: py_help(mailbox) will bring up the text help for that module and py_help(mailbox$mbox) (remember, we swap out dots for dollars when referencing Python object components in R) will do the same for the mailbox.mbox class.

Text help is great and all, but we can also render it to HTML with this helper function:

py_doc <- function(x) {
  require("htmltools")
  require("reticulate")
  pydoc <- reticulate::import("pydoc")
  htmltools::html_print(
    htmltools::HTML(
      pydoc$render_doc(x, renderer=pydoc$HTMLDoc())
    )
  )
}

Here's what the text and HTML help for mailbox.mbox look like side-by-side:

We can also use a helper function to view the online documentation:

readthedocs <- function(obj, py_ver=3, check_keywords = "yes") {
  require("glue")
  query <- obj$`__name__`
  browseURL(
    glue::glue(
      "https://docs.python.org/{py_ver}/search.html?q={query}&check_keywords={check_keywords}"
    )
  )
}

Et sequitur: readthedocs(mailbox$mbox) will take us to this results page

Going back to the task at hand, we need to cycle through the messages and make a data frame for the bits we (well, I) care about. The reticulate package does an amazing job making Python objects first-class citizens in R, but Python objects may feel "opaque" to R users since we have to use the $ syntax to get to methods and values and — very often — familiar helpers such as str() are less than helpful on these objects. Let's try to look at the first message (remember, Python is 0-indexed):

# point the mailbox module at the same mbox file we used with tm
mbox <- mailbox$mbox("~/Data/test.mbox/mbox")

msg1 <- mbox$get(0)

str(msg1)

msg1

The output for those last two calls is not shown because they both are just a large text dump of the message source. #unhelpful

We can get more details, and we'll wrap some punctuation-filled calls in two, small helper functions that have names that will sound familiar:

pstr <- function(obj, ...) { str(obj$`__dict__`, ...) } # like 'str()`

pnames <- function(obj) { import_builtins()$dir(obj) } # like 'names()' but more complete

Let's see them in action:

pstr(msg1, 1) # we can pass any params str() will take
## List of 10
##  $ _from        : chr "jholtman@gmail.com Wed Aug 01 15:02:23 2018"
##  $ policy       :Compat32()
##  $ _headers     :List of 56
##  $ _unixfrom    : NULL
##  $ _payload     : chr "Try this:\n\n> library(lubridate)\n> library(tidyverse)\n> input <- read.csv(text =3D \"date,str1,str2,str3\n+ "| __truncated__
##  $ _charset     : NULL
##  $ preamble     : NULL
##  $ epilogue     : NULL
##  $ defects      : list()
##  $ _default_type: chr "text/plain"

pnames(msg1)
##  [1] "__bytes__"                 "__class__"                
##  [3] "__contains__"              "__delattr__"              
##  [5] "__delitem__"               "__dict__"                 
##  [7] "__dir__"                   "__doc__"                  
##  [9] "__eq__"                    "__format__"               
## [11] "__ge__"                    "__getattribute__"         
## [13] "__getitem__"               "__gt__"                   
## [15] "__hash__"                  "__init__"                 
## [17] "__init_subclass__"         "__iter__"                 
## [19] "__le__"                    "__len__"                  
## [21] "__lt__"                    "__module__"               
## [23] "__ne__"                    "__new__"                  
## [25] "__reduce__"                "__reduce_ex__"            
## [27] "__repr__"                  "__setattr__"              
## [29] "__setitem__"               "__sizeof__"               
## [31] "__str__"                   "__subclasshook__"         
## [33] "__weakref__"               "_become_message"          
## [35] "_charset"                  "_default_type"            
## [37] "_explain_to"               "_from"                    
## [39] "_get_params_preserve"      "_headers"                 
## [41] "_payload"                  "_type_specific_attributes"
## [43] "_unixfrom"                 "add_flag"                 
## [45] "add_header"                "as_bytes"                 
## [47] "as_string"                 "attach"                   
## [49] "defects"                   "del_param"                
## [51] "epilogue"                  "get"                      
## [53] "get_all"                   "get_boundary"             
## [55] "get_charset"               "get_charsets"             
## [57] "get_content_charset"       "get_content_disposition"  
## [59] "get_content_maintype"      "get_content_subtype"      
## [61] "get_content_type"          "get_default_type"         
## [63] "get_filename"              "get_flags"                
## [65] "get_from"                  "get_param"                
## [67] "get_params"                "get_payload"              
## [69] "get_unixfrom"              "is_multipart"             
## [71] "items"                     "keys"                     
## [73] "policy"                    "preamble"                 
## [75] "raw_items"                 "remove_flag"              
## [77] "replace_header"            "set_boundary"             
## [79] "set_charset"               "set_default_type"         
## [81] "set_flags"                 "set_from"                 
## [83] "set_param"                 "set_payload"              
## [85] "set_raw"                   "set_type"                 
## [87] "set_unixfrom"              "values"                   
## [89] "walk"

names(msg1)
##  [1] "add_flag"                "add_header"             
##  [3] "as_bytes"                "as_string"              
##  [5] "attach"                  "defects"                
##  [7] "del_param"               "epilogue"               
##  [9] "get"                     "get_all"                
## [11] "get_boundary"            "get_charset"            
## [13] "get_charsets"            "get_content_charset"    
## [15] "get_content_disposition" "get_content_maintype"   
## [17] "get_content_subtype"     "get_content_type"       
## [19] "get_default_type"        "get_filename"           
## [21] "get_flags"               "get_from"               
## [23] "get_param"               "get_params"             
## [25] "get_payload"             "get_unixfrom"           
## [27] "is_multipart"            "items"                  
## [29] "keys"                    "policy"                 
## [31] "preamble"                "raw_items"              
## [33] "remove_flag"             "replace_header"         
## [35] "set_boundary"            "set_charset"            
## [37] "set_default_type"        "set_flags"              
## [39] "set_from"                "set_param"              
## [41] "set_payload"             "set_raw"                
## [43] "set_type"                "set_unixfrom"           
## [45] "values"                  "walk"

# See the difference between pnames() and names()

setdiff(pnames(msg1), names(msg1))
##  [1] "__bytes__"                 "__class__"                
##  [3] "__contains__"              "__delattr__"              
##  [5] "__delitem__"               "__dict__"                 
##  [7] "__dir__"                   "__doc__"                  
##  [9] "__eq__"                    "__format__"               
## [11] "__ge__"                    "__getattribute__"         
## [13] "__getitem__"               "__gt__"                   
## [15] "__hash__"                  "__init__"                 
## [17] "__init_subclass__"         "__iter__"                 
## [19] "__le__"                    "__len__"                  
## [21] "__lt__"                    "__module__"               
## [23] "__ne__"                    "__new__"                  
## [25] "__reduce__"                "__reduce_ex__"            
## [27] "__repr__"                  "__setattr__"              
## [29] "__setitem__"               "__sizeof__"               
## [31] "__str__"                   "__subclasshook__"         
## [33] "__weakref__"               "_become_message"          
## [35] "_charset"                  "_default_type"            
## [37] "_explain_to"               "_from"                    
## [39] "_get_params_preserve"      "_headers"                 
## [41] "_payload"                  "_type_specific_attributes"
## [43] "_unixfrom"

Using just names() excludes the "hidden" builtins for Python objects, but knowing they are there and what they are can be helpful, depending on the program context.

Let's continue on the path to our messaging goal and see what headers are available. We'll use some domain knowledge about the _headers component, though we won't end up going that route to build a data frame:

map_chr(msg1$`_headers`, ~.x[[1]])
##  [1] "Delivered-To"               "Received"                  
##  [3] "X-Google-Smtp-Source"       "X-Received"                
##  [5] "ARC-Seal"                   "ARC-Message-Signature"     
##  [7] "ARC-Authentication-Results" "Return-Path"               
##  [9] "Received"                   "Received-SPF"              
## [11] "Authentication-Results"     "Received"                  
## [13] "X-Virus-Scanned"            "Received"                  
## [15] "Received"                   "Received"                  
## [17] "X-Virus-Scanned"            "X-Spam-Flag"               
## [19] "X-Spam-Score"               "X-Spam-Level"              
## [21] "X-Spam-Status"              "Received"                  
## [23] "Received"                   "Received"                  
## [25] "Received"                   "DKIM-Signature"            
## [27] "X-Google-DKIM-Signature"    "X-Gm-Message-State"        
## [29] "X-Received"                 "MIME-Version"              
## [31] "References"                 "In-Reply-To"               
## [33] "From"                       "Date"                      
## [35] "Message-ID"                 "To"                        
## [37] "X-Tag-Only"                 "X-Filter-Node"             
## [39] "X-Spam-Level"               "X-Spam-Status"             
## [41] "X-Spam-Flag"                "Content-Disposition"       
## [43] "Subject"                    "X-BeenThere"               
## [45] "X-Mailman-Version"          "Precedence"                
## [47] "List-Id"                    "List-Unsubscribe"          
## [49] "List-Archive"               "List-Post"                 
## [51] "List-Help"                  "List-Subscribe"            
## [53] "Content-Type"               "Content-Transfer-Encoding" 
## [55] "Errors-To"                  "Sender"

The mbox object does provide a get() method to retrieve header values, so we'll go that route to build our data frame. But first we'll make yet another helper, since something like msg1$get("this header does not exist") will return NULL, just as list(a = 1)$b would. We'll actually make two new helpers, since we also want to work safely with the payload content, and that means ensuring it's in UTF-8 encoding (mail systems are horribly diverse beasts and the R community is international; remember, we're using R mailing list messages):

# execute an object's get() method and return a character string or NA if no value was present for the key
get_chr <- function(.x, .y) { as.character(.x[["get"]](.y)) %|0|% NA_character_ }

# get the object's value as a valid UTF-8 string
utf8_decode <- function(.x) { .x[["decode"]]("utf-8", "ignore") %|0|% NA_character_ }

We're also doing this because I get really tired of using the $ syntax.
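Both helpers lean on the %|0|% infix operator for the NA fallback. In case its definition scrolled by earlier, a minimal re-creation looks something like this (hand back the right-hand side when the left-hand side is NULL or zero-length):

# minimal re-creation of the %|0|% default-value operator used above;
# length(NULL) is 0, so this covers both NULL and empty results
`%|0|%` <- function(x, y) if (length(x) == 0) y else x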

We also want the message content or payload. Modern mail messages can be really complex structures containing multiple multipart entities. To put it another way, there may be HTML, RTF, and plaintext versions of a message all in the same envelope. We want the plaintext ones, so we'll have to iterate through any multipart messages to (hopefully) get to a plaintext version. Since this post is already pretty long and we ignored errors in the tm portion, I'll refrain from including any error handling code here as well.

map_df(1:py_len(mbox), ~{

  m <- mbox$get(.x-1) # python uses 0-index lists

  list(
    date = as.POSIXct(get_chr(m, "date"), format = "%a, %e %b %Y %H:%M:%S %z"),
    from = get_chr(m, "from"),
    to = get_chr(m, "to"),
    subj = get_chr(m, "subject"),
    spam_score = get_chr(m, "X-Spam-Score")
  ) -> mdf

  content_type <- m$get_content_maintype() %|0|% NA_character_

  if (isTRUE(content_type[1] == "text")) { # we don't want images; isTRUE() also guards against an NA content type
    while (m$is_multipart()) m <- m$get_payload()[[1]] # cycle through until we get to something we can use
    mtmp <- m$get_payload(decode = TRUE) # get the message text
    mdf$body <- utf8_decode(mtmp) # make it safe to use
  }

  mdf

}) -> mbox_df

glimpse(mbox_df)
## Observations: 198
## Variables: 7
## $ date         <dttm> 2018-08-01 11:01:17, 2018-08-01 09:09:18, 20...
## $ from         <chr> "jim holtman ", "PIKAL Pe...
## $ to           <chr> "diego.avesani@gmail.com, R mailing list  ...
## $ subj         <chr> "Re: [R] read txt file - date - no space", "R...
## $ spam_score   <chr> "-3.631", "-3.533", "-3.631", "-3.631", "-3.5...
## $ content_type <chr> "text", "text", "text", "text", "text", "text...
## $ body         <chr> "Try this:\n\n library(lubridate)\n library...
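
With a proper data frame in hand, the usual tidyverse verbs all apply. As a quick, hypothetical follow-on (not part of the original task), here's how you'd peek at the most frequent posters in this slice of the archive:

# hypothetical follow-on: tally messages by sender
count(mbox_df, from, sort = TRUE) %>%
  head()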

FIN

By now, you've likely figured out this post really had nothing to do with reading mbox files. I mean, it did — and this was a task I had to do this week — but the real goal was to use a fairly basic task to help R folks edge a bit closer to becoming more friendly with Python in R. There are hundreds of thousands of Python packages out there and, while I'm one to wax poetic about having R or C[++]-backed R-native packages — and am wont to point out Python's egregiously prolific flaws — sometimes you just need to get something done quickly and wish to avoid reinventing the wheel. The reticulate package makes that eminently possible.

I'll be wrapping up some of the reticulate helper functions into a small package soon, so keep your eyes on RSS.


: You might want to read this even if you're not interested in mbox files. FIN (right above this note) might have some clues as to why.
1: yes, the section title was a stretch
2: am I doing this right, Mara? ;-)
3: Make Latin Great Again

After reading this interesting analysis of “How Often Are Americans’ Accounts Breached?” by Gaurav Sood (which we need more of in cyber-land) I gave in to the impulse to do some gg-doodling with the “Have I Been Pwnd” JSON data he used.

It’s just some basic data manipulation with some heavy ggplot2 styling customization, so no real need for exposition beyond noting that there are many other ways to view the data. I just settled on centered segments early on and went from there. If you do a bit of gg-doodling yourself, drop a note in the comments with a link.

You can see a full-size version of the image via this link.

library(hrbrthemes) # use github or gitlab version
library(tidyverse)

# get the data

dat_url <- "https://raw.githubusercontent.com/themains/pwned/master/data/breaches.json"

jsonlite::fromJSON(dat_url) %>% 
  mutate(BreachDate = as.Date(BreachDate)) %>% 
  tbl_df() -> breaches
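
# (optional) quick structural peek at what came back before the doodling;
# the columns used below (Name, BreachDate, PwnCount) come straight from
# the HIBP JSON
glimpse(breaches)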

# selected breach labels df
group_by(breaches, year = lubridate::year(BreachDate)) %>% 
  top_n(1, wt=PwnCount) %>% 
  ungroup() %>% 
  filter(year %in% c(2008, 2015, 2016, 2017)) %>% # pick years where labels will fit nicely
  mutate(
    lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount/1000000))
  ) %>% 
  arrange(year) -> labs

# num of known breaches in that year for labels
count(breaches, year = lubridate::year(BreachDate)) %>% 
  mutate(nlab = sprintf("n=%s", n)) %>% 
  mutate(lab_x = as.Date(sprintf("%s-07-02", year))) -> year_cts

mutate(breaches, p_half = PwnCount/2) %>% # for centered segments
  ggplot() +
  geom_segment( # centered segments
    aes(BreachDate, p_half, xend=BreachDate, yend=-p_half), 
    color = ft_cols$yellow, size = 0.3
  ) +
  geom_text( # selected breach labels
    data = labs, aes(BreachDate, PwnCount/2, label = lab),
    lineheight = 0.875, size = 3.25, family = font_rc,
    hjust = c(0, 1, 1, 0), vjust = 1, nudge_x = c(25, -25, -25, 25),
    nudge_y = 0,  color = ft_cols$slate
  ) +
  geom_text( # top year labels
    data = year_cts, aes(lab_x, Inf, label = year), family = font_rc, 
    size = 4, vjust = 1, lineheight = 0.875, color = ft_cols$gray
  ) +
  geom_text( # bottom known breach count totals
    data = year_cts, aes(lab_x, -Inf, label = nlab, size = n), 
    vjust = 0, lineheight = 0.875, color = ft_cols$peach,
    family = font_rc, show.legend = FALSE
  ) +
  scale_x_date( # break on year
    name = NULL, date_breaks = "1 year", date_labels = "%Y"
  ) +
  scale_y_comma(name = NULL, limits = c(-450000000, 450000000)) + # make room for labels
  scale_size_continuous(range = c(3, 4.5)) + # tolerable font sizes 
  labs(
    title = "HIBP (Known) Breach Frequency & Size",
    subtitle = "Segment length is number of accounts; n=# known account breaches that year",
    caption = "Source: HIBP via "
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_blank())
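
If you want to render your own full-size copy, a ggsave() call right after the plot code will do it; note the dimensions below are just a guess, not the settings used for the linked version:

# save a large copy of the most recent plot (width/height/dpi are guesses)
ggsave("hibp-breach-frequency.png", width = 12, height = 7, dpi = 150)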

Continuing the blog’s UDF theme of late, there are two new UDF kids in town: one for slicing & dicing URLs and one for wrangling internationalized domain names (IDNs).

Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill already has a parse_url() function?” My answer is “Sure, but it’s based on java.net.URL, which is fundamentally broken.”

Slicing & dicing URLs and IDNs is a large part of the $DAYJOB and they go together pretty well, hence the joint UDF release.

Rather than just use boring SQL for an example, we’ll start with some SQL and use R for a decent example of working with the two, new UDFs.

Counting Lying Lock Icons

SSL/TLS is all the rage these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a good indicator of HTTPS adoption, plus it will help show how the URL tools pick up ports even when they’re not explicitly in the URL).

If you go to the aforementioned URL, it instructs you that the most current GFG data set URL can be retrieved by inspecting the contents of this metadata URL.

There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.

Let’s get the data:

library(sergeant)
library(tidyverse)

read_delim(
  file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt", 
  delim = " ", 
  col_names = FALSE,
  col_types = "ccc"
) -> gfg_update

dl_path <- file.path("~/Data/gfg_links.tsv.gz")

if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)

Those operations have placed the GFG data set where my local Drill instance can get to it. It's a tab-separated file (TSV) which — while not a great data format — is workable with Drill.

Now we'll set up a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr. We'll use the second column, since a significant percentage of the URLs in column 6 are malformed:

db <- src_drill()

tbl(db, "(
SELECT 
  b.host,
  port,
  b.rec.hostname AS hostname,
  b.rec.assigned AS assigned,
  b.rec.tld AS tld,
  b.rec.subdomain AS subdomain
FROM
  (SELECT
    host, port, suffix_extract(host) AS rec             -- break the hostname into components
  FROM
    (SELECT
      a.rec.host AS host, a.rec.port AS port
    FROM
      (SELECT 
        columns[1] AS url, url_parse(columns[1]) AS rec -- break the URL into components
      FROM dfs.d.`/gfg_links.tsv.gz`) a
    WHERE a.rec.port IS NOT NULL                        -- filter out URL parsing failures
    )
  ) b
WHERE b.rec.tld IS NOT NULL                             -- filter out domain parsing failures
)") -> gfg_df

gfg_df
## # Database: DrillConnection
##    hostname  port host              subdomain assigned      tld  
##    <chr>    <int> <chr>             <chr>     <chr>         <chr>
##  1 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  2 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  3 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  4 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  5 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  6 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  7 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  8 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
##  9 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## 10 www         80 www.eestikirik.ee NA        eestikirik.ee ee   
## # ... with more rows

While we could have done it all in SQL, we saved some bits for R:

distinct(gfg_df, assigned, port) %>% 
  count(port) %>% 
  collect() -> port_counts

port_counts
## # A tibble: 2 x 2
##    port     n
## * <int> <int>
## 1    80 20648
## 2   443 22178
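
Doing the quick arithmetic on those counts: 22,178 of the 42,826 distinct assigned domains are on port 443, which works out to just under 52%:

# fraction of distinct assigned domains seen on port 443 (~0.518)
with(port_counts, n[port == 443] / sum(n))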

You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).

FIN

Now, R can do URL & IDN slicing, but Drill can operate at scale. That is, R's urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when it's part of a cluster.
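
For comparison, here's roughly the same slicing done in-memory with the urltools package (assuming you have it installed); perfectly fine for one file, just not for cluster-sized workloads:

# in-memory equivalent of the Drill UDFs via the urltools package
library(urltools)

url_parse("http://www.eestikirik.ee/some/path") # scheme/domain/port/path/...
suffix_extract("www.eestikirik.ee")             # host/subdomain/domain/suffix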

I'm not 100% settled on the galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.

Kick the tyres & file issues & PRs as necessary.