Skip navigation

The R language and RStudio IDE are a powerful combination for “getting stuff done”, and one aspect of R itself that makes it especially useful is the ability to use it with other programming languages via a robust foreign language interface capability1. The term “foreign language” refers to another programming language such as C, C++, Fortran, Java, Python, Rust, etc. A common way of referring the this idiom of using functionality written in another programming language from directly within R is “wrapping” since we’re putting an R “shell” around the code from the other language. Another term you may see used is “extending” (hence the title of the “Writing R Extensions” R manual).

While R supports using this this extension mechanism from any R script leaving tiny trails of R and other language source and binary files all across your filesystem is not exactly the best way to keep these components organized and creates other challenges when you come across the need to use them in other projects or share them with others. Thankfully, the R Core team, along with many individual contributors over the years, has made it pretty straightforward to incorporate this extension capability into R packages which are much easier (honest!) to organize and share.

The goal of this blog series — which will have a {bookdown} book companion along with some screencasts — is to help you get up to speed using R and RStudio to write R packages that wrap code from many different languages to help you “get stuff done” with as little friction as possible.

Base Requirements

It is assumed that readers are familiar with the R programming language, RStudio IDE, and are comfortable installing and using packages. Since this work is about extending R with other programming languages, you should also have some knowledge of one or more of the target languages being covered.

To follow along with the series you’ll need to ensure you have the necessary components installed along the way. Rather than overwhelm you with all of them up front, each new section will introduce requirements specific to the language or situation being covered. However, there are some fundamentals you’ll need to ensure are available.

  • An R2 environment, preferably R 3.6.x which is what was used for this series.
  • RStudio3, as we’ll be using many of the features provided in it to help reduce development friction
  • The {pkgbuild}4 package installed

Once you’ve gotten through those steps, you should fire up RStudio and run:

pkgbuild::check_build_tools(debug = TRUE)

which will help you make sure your particular system is ready to build packages.

After performing the build tool check and/or installation of the necessary core tools, you will then need to install the {devtools}5 package, which will help ensure that the remaining core packages required are installed.

We’re also going to use the git6 source code version control system. The git ecosystem is not “GitHub”, which is just a public (or, potentially somewhat private) place to house source code repositories, just like other hosted services such as Bitbucket, GitLab, or SourceHut. You can use the excellent “Happy Git with R”7 resource to help ensure you’re source code control environment is also ready to use.

Supplemental References

It may be helpful to create a browser bookmark folder for supplemental reference material that will be referred to from time-to-time across the sections (we’ll be adding to this list in each chapter, too):

Up Next

If you’ve been a user of “development versions” of R packages or have authored R packages you likely made quick work of this first installment. Those new to creating packages with R, those who tend to only use fully-baked CRAN versions of R packages, and/or those who have not worked with git before likely had to do quite a bit of work to get down to this point (if this describes you, you definitely deserve both a break and kudos for getting this far!).

In the next installment we’ll make sure the package building infrastructure is ready to roll by creating a basic R package that we’ll use as a building block for future work.


  1. “Writing R Extensions”; Chapter 5, “System and foreign language interfaces”; (https://cran.r-project.org/doc/manuals/r-release/R-exts.html#System-and-foreign-language-interfaces
  2. R Project Home (https://www.r-project.org/
  3. RStudio Home (http://rstudio.com/
  4. {pkgbuild} CRAN page (https://cran.rstudio.com/web/packages/pkgbuild/
  5. {devtools} Home (https://devtools.r-lib.org/
  6. git Home (https://git-scm.com/
  7. Happy Git with R (https://happygitwithr.com/

I woke up to Axios’ “1 Big Thing” ridgeline chart showing the crazy that was the 2019 news cycle:

and, I decided to reproduce it in {ggplot2}.

Getting The Data

First, I had to find the data. The Axios chart is interactive, so I assumed the visualization was built on-load. It was, but the data was embedded in a javascript file vs loaded as JSON via an XHR request:

which was easy enough to turn into JSON anyone can use.

NOTE: The # hrbrmstr/hrbrthemes is an indication you may need to use the version of {hrbrthemes} from my gitea/sourcehut/gitlab/bitbucket/github. That package has instructions for installing fonts needed. Sub out theme_ipsum_es() with theme_ipsum(), theme_ipsum_rc() or just use theme_bw() and tweak aesthetics manually.

library(ggalt)
library(hrbrthemes) # hrbrmstr/hrbrthemes
library(tidyverse)

jsonlite::fromJSON("https://rud.is/dl/2019-axios-news.json") %>% 
  as_tibble() -> xdf

xdf
## # A tibble: 31 x 3
##    name                    avg data      
##    <chr>                 <dbl> <list>    
##  1 Gov't shutdown        20.5  <int [51]>
##  2 Mexico-U.S. border    22.8  <int [51]>
##  3 Green New Deal        11.3  <int [51]>
##  4 Blackface              9.61 <int [51]>
##  5 N. Korea-Hanoi Summit 11.2  <int [51]>
##  6 Boeing 737 Max         4.79 <int [51]>
##  7 Brexit                28.5  <int [51]>
##  8 Israel                42.1  <int [51]>
##  9 SpaceX                24.1  <int [51]>
## 10 Game of Thrones       16.8  <int [51]>
## # … with 21 more rows

This is pretty tidy already, but we’ll need to expand the data column and give each week an index:

unnest(xdf, data) %>% 
  group_by(name) %>% 
  mutate(idx = 1:n()) %>% 
  ungroup() %>% 
  mutate(name = fct_inorder(name)) -> xdf # making a factor foe strip/panel ordering 

xdf
## # A tibble: 1,581 x 4
##    name             avg  data   idx
##    <fct>          <dbl> <int> <int>
##  1 Gov't shutdown  20.5    69     1
##  2 Gov't shutdown  20.5   100     2
##  3 Gov't shutdown  20.5    96     3
##  4 Gov't shutdown  20.5   100     4
##  5 Gov't shutdown  20.5    19     5
##  6 Gov't shutdown  20.5     9     6
##  7 Gov't shutdown  20.5    17     7
##  8 Gov't shutdown  20.5     3     8
##  9 Gov't shutdown  20.5     2     9
## 10 Gov't shutdown  20.5     1    10
## # … with 1,571 more rows

We’ll take this opportunity to find the first week of each month (via rle()) so we can have decent axis labels:

# get index placement for each month axis label
sprintf("2019-%02s-1", 1:51) %>% 
  as.Date(format = "%Y-%W-%w") %>% 
  format("%b") %>% 
  rle() -> mons

mons
## Run Length Encoding
##   lengths: int [1:12] 4 4 4 5 4 4 5 4 5 4 ...
##   values : chr [1:12] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" ...

month_idx <- cumsum(mons$lengths)-3

month_idx
##  [1]  1  5  9 14 18 22 27 31 36 40 44 48

We’ve got all we need to make a {ggplot2} version of the chart. Here’s the plan:

  • use geom_area() and map colour and fill to avg (like Axios did), using an medium alpha value so we can still see below the overlapped areas
  • also use an xspline() stat with geom_area() so we get smooth lines vs pointy ones
  • use geom_hline() vs an axis line so we can map a colour aesthetic to avg as well
  • make a custom x-axis scale so we can place the labels we just made
  • expand the y-axis upper limit to avoid cutting off any part of the geoms
  • use the inferno viridis palette, but not the extremes of it
  • make facets/panels on the name, positioning the labels on the left
  • finally, tweak strip positioning so we get overlapped charts
ggplot(xdf, aes(idx, data)) +
  geom_area(alpha = 1/2, stat = "xspline", aes(fill = avg, colour = avg)) +
  geom_hline(
    data = distinct(xdf, name, avg),
    aes(yintercept = 0, colour = avg), size = 0.5
  ) +
  scale_x_continuous(
    expand = c(0,0.125), limits = c(1, 51),
    breaks = month_idx, labels = month.abb
  ) +
  scale_y_continuous(expand = c(0,0), limits = c(0, 105)) +
  scale_colour_viridis_c(option = "inferno", direction = -1, begin = 0.1, end = 0.9) +
  scale_fill_viridis_c(option = "inferno", direction = -1, begin = 0.1, end = 0.9) +
  facet_wrap(~name, ncol = 1, strip.position = "left", dir = "h") +
  labs(
    x = NULL, y = NULL, fill = NULL, colour = NULL,
    title = "1 big thing: The insane news cycles of 2019",
    subtitle = "Height is search interest in a given topic, indexed to 100.\nColor is average search interest between Dec. 30, 2018–Dec. 20, 2019",
    caption = "Source: Axios <https://www.axios.com/newsletters/axios-am-1d9cd913-6142-43b8-9186-4197e6da7669.html?chunk=0#story0>\nData: Google News Lab. Orig. Chart: Danielle Alberti/Axios"
  ) +
  theme_ipsum_es(grid="X", axis = "") +
  theme(strip.text.y = element_text(angle = 180, hjust = 1, vjust = 0)) +
  theme(panel.spacing.y = unit(-0.5, "lines")) +
  theme(axis.text.y = element_blank()) +
  theme(legend.position = "none")

To produce this finished product:

FIN

The chart could be tweaked a bit more to get even closer to the Axios finished product.

Intrepid readers can also try to use {plotly} to make an interactive version.

Somehow, I get the feeling 2020 will have an even more frenetic news cycle.

The right jolly old elves over at Alteryx created a “Santalytics” challenge back in 2016 to see if their community members could help Santa deliver presents to kids all across the globe.

They posted data for four challenges along with solutions and I’ve made a git repo & RStudio project with the challenges and solves for two of the four (I was going to try to have all four done but December has been a cruel master when it comes to allowing for free time).

Most of tasks are pretty straightforward and range from basic joining and grouping to some spatial optimizations (but all very do-able with a little elbow grease). The featured image at the top of the blog is one solution to finding “distribution hubs” for all the presents.

You can find the starter Rmd and data files over at your favorite social coding site:

FIN

Give Santa a hand and blog your approach to solving each challenge!

All four of our offspring are home for Christmas this year (w00t!!!) so this is likely the last blog post of 2019. Many blessings to all as you celebrate this time of year and catch y’all in 2020!

I had need to present a wall-of-text to show off a giant list of SSL certificate alternate names and needed the entire list to fit on one slide (not really for reading in full, but to show just how many there were in a way that a simple count would not really convey).

Keynote, PowerPoint, and gslides all let you make tables or draw boxes but I really didn’t want to waste time fiddling as much as I’d need to with those tools just for this one slide.

Thankfully, I remembered that HTML5 <div> elements can be styled with a column-count attribute and we can use {htmltools} to make quick work of this task.

To show it off, first we’ll need some words, so let’s make some using stringi::stri_rand_lipsum():

library(stringi)
library(htmltools)
library(tidyverse)

set.seed(201912)

stri_rand_lipsum(5) %>%
  stri_paste(collapse = " ") %>%
  stri_split_boundaries() %>%
  flatten_chr() %>%
  stri_trim_both() -> words

head(words)
## [1] "Lorem"  "ipsum"  "dolor"  "sit"    "amet,"  "sapien"

length(words)
## [1] 514

Now, we’ll make a function — columnize() — that we can reuse in the future and have it take in a character vector, the column count we want and some CSS styling, then use some {htmltools} tag functions to make quick work of this task:

columnize <- function(words, ncol = 5,
                      style = "p { font-family:'Roboto Condensed';font-size:12pt;line-height:12.5pt;padding:0;margin:0}") {

  tagList(
    tags$style(style[1]),
    tags$div(
      words %>%
        map(tags$p) %>%
        tagList(),
      style = sprintf("column-count:%d", as.integer(ncol[1]))
    )
  )

}

In this function we turn the style param into a <style> section in the generated HTML, then turn words into <p> tags wrapped in <div>.

This function can be used in a R Markdown code block (set block parameters to results='markup') to have the columns appear automagically in the resultant HTML document output. You can also use it in standalone fashion by using html_print() on the results:

html_print(
  columnize(words, 10)
)

10 columns of example text made with R

The above is an image just for easier blog display purposes. You can test out a working example from a spun R script over at https://rud.is/rpubs/columnize.html that has some different column count examples. Grow and shrink the browser width to see how the columns shrink and grow with it.

FIN

Hopefully this helps others save time and effort like it did for me today. You can experiment with making the columnize() function more robust by having it work with all the other column-formatting properties:

  • column-count: Specifies the number of columns an element should be divided into
  • column-fill: Specifies how to fill columns
  • column-gap: Specifies the gap between the columns
  • column-rule: A shorthand property for setting all the column-rule-* properties
  • column-rule-color: Specifies the color of the rule between columns
  • column-rule-style: Specifies the style of the rule between columns
  • column-rule-width: Specifies the width of the rule between columns
  • column-span: Specifies how many columns an element should span across
  • column-width: Specifies a suggested, optimal width for the columns
  • columns: A shorthand property for setting column-width and column-count

You can find out more about these properties (and play with some examples) over at https://www.w3schools.com/css/css3_multiple_columns.asp.

POST-FIN

I robustified the function a bit:

#' Make a responsive columnar text div
#'
#' @param words character vector of text to present in a columnar div
#' @param div_id tag `id` attribute to assign to the `<div>` (which can help you style it with the `style` param).
#' @param div_class tag `class` attribute to assign to the `<div>` (which can help you style it with the `style` param)
#' @param ncol number of columns
#' @param width  specifies the column width; one of "`auto`" (the default) which leaves it up to the
#'        browser implementation, a _length_ CSS size value that specifies the width of the columns.
#'        The number of columns will be the minimum number of columns needed to show all the content
#'        across the element., "`initial`" or "`inherit`" (see `fill` for descriptions of those).
#' @param fill how to fill columns, balanced or not. One of "`balance`", "`auto`", "`initial`", "`inherit`".
#'        Defaults to "`balance`" which fills each column with about the same amount of content, but will not
#'        allow the columns to be taller than the height (so, columns might be shorter than the height as the
#'        browser distributes the content evenly horizontally). "`auto`" fills each column until it reaches
#'        the height, and do this until it runs out of content (so, this value will not necessarily fill all
#'        the columns nor fill them evenly). "`initial`" sets this property to its default value; and
#'        "`inherit`" inherits this property from its parent element.
#' @param gap either a textual value (e.g. "`10px`") for the spacing gap between columns, or "`normal`"
#'        (the default) which uses a `1em` gap on most browsers, "`initial`" or "`inherit`" (see `fill` for descriptions
#'        of those).
#' @param rule_color specifies the CSS color value of the rule between columns; also can be "`initial`" or "`inherit`"
#'        (see `fill` for descriptions of those).
#' @param rule_style specifies the style of the rule between columns; valid values are "`none`" (the default) for no
#'        rule, "`hidden`", "`dotted`", "`dashed`", "`solid`", "`double`", "`groove`" for 3D grooved rule, "`ridge`"
#'        for a 3D ridged rule, "`inset`" for a 3D inset rule, "`outset`" for a 3D outset rule, "`initial`" or
#'        "`inherit`" (see `fill` for descriptions of those).
#' @param rule_width specifies the width of the rule between columns; one of "`medium`" (the default), "`thin`", "`thick`",
#'        a _length_ CSS size value, `initial`" or "`inherit`" (see `fill` for descriptions of those).
#' @param span specifies how many columns an element should span across; one of "`none`" (the default) so the element spans
#'        across one column, "`all`" (spans across all columns), "`initial`" or "`inherit`"
#'        (see `fill` for descriptions of those).
#' @param style CSS style properties (complete text spec) that will be put into an `{htmltools}` `tags$style()` call that
#'        will come along for the ride with the `<div>`; useful for specifying `<p>` properties for each item of the
#'        `words` vector
#' @note No validation is done on inputs
#' @export
#' @examples
#' columnize(state.name, ncol = 3, rule_color = "black", rule_width = "0.5px")
columnize <- function(words,
                      div_id = NULL,
                      div_class = NULL,
                      ncol = 5,
                      width = "auto",
                      fill = "balance",
                      gap = "normal",
                      rule_color = "initial",
                      rule_style = "none",
                      rule_width = "medium",
                      span = "none",
                      style = "p {font-family:'Roboto Condensed';font-size:12pt;line-height:12.5pt;padding:0;margin:0}") {

  tagList(
    tagList(
     do.call(tags$style, as.list(style)),
    ),
    tags$div(
      id = div_id,
      class = div_class,
      words %>%
        map(tags$p) %>%
        tagList(),
      style = sprintf(
        paste0(c(
          "column-count:%s",
          "column-fill: %s",
          "column-gap: %s",
          "column-rule-color: %s",
          "column-rule-style: %s",
          "column-rule-width: %s",
          "column-span: %s",
          "column-width: %s"
        ), collapse = ";"),
        ncol, fill, gap, rule_color, rule_style, rule_width, span, width
      )
    )
  )

}

So now you can do something like:

columnize(
  div_id = "states",
  words = state.name, 
  ncol = 3, 
  rule_color = "black", 
  rule_style = "solid", 
  rule_width = "2px",
  style = c(
    "#states { width: 50%; text-align: center };\np {font-family:'Roboto Condensed'}",
    "p { font-family: 'sans-serif'}"
  )
) %>% 
  htmltools::html_print()

and get:

Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming

Apple has brought Screen Time to macOS for some time now and that means it has to store this data somewhere. Thankfully, Sarah Edwards has foraged through the macOS filesystem for us and explained where these bits of knowledge are in her post, Knowledge is Power! Using the macOS/iOS knowledgeC.db Database to Determine Precise User and Application Usage, which ultimately reveals the data lurks in ~/Library/Application Support/Knowledge/knowledgeC.db. Sarah also has a neat little Python utility dubbed APOLLO (Apple Pattern of Life Lazy Output’er) which has a smattering of knowledgeC.db canned SQL queries that cover a myriad of tracked items.

Today, we’ll show how to work with this database in R and the {tidyverse} to paint our own pictures of application usage.

There are quite a number of tables in the knowledgeC.db SQLite 3 database:

That visual schema was created in OmniGraffle via a small R script that uses the OmniGraffle automation framework. The OmniGraffle source files are also available upon request.

Most of the interesting bits (for any tracking-related spelunking) are in the ZOBJECT table and to get a full picture of usage we’ll need to join it with some other tables that are connected via a few foreign keys:

There are a few ways to do this in {tidyverse} R. The first is an extended straight SQL riff off of one of Sarah’s original queries:

library(hrbrthemes) # for ggplot2 machinations
library(tidyverse)

# source the knowledge db
kdb <- src_sqlite("~/Library/Application Support/Knowledge/knowledgeC.db")

tbl(
  kdb, 
  sql('
SELECT
  ZOBJECT.ZVALUESTRING AS "app", 
    (ZOBJECT.ZENDDATE - ZOBJECT.ZSTARTDATE) AS "usage",  
    CASE ZOBJECT.ZSTARTDAYOFWEEK 
      WHEN "1" THEN "Sunday"
      WHEN "2" THEN "Monday"
      WHEN "3" THEN "Tuesday"
      WHEN "4" THEN "Wednesday"
      WHEN "5" THEN "Thursday"
      WHEN "6" THEN "Friday"
      WHEN "7" THEN "Saturday"
    END "dow",
    ZOBJECT.ZSECONDSFROMGMT/3600 AS "tz",
    DATETIME(ZOBJECT.ZSTARTDATE + 978307200, \'UNIXEPOCH\') as "start_time", 
    DATETIME(ZOBJECT.ZENDDATE + 978307200, \'UNIXEPOCH\') as "end_time",
    DATETIME(ZOBJECT.ZCREATIONDATE + 978307200, \'UNIXEPOCH\') as "created_at", 
    CASE ZMODEL
      WHEN ZMODEL THEN ZMODEL
      ELSE "Other"
    END "source"
  FROM
    ZOBJECT 
    LEFT JOIN
      ZSTRUCTUREDMETADATA 
    ON ZOBJECT.ZSTRUCTUREDMETADATA = ZSTRUCTUREDMETADATA.Z_PK 
    LEFT JOIN
      ZSOURCE 
    ON ZOBJECT.ZSOURCE = ZSOURCE.Z_PK 
    LEFT JOIN
      ZSYNCPEER
    ON ZSOURCE.ZDEVICEID = ZSYNCPEER.ZDEVICEID
  WHERE
    ZSTREAMNAME = "/app/usage"'
  )) -> usage

usage
## # Source:   SQL [?? x 8]
## # Database: sqlite 3.29.0 [/Users/johndoe/Library/Application Support/Knowledge/knowledgeC.db]
##    app                      usage dow         tz start_time          end_time            created_at         source       
##    <chr>                    <int> <chr>    <int> <chr>               <chr>               <chr>              <chr>        
##  1 com.bitrock.appinstaller    15 Friday      -4 2019-10-05 01:11:27 2019-10-05 01:11:42 2019-10-05 01:11:… MacBookPro13…
##  2 com.tinyspeck.slackmacg…  4379 Tuesday     -4 2019-10-01 13:19:24 2019-10-01 14:32:23 2019-10-01 14:32:… Other        
##  3 com.tinyspeck.slackmacg…  1167 Tuesday     -4 2019-10-01 18:19:24 2019-10-01 18:38:51 2019-10-01 18:38:… Other        
##  4 com.tinyspeck.slackmacg…  1316 Tuesday     -4 2019-10-01 19:13:49 2019-10-01 19:35:45 2019-10-01 19:35:… Other        
##  5 com.tinyspeck.slackmacg… 12053 Thursday    -4 2019-10-03 12:25:18 2019-10-03 15:46:11 2019-10-03 15:46:… Other        
##  6 com.tinyspeck.slackmacg…  1258 Thursday    -4 2019-10-03 15:50:16 2019-10-03 16:11:14 2019-10-03 16:11:… Other        
##  7 com.tinyspeck.slackmacg…  2545 Thursday    -4 2019-10-03 16:24:30 2019-10-03 17:06:55 2019-10-03 17:06:… Other        
##  8 com.tinyspeck.slackmacg…   303 Thursday    -4 2019-10-03 17:17:10 2019-10-03 17:22:13 2019-10-03 17:22:… Other        
##  9 com.tinyspeck.slackmacg…  9969 Thursday    -4 2019-10-03 17:33:38 2019-10-03 20:19:47 2019-10-03 20:19:… Other        
## 10 com.tinyspeck.slackmacg…  2813 Thursday    -4 2019-10-03 20:19:52 2019-10-03 21:06:45 2019-10-03 21:06:… Other        
## # … with more rows

Before explaining what that query does, let’s rewrite it {dbplyr}-style:

tbl(kdb, "ZOBJECT") %>% 
  mutate(
    created_at = datetime(ZCREATIONDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),
    start_dow = case_when(
      ZSTARTDAYOFWEEK == 1 ~ "Sunday",
      ZSTARTDAYOFWEEK == 2 ~ "Monday",
      ZSTARTDAYOFWEEK == 3 ~ "Tuesday",
      ZSTARTDAYOFWEEK == 4 ~ "Wednesday",
      ZSTARTDAYOFWEEK == 5 ~ "Thursday",
      ZSTARTDAYOFWEEK == 6 ~ "Friday",
      ZSTARTDAYOFWEEK == 7 ~ "Saturday"
    ),
    start_time = datetime(ZSTARTDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),
    end_time = datetime(ZENDDATE + 978307200, "UNIXEPOCH", "LOCALTIME"),
    usage = (ZENDDATE - ZSTARTDATE),
    tz = ZSECONDSFROMGMT/3600 
  ) %>% 
  left_join(tbl(kdb, "ZSTRUCTUREDMETADATA"), c("ZSTRUCTUREDMETADATA" = "Z_PK")) %>% 
  left_join(tbl(kdb, "ZSOURCE"), c("ZSOURCE" = "Z_PK")) %>% 
  left_join(tbl(kdb, "ZSYNCPEER"), "ZDEVICEID") %>% 
  filter(ZSTREAMNAME == "/app/usage")  %>% 
  select(
    app = ZVALUESTRING, created_at, start_dow, start_time, end_time, usage, tz, source = ZMODEL
  ) %>% 
  mutate(source = ifelse(is.na(source), "Other", source)) %>% 
  collect() %>% 
  mutate_at(vars(created_at, start_time, end_time), as.POSIXct) -> usage

What we’re doing is pulling out the day of week, start/end usage times & timezone info, app bundle id, source of the app interactions and the total usage time for each entry along with when that entry was created. We need to do some maths since Apple stores time-y whime-y info in its own custom format, plus we need to convert numeric DOW to labeled DOW.

The bundle ids are pretty readable, but they’re not really intended for human consumption, so we’ll make a translation table for the bundle id to app name by using the mdls command.

list.files(
  c("/Applications", "/System/Library/CoreServices", "/Applications/Utilities", "/System/Applications"), # main places apps are stored (there are potentially more but this is sufficient for our needs)
  pattern = "\\.app$", 
  full.names = TRUE
) -> apps

x <- sys::exec_internal("mdls", c("-name", "kMDItemCFBundleIdentifier", "-r", apps))

# mdls null (\0) terminates each entry so we have to do some raw surgery to get it into a format we can use
x$stdout[x$stdout == as.raw(0)] <- as.raw(0x0a)

tibble(
  name = gsub("\\.app$", "", basename(apps)),
  app = read_lines(x$stdout) 
) -> app_trans

app_trans
## # A tibble: 270 x 2
##    name                    app                                    
##    <chr>                   <chr>                                  
##  1 1Password 7             com.agilebits.onepassword7             
##  2 Adium                   com.adiumX.adiumX                      
##  3 Agenda                  com.momenta.agenda.macos               
##  4 Alfred 4                com.runningwithcrayons.Alfred          
##  5 Amazon Music            com.amazon.music                       
##  6 Android File Transfer   com.google.android.mtpviewer           
##  7 Awsaml                  com.rapid7.awsaml                      
##  8 Bartender 2             com.surteesstudios.Bartender           
##  9 BBEdit                  com.barebones.bbedit                   
## 10 BitdefenderVirusScanner com.bitdefender.BitdefenderVirusScanner
## # … with 260 more rows

The usage info goes back ~30 days, so let’s do a quick summary of the top 10 apps and their total usage (in hours):

usage %>% 
  group_by(app) %>% 
  summarise(first = min(start_time), last = max(end_time), total = sum(usage, na.rm=TRUE)) %>% 
  ungroup() %>% 
  mutate(total = total / 60 / 60) %>% # hours
  arrange(desc(total)) %>% 
  left_join(app_trans) -> overall_usage

overall_usage %>% 
  slice(1:10) %>% 
  left_join(app_trans) %>%
  mutate(name = fct_inorder(name) %>% fct_rev()) %>%
  ggplot(aes(x=total, y=name)) + 
  geom_segment(aes(xend=0, yend=name), size=5, color = ft_cols$slate) +
  scale_x_comma(position = "top") +
  labs(
    x = "Total Usage (hrs)", y = NULL,
    title = glue::glue('App usage in the past {round(as.numeric(max(usage$end_time) - min(usage$start_time), "days"))} days')
  ) +
  theme_ft_rc(grid="X")

There’s a YUGE flaw in the current way macOS tracks application usage. Unlike iOS where apps really don’t run simultaneously (with iPadOS they kinda can/do, now), macOS apps are usually started and left open along with other apps. Apple doesn’t do a great job identifying only active app usage activity so many of these usage numbers are heavily inflated. Hopefully that will be fixed by macOS 10.15.

We have more data at our disposal, so let’s see when these apps get used. To do that, we’ll use segments to plot individual usage tracks and color them by weekday/weekend usage (still limiting to top 10 for blog brevity):

usage %>% 
  filter(app %in% overall_usage$app[1:10]) %>% 
  left_join(app_trans) %>%
  mutate(name = factor(name, levels = rev(overall_usage$name[1:10]))) %>% 
  ggplot() +
  geom_segment(
    aes(
      x = start_time, xend = end_time, y = name, yend = name, 
      color = ifelse(start_dow %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
    ),
    size = 10,
  ) +
  scale_x_datetime(position = "top") +
  scale_colour_manual(
    name = NULL,
    values = c(
      "Weekend" = ft_cols$light_blue, 
      "Weekday" = ft_cols$green
    )
  ) +
  guides(
    colour = guide_legend(override.aes = list(size = 1))
  ) +
  labs(
    x = NULL, y = NULL,
    title = glue::glue('Top 10 App usage on this Mac in the past {round(as.numeric(max(usage$end_time) - min(usage$start_time), "days"))} days'),
    subtitle = "Each segment represents that app being 'up' (Open to Quit).\nUnfortunately, this is what Screen Time uses for its calculations on macOS"
  ) +
  theme_ft_rc(grid="X") +
  theme(legend.position = c(1, 1.25)) +
  theme(legend.justification = "right")

I’m not entirely sure “on this Mac” is completely accurate since I think this syncs across all active Screen Time devices due to this (n is in seconds):

count(usage, source, wt=usage, sort=TRUE)
## # A tibble: 2 x 2
##   source               n
##   <chr>            <int>
## 1 Other          4851610
## 2 MacBookPro13,2 1634137

The “Other” appears to be the work-dev Mac but it doesn’t have the identifier mapped so I think that means it’s the local one and that the above chart is looking at Screen Time across all devices. I literally (right before this sentence) enabled Screen Time on my iPhone so we’ll see if that ends up in the database and I’ll post a quick update if it does.

We’ll take one last look by day of week and use a heatmap to see the results:

count(usage, start_dow, app, wt=usage/60/60) %>% 
  left_join(app_trans) %>%
  filter(app %in% overall_usage$app[1:10]) %>% 
  mutate(name = factor(name, levels = rev(overall_usage$name[1:10]))) %>% 
  mutate(start_dow = factor(start_dow, c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))) %>% 
  ggplot() +
  geom_tile(aes(start_dow, name, fill = n), color = "#252a32", size = 0.75) +
  scale_x_discrete(expand = c(0, 0.5), position = "top") +
  scale_y_discrete(expand = c(0, 0.5)) +
  scale_fill_viridis_c(direction = -1, option = "magma", name = "Usage (hrs)") +
  labs(
    x = NULL, y = NULL,
    title = "Top 10 App usage by day of week"
  ) +
  theme_ft_rc(grid="")

I really need to get into the habit of using the RStudio Server access features of RSwitch over Chrome so I can get RSwitch into the top 10, but some habits (and bookmarks) die hard.

FIN

Apple’s Screen Time also tracks “category”, which is something we can pick up from each application’s embedded metadata. We’ll do that in a follow-up post along with seeing whether we can capture iOS usage now that I’ve enabled Screen Time on those devices as well.

Keep spelunking the knowledgeC.db table(s) and blog about or reply in the comments with any interesting nuggets you find.

I posted a visualization of email safety status (a.k.a. DMARC) of the Fortune 500 (2017 list) the other day on Twitter and received this spiffy request from @MarkAltosaar:

There are many ways to achieve this result. I’ll show one here and walk through the process starting with the data (this is the 2018 DMARC evaluation run):

library(hrbrthemes) # CRAN or fav social coding site using hrbrmstr/pkgname
library(ggchicklet) # fav social coding site using hrbrmstr/pkgname
library(tidyverse)

f500_dmarc <- read_csv("https://rud.is/dl/f500-industry-dmarc.csv.gz", col_types = "cc")

f500_dmarc
## # A tibble: 500 x 2
##    industry               p         
##    <chr>                  <chr>     
##  1 Retailing              Reject    
##  2 Technology             None      
##  3 Health Care            Reject    
##  4 Wholesalers            None      
##  5 Retailing              Quarantine
##  6 Motor Vehicles & Parts None      
##  7 Energy                 None      
##  8 Wholesalers            None      
##  9 Retailing              None      
## 10 Telecommunications     Quarantine
## # … with 490 more rows

The p column is the DMARC classification for each organization (org names have been withheld to protect the irresponsible) and comes from the p=… value in the DMARC DNS TXT record field. It has a limited set of values, so let’s enumerate them and assign some colors:

dmarc_levels <- c("No DMARC", "None", "Quarantine", "Reject")
dmarc_cols <- set_names(c(ft_cols$slate, "#a6dba0", "#5aae61", "#1b7837"), dmarc_levels)

We want the aggregate value of each p, thus we need to do count counting:

(dmarc_summary <- count(f500_dmarc, industry, p))
## # A tibble: 63 x 3
##    industry            p              n
##    <chr>               <chr>      <int>
##  1 Aerospace & Defense No DMARC       9
##  2 Aerospace & Defense None           3
##  3 Aerospace & Defense Quarantine     1
##  4 Apparel             No DMARC       4
##  5 Apparel             None           1
##  6 Business Services   No DMARC       9
##  7 Business Services   None           7
##  8 Business Services   Reject         4
##  9 Chemicals           No DMARC      12
## 10 Chemicals           None           2
## # … with 53 more rows

We’re also going to want to sort the industries by those with the most DMARC (sorted bars/chicklets FTW!). We’ll need a factor for that, so let’s make one:

(dmarc_summary %>% 
  filter(p != "No DMARC") %>% # we don't care abt this `p` value
  count(industry, wt=n, sort=TRUE) -> industry_levels)
## # A tibble: 21 x 2
##    industry                      n
##    <chr>                     <int>
##  1 Financials                   54
##  2 Technology                   25
##  3 Health Care                  24
##  4 Retailing                    23
##  5 Wholesalers                  16
##  6 Energy                       12
##  7 Transportation               12
##  8 Business Services            11
##  9 Industrials                   8
## 10 Food, Beverages & Tobacco     6
## # … with 11 more rows

Now, we can make the chart:

dmarc_summary %>% 
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>% 
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>% 
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p)) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_continuous(expand = c(0,0), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

Doh! We rly want them to be 100% width. Thankfully, {ggplot2} has a position_fill() we can use instead of position_dodge():

dmarc_summary %>% 
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>% 
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>% 
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p), position = position_fill()) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_continuous(expand = c(0,0), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

Doh! Even though we forgot to use reverse = TRUE in the call to position_fill() everything is out of order. Kinda. It’s in the order we told it to be in, but that’s not right b/c we need it ordered by the in-industry percentages. If each industry had the same number of organizations, there would not have been an issue. Unfortunately, the folks who make up these lists care not about our time. Let’s re-compute the industry factor by computing the percents:

(dmarc_summary %>% 
  group_by(industry) %>% 
  mutate(pct = n/sum(n)) %>% 
  ungroup() %>% 
  filter(p != "No DMARC") %>% 
  count(industry, wt=pct, sort=TRUE) -> industry_levels)
## # A tibble: 21 x 2
##    industry               n
##    <chr>              <dbl>
##  1 Transportation     0.667
##  2 Technology         0.641
##  3 Wholesalers        0.615
##  4 Financials         0.614
##  5 Health Care        0.6  
##  6 Business Services  0.55 
##  7 Food & Drug Stores 0.5  
##  8 Retailing          0.5  
##  9 Industrials        0.444
## 10 Telecommunications 0.375
## # … with 11 more rows

Now, we can go back to using position_fill() as before:

dmarc_summary %>% 
  mutate(p = factor(p, levels = rev(dmarc_levels))) %>% 
  mutate(industry = factor(industry, rev(industry_levels$industry))) %>% 
  ggplot(aes(industry, n)) +
  geom_chicklet(aes(fill = p), position = position_fill(reverse = TRUE)) +
  scale_fill_manual(name = NULL, values = dmarc_cols) +
  scale_y_percent(expand = c(0, 0.001), position = "right") +
  coord_flip() +
  labs(
    x = NULL, y = NULL,
    title = "DMARC Status of Fortune 500 (2017 List; 2018 Measurement) Primary Email Domains"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "top")

FIN

As noted, this is one way to handle this situation. I’m not super happy with the final visualization here as it doesn’t have the counts next to the industry labels and I like to have the ordering by both count and more secure configuration (so, conditional on higher prevalence of Quarantine or Reject when there are ties). That is an exercise left to the reader 😎.

An RSwitch user, lcolladotor filed a most-welcome issue letting me know that the core functionality of the switcher was busted 😱. After testing out the 1.5.1 release candidate I had made a “harmless” & “clever” change to reduce some redundancy in the code that handled with switching which resulted in busted symbolic link creation. Tis fixed, now.

To somewhat make amends for said error James Balamuta’s excellent “R Compiler Tools for Rcpp on macOS” resource (https://thecoatlessprofessor.com/programming/cpp/r-compiler-tools-for-rcpp-on-macos/) as been added to the available web resources links.

I’ve also setup a mailing list for RSwitch over at sourcehut where you can signup directly w/a (free) sourcehut account (signup by just email) and see archives. The RSwitch menu has a new link to the mailing list.

There’s also a new blog category for RSwitch which has it’s own RSS feed (https://rud.is/b/category/rswitch/feed/) to make it easier to keep up with RSwitch-only updates.

“Check for updates” will get you to the new release or you can grab it directly from the RSwitch site.

This past week @propublica linked to a really spiffy resource for getting an overview of a Twitter user’s profile and activity called accountanalysis. It has a beautiful interface that works as well on mobile as it does in a real browser. It also is fully interactive and supports cross-filtering (zoom in on the timeline and the other graphs change). It’s especially great if you’re not a coder, but if you are, @kearneymw’s {rtweet} can get you all this info and more, putting the power of R behind data frames full of tweet inanity.

While we covered quite a bit of {rtweet} ground in the 21 Recipes book, summarizing an account to the degree that accountanalysis does is not in there. To rectify this oversight, I threw together a static clone of accountanalysis that can make standalone HTML reports like this one.twitter account analysis header

It’s a fully parameterized R markdown document, meaning you can run it as just a function call (or change the parameter and knit it by hand):

rmarkdown::render(
  input = "account-analysis.Rmd", 
  params = list(
    username = "propublica"
  ), 
  output_file = "~/Documents/propublica-analysis.html"
)

It will also, by default, save a date-stamped copy of the user info and retrieved timeline into the directory you generate the report from (add a prefix path to the save portion in the Rmd to store it in a better place).

With all the data available, you can dig in and extract all the information you want/need.

FIN

You can get the Rmd at your favorite social coding service: