
Category Archives: R

Researching “the internet” (i.e. $DAYJOB) means having to deal with a ton of “unique” (I’m being kind) data formats. This is ultimately a tale of how I performed full-text searches across one of them.

It all started off innocently enough. This past week I needed to do full-text searches across metadata about who is using which parts of the internet. Normally I don’t need to do that at scale and can just go to RIPE’s excellent resource and manage to find what I need on the first page. However, this time I needed all the resultant info and noticed an interesting foible in that full text search interface. To reproduce it, enter something like “domino's” (for the record, I’m not researching Domino’s Pizza — nor would I ever consume it — but a Twitter ad happened to fly by for Domino’s and I just typed it for kicks) into the field and page around, keeping an eye on the results. I think they still use Solr for indexing/searching and aren’t passing in all they need to keep session context or something. Anyway, suffice it to say it was fairly useless (I filed a bug report, so I’m not just complaining, and I wish more sites had the same easy error-report filing capability the RIPE folks do).

If it were just searching for precise data in one field, that’s not really an issue since we have ALL THE WHOIS IP THINGS in Parquet. But:

  • I really hate giving Amazon money (even if it’s $WORK money) for Athena queries
  • Full text search across all columns is not one of Parquet’s strengths
  • This is a third bullet b/c I feel compelled to have a minimum of three points in bullet lists likely thanks to an overbearing middle-school English teacher

Since I have a modest analytics server setup at home, I figured I’d take the opportunity to re-brush-up on either Elasticsearch or Couchbase since both are pretty great at free-text searching JSON data. Except…this isn’t JSON data. It’s records formatted like this:

#
# The contents of this file are subject to 
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#

as-block:       AS7 - AS7
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

as-block:       AS28 - AS28
descr:          RIPE NCC ASN block
remarks:        These AS Numbers are assigned to network operators in the RIPE NCC service region.
mnt-by:         RIPE-NCC-HM-MNT
created:        2018-11-22T15:27:05Z
last-modified:  2018-11-22T15:27:05Z
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

The “keys” (the colon-ified line prefixes) vary, and there are other record types (which I don’t need) that have other prefixes in them, plus those #-prefixed comments are not necessarily only at the top. But, after judicious use of stringi::stri_enc_toutf8(), stringi::stri_split_regex() and some vectorized record targeting, they’re pretty easily converted to lovely ndjson data like this (a random selection from further along in the conversion; a minimal sketch of the conversion follows the sample records):

{"descr":"Reseau Teleinformatique de l'Education Nationale Educational and research network for Luxembourg","admin_c":"DUMY-RIPE","as_set":"AS-RESTENA","members":"AS2602, AS42909, AS51966, AS49624","mnt_by":"AS2602-MNT","notify":"noc@restena.lu","tech_c":"DUMY-RIPE"}
{"descr":"CWIX ASes announced to EBONE","admin_c":"DUMY-RIPE","as_set":"AS-TMPEBONECWIX","members":"AS3727, AS4445, AS4610, AS4624, AS4637, AS4654, AS4655, AS4656, AS4659 AS4681, AS4696, AS4714, AS4849, AS5089, AS5090, AS5532, AS5551, AS5559 AS5655, AS6081, AS6255, AS6292, AS6618, AS6639","mnt_by":"EBONE-MNT","notify":"staff@ebone.net","tech_c":"DUMY-RIPE"}
{"descr":"ASs accepted by DFN from the University of Cologne","admin_c":"DUMY-RIPE","as_set":"AS-DFNFROMCOLOGNE","members":"AS5520 AS6733","mnt_by":"DFN-MNT","tech_c":"DUMY-RIPE"}
{"descr":"NetMatters UK","admin_c":"DUMY-RIPE","as_set":"AS-NETMATTERS","members":"AS6765 AS3344","mnt_by":"AS8407-MNT","tech_c":"DUMY-RIPE"}

I went with Couchbase since it handles ndjson import by default and — as you know since you read the comparison in the aforelinked article — it can easily index all fields by default without you having to do virtually anything. Plus, Couchbase has been around long enough that it generally installs without pain and has a fairly decent web admin panel. Here’s a snapshot of the final import:

and here’s the config for the “all” full text index:

{
  "type": "fulltext-index",
  "name": "all",
  "uuid": "481bc7ed642dddfb",
  "sourceType": "couchbase",
  "sourceName": "ripe",
  "sourceUUID": "3ffbbe0c0923f233ffe0fc96c652262d",
  "planParams": {
    "maxPartitionsPerPIndex": 171
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": true
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type"
    },
    "store": {
      "indexType": "scorch",
      "kvStoreName": ""
    }
  },
  "sourceParams": {}
}

You Said This Is A Post With R Code

Very true! We’ll get to that in a minute.

Going with Couchbase introduced a different problem: there’s almost no R support for Couchbase. Sure, Couchbase has a gnarly, two-year-old, raw httr::-prefixed bit of a tutorial post but that’s not really as cool as if there were a library(couchbase). I mean, you can check GitUgh or CRAN or a more general search yourself if you’d like but it’s going to come up bupkis.

If you were expecting a big reveal, right now, that I’ve got a feature-packed, full R Couchbase package ready to roll…you didn’t actually read the title of the post. What I do have is a set of functions that — given server/connection metadata, a bucket, a full text index, and a query — will return all matching documents (I still do not like that term for “record”) for said set of parameters:

# function code is in: https://paste.sr.ht/~hrbrmstr/051f5d5400644952a3ad2cf8664b84e2cbb9ac6b

cb_fts("domino's", "all", "ripe")
## # A tibble: 120 x 9
##    admin_c   country descr                      inetnum                  mnt_by      netname  status    tech_c  notify         
##    <chr>     <chr>   <chr>                      <chr>                    <chr>       <chr>    <chr>     <chr>   <chr>          
##  1 DUMY-RIPE FR      OPEN IP DOMINO'S PIZZA     79.141.8.44 - 79.141.8.… ALPHALINK-… OPEN-IP  ASSIGNED… DUMY-R… NA             
##  2 DUMY-RIPE NL      Domino's Pizza TILBURG     62.21.176.160 - 62.21.1… AS286-MNT   OTS2634… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  3 DUMY-RIPE NL      Domino's Pizza EINDHOVEN   62.132.252.168 - 62.132… AS286-MNT   OTS2270… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  4 DUMY-RIPE NL      Domino's Pizza SPYKENISSE  194.123.233.232 - 194.1… AS286-MNT   OTS69259 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  5 DUMY-RIPE NL      Domino's AMSTERDAM         37.74.38.188 - 37.74.38… AS286-MNT   OTS6103… ASSIGNED… DUMY-R… kpn-ip-office@…
##  6 DUMY-RIPE NL      Domino's Pizza VOORSCHOTEN 92.66.116.136 - 92.66.1… AS286-MNT   OTS1914… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  7 DUMY-RIPE NL      Domino's Pizza Doetinchem… 212.241.42.136 - 212.24… AS286-MNT   OTS2301… ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  8 DUMY-RIPE NL      Domino's Pizza AMSTERDAM   194.120.45.224 - 194.12… AS286-MNT   OTS82906 ASSIGNED… DUMY-R… ip-reg@kpn.net 
##  9 DUMY-RIPE NL      Domino's Pizza [Woerden] … 62.41.228.80 - 62.41.22… AS286-MNT   OTS2024… ASSIGNED… DUMY-R… ip-reg@kpn.net 
## 10 DUMY-RIPE NL      Domino's Pizza GRONINGEN   188.203.128.0 - 188.203… AS286-MNT   OTS3767… ASSIGNED… DUMY-R… kpn-ip-office@…
## # … with 110 more rows

It’s not fancy.

It meets the needs of a narrow use case.

It’s not in a standalone package (which is triggering my R code OCD something fierce).

But it’s seriously fast, it got me back to “work mode” with a minimum of hassle, and now there’s some google-able Couchbase R code that isn’t just bare httr calls, which may help someone else who’s on a quest for how to work with Couchbase in R.

The first function, cb_fts(), uses the /api/index/{index-name}/query API endpoint to paginate through the results of the full text search and retrieve all matching document ids, then calls the second function, cb_get_records_from_keys(), which uses the /query/service API endpoint to issue a SELECT * FROM {bucket} USE KEYS {keys} query with all of the found document (record) ids and return the result set. Nothing fancier than that.
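
If you just want the gist without clicking through to the paste, here’s a stripped-down sketch of those two calls. It assumes Couchbase’s default FTS (8094) and query (8093) ports and basic auth pulled from environment variables, and it skips the pagination the real cb_fts() does:

library(httr)
library(jsonlite)

cb_auth <- authenticate(Sys.getenv("COUCHBASE_USER"), Sys.getenv("COUCHBASE_PASS"))

# full text search: return the matching document ids
cb_fts_ids <- function(query, index, host = "localhost", size = 1000) {
  res <- POST(
    url = sprintf("http://%s:8094/api/index/%s/query", host, index),
    cb_auth, encode = "json",
    body = list(query = list(query = query), size = size, from = 0)
  )
  stop_for_status(res)
  vapply(content(res)$hits, `[[`, character(1), "id")
}

# N1QL: fetch the full documents for those ids
cb_get_records_from_keys <- function(keys, bucket, host = "localhost") {
  res <- POST(
    url = sprintf("http://%s:8093/query/service", host),
    cb_auth, encode = "json",
    body = list(statement = sprintf("SELECT * FROM `%s` USE KEYS %s", bucket, toJSON(keys)))
  )
  stop_for_status(res)
  # each result comes back wrapped under the bucket name
  fromJSON(content(res, as = "text", encoding = "UTF-8"))$results
}

The real functions add pagination and error handling and tidy the results into the tibble you see above.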

FIN

While I do not have these functions in a standalone, Couchbase-focused package, I do have them in the package associated with this particular project. If you do know of a Couchbase R package (please don’t link to JDBC/ODBC drivers as I’m not going to buy them), please link to it in the comments.

If you have other strategies for how to deal with these “un-packages”, please blog about it and post a link as well! I’m curious how others balance the package/not-a-package/un-package tension, especially when you may need to depend on a series of functions across projects.

@ted_dunning recently updated the t-Digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.

T-digests have been baked into many “big data” analytics ecosystems for a while but I hadn’t seen any R packages for them (reference any in a comment if you do know of some), so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately, named tdigest:

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest S3 method for stats::quantile()).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")

will install from source or binaries onto your system(s).

Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03)

td <- td_create()

for (i in 1:100000) {
  td_add(td, sample(100, 1), 1)
}

quantile(td)
## [1]   1.00000  25.62222  53.09883  74.75522 100.00000

More R-like Ops

Vectorisation is the name of the game in R and we can use tdigest() to work in a vectorised manner:

set.seed(2019-04-03)

x <- sample(100, 1000000, replace=TRUE)

td <- tdigest(x)

quantile(td)
## [1]   1.00000  25.91914  50.79468  74.76439 100.00000

Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It’s pretty darned fast:

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr      min         lq        mean    median       uq       max neval
##     tdigest    22.81    26.6525    48.70123    53.355    63.31    151.29   100
##  r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50   100

Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.
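
If you want to eyeball that difference yourself, a quick side-by-side comparison (re-using the x and td objects from above) does the trick:

probs <- c(0, 0.25, 0.5, 0.75, 1)

round(cbind(
  tdigest = tquantile(td, probs),              # t-digest approximation
  exact   = quantile(x, probs, names = FALSE)  # classic (exact) quantiles
), 2)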

FIN

This was a quick (but complete) wrapper and could use some tyre kicking. I’ve a mind to add serialization to the C implementation so I can then enable [de]serialization on the R side since that would (IMO) make t-digest ops more useful in an R context, especially since you can merge two different t-digests.

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whoever started the braces notation for package names (e.g. {ggplot2}): brilliant!

I saw a second post on turning htmlwidgets into interactive Twitter Player cards and felt somewhat compelled to make creating said entities a bit easier, so I posited the following:

I figured 40+ 💙 could not be wrong, and thus begat widgetcard:

To make this post as short as possible, the TLDR is that you just pass in an htmlwidget and some required parameters and you get back a deployable interactive Twitter Player card as an archive file and local directory. The example code is almost as short since we’re cheating and using the immensely helpful plotly package to turn a ggplot2 vis into something interactive.

First, make the vis:

library(ssh)
library(plotly)
library(ggplot2)
library(widgetcard)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() -> gg

Now, we create a local preview image for the plot we just made since we need one for the card:

preview <- gg_preview(gg)

NOTE that you can use any image you want. This function streamlines the process for plotly plots created from ggplot2 plots. There are links to image sizing guidelines in the package help files.

Now, we convert our ggplot2 object to a plotly object and create the Twitter Player card. Note that Twitter really doesn’t like standalone widgets being used as Twitter Player card links due to their heavyweight size. Therefore, card_widget() creates a non-standalone widget but bundles everything up into a single directory and deployable archive.

ggplotly(gg) %>% 
  card_widget(
    output_dir = "~/widgets/tc",
    name_prefix = "tc",
    preview_img = preview,
    html_title = "A way better title",
    card_twitter_handle = "@hrbrmstr",
    card_title = "Basic ggplot2 example",
    card_description = "This is a sample caRd demonstrating card_widget()",
    card_image_url_prefix = "https://rud.is/vis/tc/",
    card_player_url_prefix = "https://rud.is/vis/tc/",
    card_player_width = 480,
    card_player_height = 480
  ) -> arch_fil

Here’s what the resulting directory structure looks like:

tc
├── tc.html
├── tc.png
└── tc_files
    ├── crosstalk-1.0.0
    │   ├── css
    │   │   └── crosstalk.css
    │   └── js
    │       ├── crosstalk.js
    │       ├── crosstalk.js.map
    │       ├── crosstalk.min.js
    │       └── crosstalk.min.js.map
    ├── htmlwidgets-1.3
    │   └── htmlwidgets.js
    ├── jquery-1.11.3
    │   ├── jquery-AUTHORS.txt
    │   ├── jquery.js
    │   ├── jquery.min.js
    │   └── jquery.min.map
    ├── plotly-binding-4.8.0
    │   └── plotly.js
    ├── plotly-htmlwidgets-css-1.39.2
    │   └── plotly-htmlwidgets.css
    ├── plotly-main-1.39.2
    │   └── plotly-latest.min.js
    ├── pymjs-1.3.2
    │   ├── pym.v1.js
    │   └── pym.v1.min.js
    └── typedarray-0.1
        └── typedarray.min.js

(There’s also a tc.tgz at the same level as the tc directory.)

The widget is iframe’d using widgetframe and then saved out using htmlwidgets::saveWidget().
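
Under the hood it’s conceptually similar to this sketch (not the actual card_widget() source, just the frameable-widget-plus-save step it builds on; the output path is illustrative):

library(widgetframe)
library(htmlwidgets)

w <- ggplotly(gg)

saveWidget(
  frameableWidget(w),     # wraps the widget so it plays nicely inside an <iframe> (adds pym.js)
  file = "~/widgets/tc/tc.html",
  selfcontained = FALSE   # dependencies land in tc_files/ instead of being inlined
)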

Now, for deploying this to a web server, one could use a method like this to scp the deployable archive:

sess <- ssh_connect(Sys.getenv("SSH_HOST"))

invisible(scp_upload(
  sess, files = arch_fil, Sys.getenv("REMOTE_VIS_DIR"), verbose = FALSE
))

ssh_exec_wait(
  sess,
  command = c(
    sprintf("cd %s", Sys.getenv("REMOTE_VIS_DIR")),
    sprintf("tar -xzf %s", basename(arch_fil))
  )
)

Alternatively, you can use other workflows to transfer and expand the archive or copy output to your static blog host.
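
For example, if your blog is a locally built static site, something as simple as this (the destination path is, obviously, yours to change) gets the job done:

# copy the already-expanded widget directory into the static site's output tree
file.copy("~/widgets/tc", "~/blog/static/vis/", recursive = TRUE)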

Make sure to test anything you build with Twitter’s validator before tweeting it out.

FIN

This works but is super nascent and could use some serious IRL tyre kicking and brutal feedback. Pick the least offensive social coding site you prefer and file issues & PRs at will.

There’s been a lot of talk about “dependencies” in the R universe of late. This is not really a post about that but more of a “really, don’t do this” if you decide you want to poke the dependency bear by trying to build a deeply flawed model off of CRAN package metadata.

CRAN packages undergo checks. Here’s one for akima (I :heart: me some gridded interpolation functions, plus this package is not in any hot-button R tribe right now):

Flavor                              Version  Tinstall  Tcheck  Ttotal  Status  Flags
r-devel-linux-x86_64-debian-clang   0.6-2        9.83   32.67   42.50  OK
r-devel-linux-x86_64-debian-gcc     0.6-2        8.53   26.56   35.09  OK
r-devel-linux-x86_64-fedora-clang   0.6-2                       53.33  NOTE
r-devel-linux-x86_64-fedora-gcc     0.6-2                       51.03  NOTE
r-devel-windows-ix86+x86_64         0.6-2       53.00   76.00  129.00  OK
r-patched-linux-x86_64              0.6-2        9.26   28.32   37.58  OK
r-patched-solaris-x86               0.6-2                       66.20  OK
r-release-linux-x86_64              0.6-2        8.59   28.25   36.84  OK
r-release-windows-ix86+x86_64       0.6-2       39.00   69.00  108.00  OK
r-release-osx-x86_64                0.6-2                              OK
r-oldrel-windows-ix86+x86_64        0.6-2       28.00   71.00   99.00  OK
r-oldrel-osx-x86_64                 0.6-2                              OK
Check Details
Version: 0.6-2
  Check: compiled code
 Result: NOTE

    File ‘akima/libs/akima.so’:
      Found no call to: ‘R_useDynamicSymbols’

    It is good practice to register native routines and to disable symbol search.

    See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.

The Status field can be “OK”, “NOTE”, “WARN[ING]”, “ERROR”, or “FAIL”.

You’ll also note that there are checks for a future, even cooler R (“devel”), spiffy R (“release”/”patched”) and :yawn: R (“oldrel”). Remember those, they are important.

Now, let’s say you wanted to perform an honest appraisal of whether packages with more dependencies are more likely to have one or more “bad” CRAN check conditions. You’d likely lump “NOTE” with “OK” and not mark that particular check against the package. That leaves “WARN[ING]” (the reason for the [ING] is that different check RDS files include/forego the [ING]…yay consistency?), “ERROR”, and “FAIL”. Obviously we can just use those without further concern, right?
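
(Concretely, that lumping, applied to the det check-details tibble we load in a moment, boils down to something like this:)

mutate(det, status_grp = case_when(
  Status %in% c("OK", "NOTE") ~ "OK-ish",   # NOTEs don't count against a package here
  grepl("^WARN", Status)      ~ "WARNING",  # harmonize WARN / WARNING
  TRUE                        ~ Status      # ERROR / FAIL pass through
)) %>%
  count(status_grp, sort = TRUE)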

WARN[ING] Will Robinson!

You can get a copy of the check details at https://cran.r-project.org/web/checks/check_details.rds. I happen to have a local copy (I used the “Mar 18 02:47” version) now that my CRAN mirror is humming along nicely again. Let’s make sure it’s being read in OK:
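
(If you don’t run a mirror, a plain download.file() on that URL gets you the same snapshot to read in:)

download.file(
  "https://cran.r-project.org/web/checks/check_details.rds",
  destfile = "check_details.rds",
  mode = "wb"
)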

library(tidyverse)

det <- as_tibble(readRDS("check_details.rds")) # it's got tons of classes & I like readable data frame prints

nrow(distinct(det, Package))
## [1] 13094

OK, that number tracks with the count as of the last rsync. So, what do these WARN[ING]s look like?

filter(det, Status == "WARNING") %>% 
  select(Output)
## # A tibble: 2,299 x 1
##    Output                                                             
##    <chr>                                                              
##  1 "Found the following significant warnings:\n  Warning: unable to r…
##  2 "Found the following significant warnings:\n  Warning: unable to r…
##  3 "Found the following significant warnings:\n  Warning: unable to r…
##  4 "Warning in parse(file = files, n = -1L) :\n  invalid input found …
##  5 "Warning in parse(file = files, n = -1L) :\n  invalid input found …
##  6 "Found the following significant warnings:\n  Warning: unable to r…
##  7 "Found the following significant warnings:\n  Warning: unable to r…
##  8 "Found the following significant warnings:\n  Warning: unable to r…
##  9 "Found the following significant warnings:\n  track_methods.cpp:22…
## 10 "Found the following significant warnings:\n  Warning: unable to r…
## # … with 2,289 more rows

EEK! Well, actually not really. The checks are automated and we can use some substring machinations to try to get better groups:

filter(det, Status == "WARNING") %>% 
  mutate(bits = substring(trimws(Output), 1, 30)) %>% 
  count(bits, sort=TRUE) %>% 
  mutate(pct = n/sum(n))
## # A tibble: 29 x 3
##    bits                                  n     pct
##    <chr>                             <int>   <dbl>
##  1 Found the following significan     1316 0.572  
##  2 Error in re-building vignettes      702 0.305  
##  3 Error(s) in re-building vignet       90 0.0391 
##  4 Output from running autoreconf       56 0.0244 
##  5 Warning in parse(file = files,       37 0.0161 
##  6 Missing link or links in docum       15 0.00652
##  7 "network.dyadcount:\n  function("    10 0.00435
##  8 Found the following executable        9 0.00391
##  9 dyld: Library not loaded: /Bui        8 0.00348
## 10 Errors in running code in vign        8 0.00348
## # … with 19 more rows

OK, so some “eeking” is warranted for those “significant” ones, but about a third of these findings are about vignettes. Sure, vignettes are important and ideally they build fine, but there are tons of reasons they don’t on CRAN’s ever-changing infrastructure. I say they need to be excluded. Drop a note in the comments if you have a different opinion, since this is one analyst’s opinion. But I happen to know CRAN really well and would seriously suggest that, in the context of the question regarding high-dependency package efficacy, these should be ignored unless investigated individually.
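
Excluding them is a one-liner; the pattern below matches both vignette-rebuild groupings from the table above:

filter(det, Status == "WARNING") %>% 
  filter(!grepl("in re-building vignet", Output))   # drops the vignette-rebuild findings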

So, where are these “significant” WARN[ING]s?

filter(det, Status == "WARNING") %>% 
  filter(grepl("significant", Output, ignore.case = TRUE)) %>%
  mutate(flavor_flav = case_when(
    grepl("devel", Flavor) ~ "devel",
    grepl("oldrel", Flavor) ~ "oldrel",
    TRUE ~ "current"
  )) %>% 
  count(flavor_flav, sort=TRUE) %>% 
  mutate(pct = n/sum(n))
## # A tibble: 3 x 3
##   flavor_flav     n   pct
##   <chr>       <int> <dbl>
## 1 devel         904 0.686
## 2 current       280 0.213
## 3 oldrel        133 0.101

I posit that if the goal really is to create a model to help decide whether you should take on the risk of packages with multiple++ dependencies, you cannot include “devel”. Nobody sane runs “devel” in production, and that’s the real goal: a safe production environment. So you literally have to throw out ~69% of these, too (some folks are stuck on :yawn: “oldrel” R in orgs with draconian IT practices or fragile workflow systems). We’re not at “0” yet, so what are some of these issues?

filter(det, Status == "WARNING") %>% 
  filter(grepl("significant", Output, ignore.case = TRUE)) %>%
  mutate(flavor_flav = case_when(
    grepl("devel", Flavor) ~ "devel",
    grepl("oldrel", Flavor) ~ "oldrel",
    TRUE ~ "current"
  )) %>% 
  filter(flavor_flav != "devel") %>% 
  mutate(Output = gsub("Found the following significant warnings:\n  ", "", trimws(Output))) %>% 
  mutate(bits = substring(trimws(Output), 1, 50)) %>% 
  count(bits, sort=TRUE) %>% 
  mutate(pct = n/sum(n))
## # A tibble: 110 x 3
##    bits                                                      n     pct
##    <chr>                                                 <int>   <dbl>
##  1 "Warning: S3 methods '[.fun_list', '[.grouped_df', "    158 0.383  
##  2 Warning: 'rgl_init' failed, running with rgl.useNU       60 0.145  
##  3 "Found the following significant warnings:\n\n  Warn…     9 0.0218 
##  4 Warning: package ‘dplyr’ was built under R version        9 0.0218 
##  5 "bgc_hmm.c:241:31: warning: ‘, ’ directive writing "      4 0.00969
##  6 driver.c:381:26: warning: cast from pointer to int        4 0.00969
##  7 hash.c:144:5: warning: ‘strncpy’ specified bound 2        4 0.00969
##  8 RngStream.c:347:4: warning: ‘strncpy’ specified bo        4 0.00969
##  9 Warning: ‘__var_1_mmb.offset’ is used uninitialize        4 0.00969
## 10 /home/hornik/tmp/R.check/r-patched-gcc/Work/build/        3 0.00726
## # … with 100 more rows

Fun fact: a re-run of this with a 2019-03-19 RDS pulled from CRAN shows 79 vs 158 (and those 79 packages weren’t magically re-submitted). This is usually a CRAN check “hiccup” on Windows:

filter(det, Status == "WARNING") %>% 
  filter(grepl("significant", Output, ignore.case = TRUE)) %>%
  mutate(flavor_flav = case_when(
    grepl("devel", Flavor) ~ "devel",
    grepl("oldrel", Flavor) ~ "oldrel",
    TRUE ~ "current"
  )) %>% 
  filter(flavor_flav != "devel") %>% 
  mutate(Output = gsub("Found the following significant warnings:\n  ", "", trimws(Output))) %>% 
  mutate(bits = substring(trimws(Output), 1, 50)) %>% 
  filter(grepl("S3 methods", bits)) %>% 
  count(bits, Flavor, sort=TRUE) %>% 
  mutate(pct = n/sum(n))
## # A tibble: 6 x 4
##   bits                               Flavor                  n     pct
##   <chr>                              <chr>               <int>   <dbl>
## 1 "Warning: S3 methods '[.fun_list'… r-oldrel-windows-i…    79 0.485  
## 2 "Warning: S3 methods '[.fun_list'… r-release-windows-…    79 0.485  
## 3 Warning: S3 methods 'as_mapper.ch… r-oldrel-windows-i…     2 0.0123 
## 4 Warning: S3 methods '.DollarNames… r-oldrel-windows-i…     1 0.00613
## 5 Warning: S3 methods 'as.promise.F… r-release-windows-…     1 0.00613
## 6 Warning: S3 methods 'format.stati… r-oldrel-windows-i…     1 0.00613

Yep! So, we really have to ignore some portion of these but not many (remember, these are test counts, not package counts).

Perhaps we’ll have better luck ginning up the analysis focusing on “ERROR”!

To ERROR Is Definitely Human When Assumptions Are Flawed

Let’s see about these ERRORs:

filter(det, Status == "ERROR") %>%
  mutate(flavor_flav = case_when(
    grepl("devel", Flavor) ~ "devel",
    grepl("oldrel", Flavor) ~ "oldrel",
    TRUE ~ "current"
  )) %>% 
  filter(flavor_flav != "devel") %>% 
  mutate(bits = substring(trimws(Output), 1, 20)) %>% 
  count(bits, sort=TRUE) %>% 
  print(n=66)
## # A tibble: 66 x 2
##    bits                        n
##    <chr>                   <int>
##  1 Installation failed.      437
##  2 "Running examples in "    424
##  3 Package required but      213
##  4 Running ‘testthat.R’      150
##  5 Packages required bu      102
##  6 Running 'testthat.R'       58
##  7 Package required and       31
##  8 Running ‘test-all.R’       17
##  9 Packages required an       12
## 10 Errors in running co        7
## 11 Running ‘spelling.R’        6
## 12 Running ‘test-that.R        5
## 13 Running 'activate_te        4
## 14 Running 'test-all.R'        4
## 15 Running ‘activate_te        3
## 16 Running 'Bernstein-E        2
## 17 Running 'Class+Meth.        2
## 18 Running 'dist_matrix        2
## 19 Running 'Frechet-tes        2
## 20 Running 'spelling.R'        2
## 21 Running 'test-as-dgC        2
## 22 Running 'valued_fit.        2
## 23 Running ‘allier.R’ [        2
## 24 Running ‘aunitizer.R        2
## 25 Running ‘autoprint.R        2
## 26 "Running ‘bdstest.R’ "      2
## 27 Running ‘build-tools        2
## 28 Running ‘exporting-m        2
## 29 Running ‘restfulr_un        2
## 30 Running ‘run_test.R’        2
## 31 Running ‘test_change        2
## 32 Running ‘test-as-dgC        2
## 33 Running ‘testGBHProc        2
## 34 Running ‘tests-nlgam        2
## 35 Running ‘testthat-pr        2
## 36 Running ‘TimeIn_Data        2
## 37 Running '000.session        1
## 38 Running '001.setupEx        1
## 39 Running 'bug1.R' [1s        1
## 40 "Running 'failure.R' "      1
## 41 Running 'fold.R' [5s        1
## 42 Running 'Rgui.R' [3s        1
## 43 Running 'Rgui.R' [4s        1
## 44 Running 'SP500-ex.R'        1
## 45 Running ‘aggregate.R        1
## 46 Running ‘as.edgelist        1
## 47 "Running ‘bdstest.R’\n"     1
## 48 Running ‘Class+Meth.        1
## 49 "Running ‘config.R’\n "     1
## 50 Running ‘cX-ui-funct        1
## 51 Running ‘DevEvalFile        1
## 52 "Running ‘emTests.R’\n"     1
## 53 "Running ‘group01.R’ "      1
## 54 Running ‘loop_genera        1
## 55 Running ‘LTS-special        1
## 56 Running ‘rsolr_unit_        1
## 57 "Running ‘run-all.R’\n"     1
## 58 Running ‘runTests.R’        1
## 59 "Running ‘sleep.R’\n  "     1
## 60 Running ‘SP500-ex.R’        1
## 61 Running ‘test_bccaq.        1
## 62 Running ‘test_scs.R’        1
## 63 Running ‘test-cluste        1
## 64 "Running ‘test.R’\nRun"     1
## 65 Running ‘tests.R’ [4        1
## 66 Running ‘testthat.r’        1

We need to investigate more so let’s make some groups:

filter(det, Status == "ERROR") %>%
  mutate(flavor_flav = case_when(
    grepl("devel", Flavor) ~ "devel",
    grepl("oldrel", Flavor) ~ "oldrel",
    TRUE ~ "current"
  )) %>% 
  filter(flavor_flav != "devel") %>% 
  mutate(Output = trimws(Output)) %>% 
  mutate(output_grp = case_when(
    grepl("^Running ", Output) ~ "Test/example run",
    grepl("^Installation fail", Output) ~ "Install failed",
    grepl("Package[s]* requir", Output) ~ "Missing pacakge(s)",
    grepl("Errors in running code in vig", Output) ~ "Vignette issue",
    TRUE ~ Output
  )) %>% 
  count(output_grp, sort=TRUE)
## # A tibble: 4 x 2
##   output_grp             n
##   <chr>              <int>
## 1 Test/example run     743
## 2 Install failed       437
## 3 Missing package(s)   358
## 4 Vignette issue         7

Much better. Now, let’s see where those are:

filter(det, Status == "ERROR") %>%
  mutate(flavor_flav = case_when(
    grepl("devel", Flavor) ~ "devel",
    grepl("oldrel", Flavor) ~ "oldrel",
    TRUE ~ "current"
  )) %>% 
  filter(flavor_flav != "devel") %>% 
  mutate(Output = trimws(Output)) %>% 
  mutate(output_grp = case_when(
    grepl("^Running ", Output) ~ "Test/example run",
    grepl("^Installation fail", Output) ~ "Install failed",
    grepl("Package[s]* requir", Output) ~ "Missing package(s)",
    grepl("Errors in running code in vig", Output) ~ "Vignette issue",
    TRUE ~ Output
  )) %>% 
  filter(!grepl("solaris", Flavor)) %>%  # I love ya Solaris, but you're not relevant anymore
  count(output_grp, Flavor, sort=TRUE) %>% 
  mutate(pct = n/sum(n))
## # A tibble: 16 x 4
##    output_grp         Flavor                            n     pct
##    <chr>              <chr>                         <int>   <dbl>
##  1 Install failed     r-oldrel-osx-x86_64             256 0.197  
##  2 Test/example run   r-oldrel-windows-ix86+x86_64    218 0.168  
##  3 Missing package(s) r-oldrel-osx-x86_64             156 0.120  
##  4 Missing package(s) r-release-osx-x86_64            133 0.102  
##  5 Test/example run   r-oldrel-osx-x86_64             110 0.0847 
##  6 Test/example run   r-release-osx-x86_64             80 0.0616 
##  7 Install failed     r-release-windows-ix86+x86_64    73 0.0562 
##  8 Test/example run   r-patched-linux-x86_64           57 0.0439 
##  9 Test/example run   r-release-linux-x86_64           57 0.0439 
## 10 Install failed     r-oldrel-windows-ix86+x86_64     46 0.0354 
## 11 Test/example run   r-release-windows-ix86+x86_64    46 0.0354 
## 12 Install failed     r-release-osx-x86_64             36 0.0277 
## 13 Missing package(s) r-oldrel-windows-ix86+x86_64     19 0.0146 
## 14 Missing package(s) r-release-windows-ix86+x86_64     5 0.00385
## 15 Vignette issue     r-release-osx-x86_64              4 0.00308
## 16 Vignette issue     r-oldrel-osx-x86_64               3 0.00231

Let’s poke a bit more, but let’s also be aware of the fact that some (many) ERRORs on “oldrel” are due to conditions like this, where the package specifies that it can only be used in release++ versions of R. So we kinda have to go all Columbo on every ERROR or exclude “oldrel” (we’ll do the latter since this post is already long), and we should also ignore the missing-packages ones since those are more than likely a CRAN issue.

filter(det, Status == "ERROR") %>%
  mutate(flavor_flav = case_when(
    grepl("devel", Flavor) ~ "devel",
    grepl("oldrel", Flavor) ~ "oldrel",
    TRUE ~ "current"
  )) %>% 
  filter(!(flavor_flav %in% c("oldrel", "devel"))) %>% 
  filter(!grepl("solaris", Flavor)) %>%  
  mutate(Output = trimws(Output)) %>% 
  mutate(output_grp = case_when(
    grepl("^Running ", Output) ~ "Test/example run",
    grepl("^Installation fail", Output) ~ "Install failed",
    grepl("Package[s]* requir", Output) ~ "Missing package(s)",
    grepl("Errors in running code in vig", Output) ~ "Vignette issue",
    TRUE ~ Output
  )) %>% 
  filter(output_grp != "Missing package(s)") %>% 
  distinct(Package)
## # A tibble: 254 x 1
##    Package          
##    <chr>            
##  1 AER              
##  2 archdata         
##  3 atlantistools    
##  4 BAMBI            
##  5 biglmm           
##  6 biglm            
##  7 BIOMASS          
##  8 blockingChallenge
##  9 broom            
## 10 clusternomics    
## # … with 244 more rows

Now we have a target package list (+ ~26 in “FAIL”) that can very likely legitimately have issues. We’ll let more practical data scientists than I am figure out the dependency tree member count for them and then determine proper features and model selection to come up with a far more legitimate “risk metric”.
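
(For completeness, pulling that “FAIL” list is the same kind of one-liner:)

filter(det, Status == "FAIL") %>% 
  distinct(Package)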

Just One More Thing

Did you read the bit about the 03-18 RDS having some serious differences from the 03-19 one? Yeah, so perhaps any model needs to be run a few times, or the data collected over the course of time, to ensure we’re working with as clean a dataset as possible. Y’know, ask practical data science questions like:

  • What data is available to me?
  • Will it help me solve the problem?
  • Is it enough?
  • Is the data quality good enough?

FIN

Never contrive an analysis just to fit your preferred message.

Assumptions matter. Analysis setup matters. Domain expertise matters. Dataset knowledge matters.

I can sum that up as mindfulness matters, and if you approach a project that way then go forth and LIBRARY ALL THE THINGS you need to accomplish your goals.

The fine folks over at @PacketTotal bequeathed an API token upon me, so I cranked out an R package for it to enable more dynamic investigations work (RStudio makes for an amazing incident-responder investigations console given that you can script in multiple languages, code in C[++], and write documentation all at the same time using R ‘projects’ with full source code control).

Since I used the DT package, my usual “just copy and paste the markdown into WordPress” approach wasn’t going to work, and I wasn’t going to do two saveWidget()s and force two iframes on y’all just for an introductory post, so the inline iframe for the R markdown output is below and can be frame-busted as well.

You can also find the source for the R code used in the R markdown document here.