Skip navigation

Category Archives: R

The fs package makes it super quick and easy to find out just how much “package hoarding” you’ve been doing:

library(fs)
library(ggalt) # devtools::install_github("hrbrmstr/ggalt")
library(igraph) 
library(ggraph) # devtools::install_github("thomasp85/ggraph")
library(hrbrthemes) # devtools::install_github("hrbrmstr/hrbrthemes")
library(tidyverse)

installed.packages() %>%
  as_data_frame() %>%
  mutate(pkg_dir = sprintf("%s/%s", LibPath, Package)) %>%
  select(pkg_dir) %>%
  mutate(pkg_dir_size = map_dbl(pkg_dir, ~{
    fs::dir_info(.x, all=TRUE, recursive=TRUE) %>%
      summarise(tot_dir_size = sum(size)) %>% 
      pull(tot_dir_size)
  })) %>% 
  summarise(
    total_size_of_all_installed_packages=ggalt::Gb(sum(pkg_dir_size))
  ) %>% 
  unlist()
## total_size_of_all_installed_packages 
##                             "1.6 Gb"

While you can modify the above and peruse the list of packages/directories in tabular format or programmatically, you can also do a bit more work to get a visual overview of package size (click/tap the image for a larger view):

installed.packages() %>%
  as_data_frame() %>%
  mutate(pkg_dir = sprintf("%s/%s", LibPath, Package)) %>%
  mutate(dir_info = map(pkg_dir, fs::dir_info, all=TRUE, recursive=TRUE)) %>% 
  mutate(dir_size = map_dbl(dir_info, ~sum(.x$size))) -> xdf

select(xdf, Package, dir_size) %>% 
  mutate(grp = "ROOT") %>% 
  add_row(grp = "ROOT", Package="ROOT", dir_size=0) %>% 
  select(grp, Package, dir_size) %>% 
  arrange(desc(dir_size)) -> gdf

select(gdf, -grp) %>% 
  mutate(lab = sprintf("%s\n(%s)", Package, ggalt::Mb(dir_size))) %>% 
  mutate(lab = ifelse(dir_size > 1500000, lab, "")) -> vdf

g <- graph_from_data_frame(gdf, vertices=vdf)

ggraph(g, "treemap", weight=dir_size) +
  geom_node_tile(fill="lightslategray", size=0.25) +
  geom_text(
    aes(x, y, label=lab, size=dir_size), 
    color="#cccccc", family=font_ps, lineheight=0.875
  ) +
  scale_x_reverse(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  scale_size_continuous(trans="sqrt", range = c(0.5, 8)) +
  ggraph::theme_graph(base_family = font_ps) +
  theme(legend.position="none")

treemap of package disk consumption

Challenge

Do some wrangling with the above data and turn it into a package “disk explorer” with @timelyportfolio’s d3treeR? package.

(R⁶ == brief, low-expository posts)

@yoniceedee suggested I look at the Cambridge Analytics “whistleblower” testimony proceedings:

I value the resources @yoniceedee tosses my way (they often end me down twisted paths like this one, though :-) but I really dislike spending any amount of time on youtube and can consume text context much faster than even accelerated video playback.

Google auto-generated captions for that video and you can display them by clicking below the video on the right and enabling the transcript which slowly (well, in my frame of reference) loads into the upper-right. That’s still sub-optimal since we need to be on the youtube page to read/scroll. There’s no “export” option and my initial instinct was to go to Developer Tools and look for the https://www.youtube.com/service_ajax?name=getTranscriptEndpoint URL and “Copy the Response” to the clipboard and save it to a file then do some JSON/list wrangling (the transcript JSON URL is in the snippet below):

library(tidyverse)

trscrpt <- jsonlite::fromJSON("https://rud.is/dl/ca-transcript.json")

runs <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$formattedStartOffset$runs
cues <- trscrpt$data$actions$openTranscriptAction$transcriptRenderer$transcriptRenderer$body$transcriptBodyRenderer$cueGroups[[1]]$transcriptCueGroupRenderer$cues

data_frame(
  mark = map_chr(runs, ~.x$text),
  text = map_chr(cues, ~.x$transcriptCueRenderer$cue$runs[[1]]$text)  
) %>% 
  separate(mark, c("minute", "second"), sep=":", remove = FALSE, convert = TRUE) 
## # A tibble: 3,247 x 4
##    mark  minute second text                                    
##    <chr>  <int>  <int> <chr>                                   
##  1 00:00      0      0 all sort of yeah web of things if it's a
##  2 00:02      0      2 franchise then there's a kind of        
##  3 00:03      0      3 ultimately there's a there's a there's a
##  4 00:05      0      5 coordinator of that franchise or someone
##  5 00:07      0      7 who's a you got a that franchise is well
##  6 00:09      0      9 well when I was there that was Alexander
##  7 00:13      0     13 Nixon Steve banning but that's that's a 
##  8 00:16      0     16 question you should be asking aiq yeah  
##  9 00:18      0     18 yeah and just got to a IQ and the GSR   
## 10 00:24      0     24 state from gts-r that's other Hogan data
## # ... with 3,237 more rows

But, then I remembered YouTube has an API for this and threw together a quick script to grab them that way as well:

# the API needs these scopes

c(
  "https://www.googleapis.com/auth/youtube.force-ssl",
  "https://www.googleapis.com/auth/youtubepartner"
) -> scope_list

# oauth dance

httr::oauth_app(
  appname = "google",
  key = Sys.getenv("GOOGLE_APP_SECRET"),
  secret = Sys.getenv("GOOGLE_APP_KEY")
) -> captions_app

httr::oauth2.0_token(
  endpoint = httr::oauth_endpoints("google"),
  app = captions_app,
  scope = scope_list,
  cache = TRUE
) -> google_token

# list the available captions for this video
# (captions can be in one or more languages)

httr::GET(
  url = "https://www.googleapis.com/youtube/v3/captions",
  query = list(
    part = "snippet",
    videoId = "f2Sxob3fl0k" # the v=string in the YouTube URL
  ),
  httr::config(token = google_token)
) -> caps_list

# I'm cheating since I know there's only one but you'd want
# to introspect `caps_list` before blindly doing this for 
# other videos.

httr::GET(
  url = sprintf(
    "https://www.googleapis.com/youtube/v3/captions/%s",
    httr::content(caps_list)$items[[1]]$id
  ),
  httr::config(token = google_token)
) -> caps

# strangely enough, the JSON response "feels" better than this
# one, though this is a standard format that's parseable quite well.

cat(rawToChar(httr::content(caps)))
## 0:00:00.000,0:00:03.659
## all sort of yeah web of things if it's a
## 
## 0:00:02.490,0:00:05.819
## franchise then there's a kind of
## 
## 0:00:03.659,0:00:07.589
## ultimately there's a there's a there's a
## 
## 0:00:05.819,0:00:09.660
## coordinator of that franchise or someone
## 
## 0:00:07.589,0:00:13.139
## who's a you got a that franchise is well
## 
## 0:00:09.660,0:00:16.230
## well when I was there that was Alexander
## ...

Neither a reflection on active memory nor a quick Duck Duck Go search (I try not to use Google Search anymore) seemed to point to an existing R resource for this, hence the quick post in the event the snippet is helpful to anyone else.

If you do know of an R package/snippet that does this already, please shoot a note into the comments so others can find it.

A quick Friday post to let folks know about three in-development R packages that you’re encouraged to poke the tyres o[fn] and also jump in and file issues or PRs for.

Alleviating aversion to versions

I introduced a “version chart” in a recent post and one key element of tagging years (which are really helpful to get a feel for scope of exposure + technical/cyber-debt) is knowing the dates of product version releases. You can pay for such a database but it’s also possible to cobble one together, and that activity will be much easier as time goes on with the vershist? package.

Here’s a sample:

apache_httpd_version_history()
## # A tibble: 29 x 8
##    vers   rls_date   rls_year major minor patch prerelease build
##    <fct>  <date>        <dbl> <int> <int> <int> <chr>      <chr>
##  1 1.3.0  1998-06-05     1998     1     3     0 ""         ""   
##  2 1.3.1  1998-07-22     1998     1     3     1 ""         ""   
##  3 1.3.2  1998-09-21     1998     1     3     2 ""         ""   
##  4 1.3.3  1998-10-09     1998     1     3     3 ""         ""   
##  5 1.3.4  1999-01-10     1999     1     3     4 ""         ""   
##  6 1.3.6  1999-03-23     1999     1     3     6 ""         ""   
##  7 1.3.9  1999-08-19     1999     1     3     9 ""         ""   
##  8 1.3.11 2000-01-22     2000     1     3    11 ""         ""   
##  9 1.3.12 2000-02-25     2000     1     3    12 ""         ""   
## 10 1.3.14 2000-10-10     2000     1     3    14 ""         ""   
## # ... with 19 more rows

Not all vendored-software uses semantic versioning and many have terrible schemes that make it really hard to create an ordered factor, but when that is possible, you get a nice data frame with an ordered factor you can use for all sorts of fun and useful things.

It has current support for:

  • Apache httpd
  • Apple iOS
  • Google Chrome
  • lighttpd
  • memcached
  • MongoDB
  • MySQL
  • nginx
  • openresty
  • openssh
  • sendmail
  • SQLite

and I’ll add more over time.

Thanks to @bikesRdata there will be a …_latest() function for each vendor and I’ll likely add some helper functions so you only need to call one function with a parameter vs individual ones for each version and will also likely add a caching layer so you don’t have to scrape/clone/munge every time you need versions (seriously: look at the code to see what you have to do to collect some of this data).

And, they all it a MIME…a MIME!

I’ve had the wand? package out for a while but have never been truly happy with it. It uses libmagic on unix-ish systems but requires Rtools on Windows and relies on a system call to file.exe on that platform. Plus the “magic” database is too big to embed in the package and due to the (very, very, very good and necessary) privacy/safety practices of CRAN, writing the boilerplate code to deal with compilation or downloading of the magic database is not something I have time for (and it really needs regular updates for consistent output on all platforms).

A very helpful chap, @VincentGuyader, was lamenting some of the Windows issues which spawned a quick release of simplemagic?. The goal of this package is to be a zero-dependency install with no reliance on external databases. It has built-in support for basing MIME-type “guesses” off of a handful of the more common types folks might want to use this package for and a built-in “database” of over 1,500 file type-to-MIME mappings for guessing based solely on extension.

list.files(system.file("extdat", package="simplemagic"), full.names=TRUE) %>% 
  purrr::map_df(~{
    dplyr::data_frame(
      fil = basename(.x),
      mime = list(simplemagic::get_content_type(.x))
    )
  }) %>% 
  tidyr::unnest()
## # A tibble: 85 x 2
##    fil                        mime                                                                     
##    <chr>                      <chr>                                                                    
##  1 actions.csv                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  2 actions.txt                application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  3 actions.xlsx               application/vnd.openxmlformats-officedocument.spreadsheetml.sheet        
##  4 test_1.2.class             application/java-vm                                                      
##  5 test_1.3.class             application/java-vm                                                      
##  6 test_1.4.class             application/java-vm                                                      
##  7 test_1.5.class             application/java-vm                                                      
##  8 test_128_44_jstereo.mp3    audio/mp3                                                                
##  9 test_excel_2000.xls        application/msword                                                       
## 10 test_excel_spreadsheet.xml application/xml      
## ...

File issues or PRs if you need more header-magic introspected guesses.

NOTE: The rtika? package could theoretically do a more comprehensive job since Apache Tika has an amazing assortment of file-type introspect-ors. Also, an interesting academic exercise might be to collect a sufficient corpus of varying files, pull the first 512-4096 bytes of each, do some feature generation and write an ML-based classifier for files with a confidence level + MIME-type output.

Site promiscuity detection

urlscan is a fun site since it frees you from the tedium (and expense/privacy-concerns) of using a javascript-enabled scraping setup to pry into the makeup of a target URL and find out all sorts of details about it, including how many sites it lets track you. You can do the same with my splashr? package, but you have the benefit of a third-party making the connection with urlscan.io vs requests coming from your IP space.

I’m waiting on an API key so I can write the “submit a scan request programmatically” function, but—until then—you can retrieve existing sites in their database or manually enter one for later retrieval.

The package is a WIP but has enough bits to be useful now to, say, see just how promiscuous cnn.com makes you:

cnn_db <- urlscan::urlscan_search("domain:cnn.com")

latest_scan_results <- urlscan::urlscan_result(cnn_db$results$`_id`[1], TRUE, TRUE)

latest_scan_results$scan_result$lists$ips
##  [1] "151.101.65.67"   "151.101.113.67"  "2.19.34.83"     
##  [4] "2.20.22.7"       "2.16.186.112"    "54.192.197.56"  
##  [7] "151.101.114.202" "83.136.250.242"  "157.166.238.142"
## [10] "13.32.217.114"   "23.67.129.200"   "2.18.234.21"    
## [13] "13.32.145.105"   "151.101.112.175" "172.217.21.194" 
## [16] "52.73.250.52"    "172.217.18.162"  "216.58.210.2"   
## [19] "172.217.23.130"  "34.238.24.243"   "13.107.21.200"  
## [22] "13.32.159.194"   "2.18.234.190"    "104.244.43.16"  
## [25] "54.192.199.124"  "95.172.94.57"    "138.108.6.20"   
## [28] "63.140.33.27"    "2.19.43.224"     "151.101.114.2"  
## [31] "74.201.198.92"   "54.76.62.59"     "151.101.113.194"
## [34] "2.18.233.186"    "216.58.207.70"   "95.172.94.20"   
## [37] "104.244.42.5"    "2.18.234.36"     "52.94.218.7"    
## [40] "62.67.193.96"    "62.67.193.41"    "69.172.216.55"  
## [43] "13.32.145.124"   "50.31.185.52"    "54.210.114.183" 
## [46] "74.120.149.167"  "64.202.112.28"   "185.60.216.19"  
## [49] "54.192.197.119"  "185.60.216.35"   "46.137.176.25"  
## [52] "52.73.56.77"     "178.250.2.67"    "54.229.189.67"  
## [55] "185.33.223.197"  "104.244.42.3"    "50.16.188.173"  
## [58] "50.16.238.189"   "52.59.88.2"      "52.38.152.125"  
## [61] "185.33.223.80"   "216.58.207.65"   "2.18.235.40"    
## [64] "69.172.216.58"   "107.23.150.218"  "34.192.246.235" 
## [67] "107.23.209.129"  "13.32.145.107"   "35.157.255.181" 
## [70] "34.228.72.179"   "69.172.216.111"  "34.205.202.95"

latest_scan_results$scan_result$lists$countries
## [1] "US" "EU" "GB" "NL" "IE" "FR" "DE"

latest_scan_results$scan_result$lists$domains
##  [1] "cdn.cnn.com"                    "edition.i.cdn.cnn.com"         
##  [3] "edition.cnn.com"                "dt.adsafeprotected.com"        
##  [5] "pixel.adsafeprotected.com"      "securepubads.g.doubleclick.net"
##  [7] "tpc.googlesyndication.com"      "z.moatads.com"                 
##  [9] "mabping.chartbeat.net"          "fastlane.rubiconproject.com"   
## [11] "b.sharethrough.com"             "geo.moatads.com"               
## [13] "static.adsafeprotected.com"     "beacon.krxd.net"               
## [15] "revee.outbrain.com"             "smetrics.cnn.com"              
## [17] "pagead2.googlesyndication.com"  "secure.adnxs.com"              
## [19] "0914.global.ssl.fastly.net"     "cdn.livefyre.com"              
## [21] "logx.optimizely.com"            "cdn.krxd.net"                  
## [23] "s0.2mdn.net"                    "as-sec.casalemedia.com"        
## [25] "errors.client.optimizely.com"   "social-login.cnn.com"          
## [27] "invocation.combotag.com"        "sb.scorecardresearch.com"      
## [29] "secure-us.imrworldwide.com"     "bat.bing.com"                  
## [31] "jadserve.postrelease.com"       "ssl.cdn.turner.com"            
## [33] "cnn.sdk.beemray.com"            "static.chartbeat.com"          
## [35] "native.sharethrough.com"        "www.cnn.com"                   
## [37] "btlr.sharethrough.com"          "platform-cdn.sharethrough.com" 
## [39] "pixel.moatads.com"              "www.summerhamster.com"         
## [41] "mms.cnn.com"                    "ping.chartbeat.net"            
## [43] "analytics.twitter.com"          "sharethrough.adnxs.com"        
## [45] "match.adsrvr.org"               "gum.criteo.com"                
## [47] "www.facebook.com"               "d3qdfnco3bamip.cloudfront.net" 
## [49] "connect.facebook.net"           "log.outbrain.com"              
## [51] "serve2.combotag.com"            "rva.outbrain.com"              
## [53] "odb.outbrain.com"               "dynaimage.cdn.cnn.com"         
## [55] "data.api.cnn.io"                "aax.amazon-adsystem.com"       
## [57] "cdns.gigya.com"                 "t.co"                          
## [59] "pixel.quantserve.com"           "ad.doubleclick.net"            
## [61] "cdn3.optimizely.com"            "w.usabilla.com"                
## [63] "amplifypixel.outbrain.com"      "tr.outbrain.com"               
## [65] "mab.chartbeat.com"              "data.cnn.com"                  
## [67] "widgets.outbrain.com"           "secure.quantserve.com"         
## [69] "static.ads-twitter.com"         "amplify.outbrain.com"          
## [71] "tag.bounceexchange.com"         "adservice.google.com"          
## [73] "adservice.google.com.ua"        "www.googletagservices.com"     
## [75] "cdn.adsafeprotected.com"        "js-sec.indexww.com"            
## [77] "ads.rubiconproject.com"         "c.amazon-adsystem.com"         
## [79] "www.ugdturner.com"              "a.postrelease.com"             
## [81] "cdn.optimizely.com"             "cnn.com"

O_o

FIN

Again, kick the tyres, file issues/PRs and drop a note if you’ve found something interesting as a result of any (or all!) of the packages.

(FWIW I think I even caused myself pain due to the title of this blog post).

Kaiser Fung (@junkcharts) did a makeover post on this chart about U.S. steel tariffs:

Kaiser’s makeover is good (Note: just because I said “good” does not mean I’m endorsing the use of pie charts):

But, I’m curious as to what others would do with the data. Here’s my stab at a single-geom makeover:

library(waffle)
library(viridis)
library(tidyverse)

data_frame(
  country = c("Rest of World", "Canada*", "Brazil*", "South Korea", "Mexico", 
              "Russia", "Turkey", "Japan", "Taiwan", "Germany", "India"),
  pct = c(22, 16, 13, 10, 9, 9, 7, 5, 4, 3, 2)
) %>% 
  mutate(country = sprintf("%s (%s%%)", country, pct)) %>% 
  waffle(
    colors = c("gray70", viridis_pal(option = "plasma")(10))
  ) +
  labs(
    title = "U.S. Steel Imports — YTD 2017 Percent of Volume",
    subtitle = "Ten nations account for ~80% of U.S. steel imports.",
    caption = "Source: IHS Global Trade Atlas • YTD through September 2017\n* Canada & Brazil are not impacted by the proposed tariffs"
  ) +
  theme_ipsum_ps() +
  theme(legend.position = "top") +
  theme(axis.text = element_blank()) +
  theme(title = element_text(hjust=0.5)) +
  theme(plot.title = element_text(hjust=0.5)) +
  theme(plot.subtitle = element_text(hjust=0.5)) +
  theme(plot.caption = element_text(hjust=1))

The percentages are included in the legend titles in the event that some readers of the chart may want to know the specific numbers, but my feeling for the intent of the original pac-man pies was to provide a list that didn’t include China-proper (despite 45 using them to rile up his base) and give a sense of proportion for the “top 10”. The waffle chart isn’t perfect for it, but it is one option.

How would you use the data (provided in the R snippet) to communicate the message you think needs to be communicated? Drop a note in the comments with a link to your creation(s) if you do give the data a spin.

I work with internet-scale data and do my fair share of macro-analyses on vulnerabilities. I use the R semver package for most of my work and wanted to blather on a bit about it since it’s super-helpful for this work and doesn’t get the attention it deserves. semver makes it possible to create charts like this:

which are very helpful in when conducting exposure analytics.

We’ll need a few packages to help us along the way:

library(here) # file mgmt
library(semver) # the whole purpose of the blog post
library(rvest) # we'll need this to get version->year mappings
library(stringi) # b/c I'm still too lazy to switch to ore
library(hrbrthemes) # pretty graphs
library(tidyverse) # sane data processing idioms

By issuing a stats command to a memcached instance you can get a full list of statistics for the server. The recent newsmaking DDoS used this feature in conjunction with address spoofing to create 30 minutes of chaos for GitHub.

I sent a stats command (followed by a newline) to a vanilla memcached installation and it returned 53 lines (1108 bytes) of STAT results that look something like this:

STAT pid 7646
STAT uptime 141
STAT time 1520447469
STAT version 1.4.25 Ubuntu
STAT libevent 2.0.21-stable
...

The version bit is what we’re after, but there are plenty of other variables you could just as easily focus on if you use memcached in any production capacity.

I extracted raw version response data from our most recent scan for open memcached servers on the internet. For ethical reasons, I cannot blindly share the entire raw data set but hit up research@rapid7.com if you have a need or desire to work with this data.

Let’s read it in and take a look:

version_strings <- read_lines(here("data", "versions.txt"))

set.seed(2018-03-07)

sample(version_strings, 50)

##  [1] "STAT version 1.4.5"             "STAT version 1.4.17"           
##  [3] "STAT version 1.4.25"            "STAT version 1.4.31"           
##  [5] "STAT version 1.4.25"            "STAT version 1.2.6"            
##  [7] "STAT version 1.2.6"             "STAT version 1.4.15"           
##  [9] "STAT version 1.4.17"            "STAT version 1.4.4"            
## [11] "STAT version 1.4.5"             "STAT version 1.2.6"            
## [13] "STAT version 1.4.2"             "STAT version 1.4.14 (Ubuntu)"  
## [15] "STAT version 1.4.7"             "STAT version 1.4.39"           
## [17] "STAT version 1.4.4-14-g9c660c0" "STAT version 1.2.6"            
## [19] "STAT version 1.2.6"             "STAT version 1.4.14"           
## [21] "STAT version 1.4.4-14-g9c660c0" "STAT version 1.4.37"           
## [23] "STAT version 1.4.13"            "STAT version 1.4.4"            
## [25] "STAT version 1.4.17"            "STAT version 1.2.6"            
## [27] "STAT version 1.4.37"            "STAT version 1.4.13"           
## [29] "STAT version 1.4.25"            "STAT version 1.4.15"           
## [31] "STAT version 1.4.25"            "STAT version 1.2.6"            
## [33] "STAT version 1.4.10"            "STAT version 1.4.25"           
## [35] "STAT version 1.4.25"            "STAT version 1.4.9"            
## [37] "STAT version 1.4.30"            "STAT version 1.4.21"           
## [39] "STAT version 1.4.15"            "STAT version 1.4.31"           
## [41] "STAT version 1.4.13"            "STAT version 1.2.6"            
## [43] "STAT version 1.4.13"            "STAT version 1.4.15"           
## [45] "STAT version 1.4.19"            "STAT version 1.4.25 Ubuntu"    
## [47] "STAT version 1.4.37"            "STAT version 1.4.4-14-g9c660c0"
## [49] "STAT version 1.2.6"             "STAT version 1.4.25 Ubuntu"

It’s in decent shape, but it needs some work if we’re going to do a version analysis with it. Let’s clean it up a bit:

data_frame(
  string = stri_match_first_regex(version_strings, "STAT version (.*)$")[,2]
) -> versions

count(versions, string, sort = TRUE) %>%
  knitr::kable(format="markdown")
string n
1.4.15 1966
1.2.6 1764
1.4.17 1101
1.4.37 949
1.4.13 725
1.4.4 531
1.4.25 511
1.4.20 368
1.4.14 (Ubuntu) 334
1.4.21 309
1.4.25 Ubuntu 290
1.4.24 259

Much better! However, we really only need the major parts of the semantic version string for a macro view, so let’s remove non-version strings completely and extract just the major, minor and patch bits:

filter(versions, !stri_detect_fixed(string, "UNKNOWN")) %>% # get rid of things we can't use
  mutate(string = stri_match_first_regex(
    string, "([[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+)")[,2] # for a macro-view, the discrete sub-versions aren't important
  ) -> versions

count(versions, string, sort = TRUE) %>%
  knitr::kable(format="markdown")
string n
1.4.15 1966
1.2.6 1764
1.4.17 1101
1.4.37 949
1.4.25 801
1.4.4 747
1.4.13 727
1.4.14 385
1.4.20 368
1.4.21 309
1.4.24 264

Much, much better! Now, let’s dig into the versions a bit. Using semver is dirt-simple. Just use parse_version() to get the usable bits out:

ex_ver <- semver::parse_version(head(versions$string[1]))

ex_ver
## [1] Maj: 1 Min: 4 Pat: 25

str(ex_ver)
## List of 1
##  $ :Class 'svptr' <externalptr> 
##  - attr(*, "class")= chr "svlist"

It’s a special class, referencing an external pointer (the package relies on an underling C++ library and wraps everything up in a bow for us).

These objects can be compared, ordered, sorted, etc but I tend to just turn the parsed versions into a data frame that can be associated back with the main strings. That way we keep things pretty tidy and have tons of flexibility.

bind_cols(
  versions,
  pull(versions, string) %>%
    semver::parse_version() %>%
    as.data.frame()
) %>%
  arrange(major, minor, patch) %>%
  mutate(string = factor(string, levels = unique(string))) -> versions

versions
## # A tibble: 11,157 x 6
##    string major minor patch prerelease build
##    <fct>  <int> <int> <int> <chr>      <chr>
##  1 1.2.0      1     2     0 ""         ""   
##  2 1.2.0      1     2     0 ""         ""   
##  3 1.2.5      1     2     5 ""         ""   
##  4 1.2.5      1     2     5 ""         ""   
##  5 1.2.5      1     2     5 ""         ""   
##  6 1.2.5      1     2     5 ""         ""   
##  7 1.2.5      1     2     5 ""         ""   
##  8 1.2.5      1     2     5 ""         ""   
##  9 1.2.5      1     2     5 ""         ""   
## 10 1.2.5      1     2     5 ""         ""   
## # ... with 11,147 more rows

Now we have a tidy data frame and I did the extra step of creating an ordered factor out of the version strings since they are ordinal values. With just this step, we have everything we need to do a basic plot shoing the version counts in-order:

count(versions, string) %>%
  ggplot() +
  geom_segment(
    aes(string, n, xend = string, yend = 0),
    size = 2, color = "lightslategray"
  ) +
  scale_y_comma() +
  labs(
    x = "memcached version", y = "# instances found",
    title = "Distribution of memcached versions"
  ) +
  theme_ipsum_ps(grid = "Y") +
  theme(axis.text.x = element_text(hjust = 1, vjust = 0.5, angle = 90))

memcached versions (raw)

That chart is informative on its own since we get the perspective that there are some really old versions exposed. But, how old are they? Projects like Chrome or Firefox churn through versions regularly/quickly (on purpose). To make more sense out of this we’ll need more info on releases.

This is where things can get ugly for folks who do not have commercial software management databases handy (or are analyzing a piece of software that hasn’t made it to one of those databases yet). The memcached project maintains a wiki page of version history that’s mostly complete, and definitely complete enough for this exercise. It will some processing before we can associate a version to a year.

GitHub does not allow scraping of their site and — off the top of my head — I do not know if there is a “wiki” API endpoint, but I do know that you can tack on .wiki.git to the end of a GitHub repo to clone the wiki pages, so we’ll use that knowledge and the git2r package to gain access to the
ReleaseNotes.md file that has the data we need:

td <- tempfile("wiki", fileext="git") # temporary "directory"

dir.create(td)

git2r::clone(
  url = "git@github.com:memcached/memcached.wiki.git",
  local_path = td,
  credentials = git2r::cred_ssh_key() # need GH ssh keys setup!
) -> repo
## cloning into '/var/folders/1w/2d82v7ts3gs98tc6v772h8s40000gp/T//Rtmpb209Sk/wiki180eb3c6addcbgit'...
## Receiving objects:   1% (5/481),    8 kb
## Receiving objects:  11% (53/481),    8 kb
## Receiving objects:  21% (102/481),   49 kb
## Receiving objects:  31% (150/481),   81 kb
## Receiving objects:  41% (198/481),  113 kb
## Receiving objects:  51% (246/481),  177 kb
## Receiving objects:  61% (294/481),  177 kb
## Receiving objects:  71% (342/481),  192 kb
## Receiving objects:  81% (390/481),  192 kb
## Receiving objects:  91% (438/481),  192 kb
## Receiving objects: 100% (481/481),  192 kb, done.

read_lines(file.path(repo@path, "ReleaseNotes.md")) %>%
  keep(stri_detect_fixed, "[[ReleaseNotes") %>%
  stri_replace_first_regex(" \\* \\[\\[.*]] ", "") %>%
  stri_split_fixed(" ", 2, simplify = TRUE) %>%
  as_data_frame() %>%
  set_names(c("string", "release_year")) %>%
  mutate(string = stri_trim_both(string)) %>%
  mutate(release_year = stri_replace_first_fixed(release_year, "(", "")) %>% # remove leading parens
  mutate(release_year = stri_replace_all_regex(release_year, "\\-.*$", "")) %>% # we only want year so remove remaining date bits from easy ones
  mutate(release_year = stri_replace_all_regex(release_year, "^.*, ", "")) %>% # take care of most of the rest of the ugly ones
  mutate(release_year = stri_replace_all_regex(release_year, "^[[:alpha:]].* ", "")) %>% # take care of the straggler
  mutate(release_year = stri_replace_last_fixed(release_year, ")", "")) %>% # remove any trailing parens
  mutate(release_year = as.numeric(release_year)) -> memcached_releases # make it numeric

unlink(td, recursive = TRUE) # cleanup the git repo we downloaded

memcached_releases
## # A tibble: 49 x 2
##    string release_year
##    <chr>         <dbl>
##  1 1.5.6          2018
##  2 1.5.5          2018
##  3 1.5.4          2017
##  4 1.5.3          2017
##  5 1.5.2          2017
##  6 1.5.1          2017
##  7 1.5.0          2017
##  8 1.4.39         2017
##  9 1.4.38         2017
## 10 1.4.37         2017
## # ... with 39 more rows

We have more versions in our internet-scraped memcached versions data
set than this wiki page has on it, so we need to restrict the official
release history to what we have. Then, we only want a single instance of
each year for the annotations, so we’ll have to do some further processing:

filter(memcached_releases, string %in% unique(versions$string)) %>%
  mutate(string = factor(string, levels = levels(versions$string))) %>%
  group_by(release_year) %>%
  arrange(desc(string)) %>%
  slice(1) %>%
  ungroup() -> annotation_df

knitr::kable(annotation_df, "markdown")
string release_year
1.4.4 2009
1.4.5 2010
1.4.10 2011
1.4.15 2012
1.4.17 2013
1.4.22 2014
1.4.25 2015
1.4.33 2016
1.5.4 2017
1.5.6 2018

Now, we’re ready to add the annotation layers! We’ll take a blind stab at it before adding in further aesthetic customization:

version_counts <- count(versions, string) # no piping this time

ggplot() +
  geom_blank(data = version_counts,aes(string, n)) + # prime the scales
  geom_vline(
    data = annotation_df, aes(xintercept = as.numeric(string)),
    size = 0.5, linetype = "dotted", color = "orange"
  ) +
  geom_segment(
    data = version_counts,
    aes(string, n, xend = string, yend = 0),
    size = 2, color = "lightslategray"
  ) +
  geom_label(
    data = annotation_df, aes(string, Inf, label=release_year),
    family = font_ps, size = 2.5, color = "lightslateblue",
    hjust = 0, vjust = 1, label.size = 0
  ) +
  scale_y_comma() +
  labs(
    x = "memcached version", y = "# instances found",
    title = "Distribution of memcached versions"
  ) +
  theme_ipsum_ps(grid = "Y") +
  theme(axis.text.x = element_text(hjust = 1, vjust = 0.5, angle = 90))

Almost got it in ggpar 1! We need to tweak this so that the labels do not overlap each other and do not obstruct the segment bars. We can do most of this work in geom_segment() itself, plus add a bit of a tweak to the Y axis scale:

ggplot() +
  geom_blank(data = version_counts,aes(string, n)) + # prime the scales
  geom_vline(
    data = annotation_df, aes(xintercept = as.numeric(string)),
    size = 0.5, linetype = "dotted", color = "orange"
  ) +
  geom_segment(
    data = version_counts,
    aes(string, n, xend = string, yend = 0),
    size = 2, color = "lightslategray"
  ) +
  geom_label(
    data = annotation_df, aes(string, Inf, label=release_year), vjust = 1,
    family = font_ps, size = 2.5, color = "lightslateblue", label.size = 0,
    hjust = c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0),
    nudge_x = c(-0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1, 0.1)
  ) +
  scale_y_comma(limits = c(0, 2050)) +
  labs(
    x = "memcached version", y = "# instances found",
    title = "Distribution of memcached versions"
  ) +
  theme_ipsum_ps(grid = "Y") +
  theme(axis.text.x = element_text(hjust = 1, vjust = 0.5, angle = 90))

Now, we have version and year info to we can get a better idea of the scope of exposure (and, just how much technical debt many organizations have accrued).

With the ordinal version inforamtion we can also perform other statistical operations as well. All due to the semver package.

You can find this R project over at GitHub.

Tis the season for finding out how well Maine fisherfolk did last year; specifically, Maine lobsterfolk.

Most of the news sites in Maine do a feature on the annual landings (here’s one from Bangor Daily News). There was a marked decline — the largest ever — in both poundage and revenue in 2017 and many sources point to the need to improve fishery management to help ensure both the environmental and economic health of the state.

My preferred view for this annual catch comparison is a connected scatterplot, tracing a path along the years. That way you get the feel of a time-series with the actual poundage-to-value without having to resort to two charts or (heaven forbid) a dual-geom/dual-axis chart.

The State of Maine Department of Marine Resources makes the data available but it’s in a PDF:

Thankfully, the PDF is not obfuscated and is just a plain table so it’s easy to parse and turn into:

The code to retrieve the PDF, parse it and produce said connected scatterplot is below.

library(stringi)
library(pdftools)
library(hrbrthemes)
library(tidyverse)

lobster_by_county_url <- "https://www.maine.gov/dmr/commercial-fishing/landings/documents/lobster.county.pdf"
lobster_by_county_fil <- basename(lobster_by_county_url)

if (!file.exists(lobster_by_county_fil)) download.file(lobster_by_county_url, lobster_by_county_fil)

# read in the PDF
lobster_by_county_pgs <- pdftools::pdf_text(lobster_by_county_fil)

map(lobster_by_county_pgs, stri_split_lines) %>% # split each page into lines
  flatten() %>%
  flatten_chr() %>%
  keep(stri_detect_fixed, "$") %>% # keep only lines with "$" in them
  stri_trim_both() %>% # clean up white space
  stri_split_regex("\ +", simplify = TRUE) %>% # get the columns
  as_data_frame() %>%
  mutate_at(c("V3", "V4"), lucr::from_currency) %>% # turn the formatted text into numbers
  set_names(c("year", "county", "pounds", "value")) %>% # better column names
  filter(county != "TOTAL") %>% # we'll calculate our own, thank you
  mutate(year = as.Date(sprintf("%s-01-01", year))) %>% # I like years to be years for plotting
  mutate(county = stri_trans_totitle(county)) -> lobster_by_county_df

arrange(lobster_by_county_df, year) %>%
  mutate(value = value / 1000000, pounds = pounds / 1000000) %>% # easier on the eyes
  group_by(year) %>%
  summarise(pounds = sum(pounds), value = sum(value)) %>%
  mutate(year_lab = lubridate::year(year)) %>%
  mutate(highlight = ifelse(year_lab == 2017, "2017", "Other")) %>% # so we can highlight 2017
  ggplot(aes(pounds, value)) +
  geom_path() +
  geom_label(
    aes(label = year_lab, color = highlight, size = highlight),
    family = font_ps, show.legend = FALSE
  ) +
  scale_x_comma(name = "Pounds (millions) →", limits = c(0, 150)) +
  scale_y_comma(name = "$ USD (millions) →", limits = c(0, 600)) +
  scale_color_manual(values = c("2017" = "#742111", "Other" = "#2b2b2b")) +
  scale_size_manual(values = c("2017" = 6, "Other" = 4)) +
  labs(
    title = "Historical Maine Fisheries Landings Data — Lobster (1964-2017)",
    subtitle = "All counties combined; Not adjusted for inflation",
    caption = "The 2002 & 2003 landings may possibly reflect the increased effort by DMR to collect voluntary landings from some lobster dealers;\nLobster reporting became mandatory in 2004 for all Maine dealers buying directly from harvesters.\nSource: <https://www.maine.gov/dmr/commercial-fishing/landings/historical-data.html>"
  ) +
  theme_ipsum_ps(grid = "XY")

What’s Up?

The NPR Visuals Team created and maintains a javascript library that makes it super easy to embed iframes on web pages and have said documents still be responsive.

The widgetframe R htmlwidget uses pym.js to bring this (much needed) functionality into widgets and (eventually) shiny apps.

NPR reported a critical vulnerability in this library on February 15th, 2018 with no details (said details will be coming next week).

Per NPR’s guidance, any production code using pym.js needs to be pulled or updated to use this new library.

I created an issue & pushed up a PR that incorporates the new version. NOTE that the YAML config file in the existing CRAN package and GitHub dev version incorrectly has 1.3.2 as the version (it’s really the 1.3.1 dev version).

A look at the diff:

suggest that the library was not performing URL sanitization (and now is).

Watch Out For Standalone Docs

Any R markdown docs compiled in “standalone” mode will need to be recompiled and re-published as the vulnerable pym.js library comes along for the ride in those documents.

Regardless of “standalone mode”, if you used widgetframe in any context, including:

anything created is vulnerable regardless of standalone compilation or not.

FIN

Once the final details are released I’ll update this post and may do a new post. Until then:

  • check if you’ve used widgetframe (directly or indirectly)
  • REMOVE ALL VULNERABLE DOCS from RPubs, GitHub pages, your web site (etc) today
  • regenerate all standalone documents ASAP
  • regenerate your blogs, books, dashboards, etc ASAP with the patched code; DO THIS FOR INTERNAL as well as internet-facing content.
  • monitor this space

Much of what I need to do for work-work involves using tools that are (for the moment) not in R. Today, I needed to test the validity of (and other processing on) DMARC records and I’m loathe to either reinvent the wheel or reticulate bits from a fragmented programming language ecosystem unless absolutely necessary. Thankfully, there’s libopendmarc which works well on sane operating systems, but it is a C library that needs an interface to use in R.

However, I also really didn’t want to start a new package for this just yet (there will eventually be one, though, and I prefer working in a package context for Rcpp work). I just needed to run opendmarc_policy_store_dmarc() against a decent-sized chunk of domain names and already-retrieved DMARC TXT records. So, I decided to write a small “inline” cppFunction() to get’er done.

Why am I blogging about this?

Despite growing popularity and a nice examples site, many newcomers to Rcpp (literally the way you want to go when it comes to bridging C[++] and R) still voice discontent about there not being enough “easy” examples. Granted, they are quitely likely looking for full-bore tutorials covering a different, explicit use cases. The aforelinked Gallery has some of those and there are codified examples in — literally — rcppexamples. But, there definitely needs to be more blog posts, books and such linking to them and expanding upon them.

Having mentioned that I’m using cppFunction(), one could, further, ask cppFunction() has a help page with an example, so why blather about using it?”. Fair point! And, there is a reason which was hinted at in the opening paragraph.

I need to use libopendmarc and that requires making a “plugin” if I’m going to do this “inline”. For some other DMARC processing I also need to use libresolv since the library needs to make DNS requests and uses resolv. You don’t need a plugin for a package version as you just need to boilerplate some “find these libraries and get their paths right for Makevars.in” and add the linking code in there as well. Here, we need to register two plugins that provide metdata for the magic that happens under the covers when Rcpp takes your inline code, compiles it and makes the function shared object available in R.

Plugins can be complex and do transformations, but the two I needed to write are just helping ensure the right #include lines are there along with the right linker libraries. Here they are:

library(Rcpp)

registerPlugin(
  name = "libresolv",
  plugin = function(x) {
    list(
      includes = "",
      env = list(PKG_LIBS="-lresolv")
    )
  }
)

registerPlugin(
  name = "libopendmarc",
  plugin = function(x) {
    list(
      includes = "#include <opendmarc/dmarc.h>",
      env = list(PKG_LIBS="-lopendmarc")
    )
  }
)

All they do is make data structures available in the environment. We can use inline::getPlugin() to see them:

inline::getPlugin("libresolv")
## $includes
## [1] ""
##
## $env
## $env$PKG_LIBS
## [1] "-lresolv"


inline::getPlugin("libopendmarc")
## $includes
## [1] "#include <opendmarc/dmarc.h>"
## 
## $env
## $env$PKG_LIBS
## [1] "-lopendmarc"

Finally, the tiny bit of C/C++ code to take in the necessary parameters and return the result. In this case, we’re passing in a character vector of domain names and DMARC records and getting back a logical vector with the test results. Apart from the necessary initialization and cleanup code for libopendmarc this is an idiom you’ll recognize if you look over packages that use Rcpp.

cppFunction(
std::vector< bool > is_dmarc_valid(std::vector< std::string> domains,
                                   std::vector< std::string> dmarc_records) {

  std::vector< bool > out(dmarc_records.size());

  DMARC_POLICY_T *pctx;
  OPENDMARC_STATUS_T status;

  pctx = opendmarc_policy_connect_init((u_char *)"1.2.3.4", 0);

  for (unsigned int i=0; i<dmarc_records.size(); i++) {

    status = opendmarc_policy_store_dmarc(
      pctx,
      (u_char *)dmarc_records[i].c_str(),
      (u_char *)domains[i].c_str(),
      NULL
    );

    out[i] = (status == DMARC_PARSE_OKAY);

    pctx = opendmarc_policy_connect_rset(pctx);

  }

  pctx = opendmarc_policy_connect_shutdown(pctx);

  return(out);

}
,
plugins=c("libresolv", "libopendmarc"))

(Note: the code-formatting plugin was tossing a serious fit about the long text field so you’ll need to put a single quote after cppFunction( and before the line with the , if you’re cutting and pasting at home).

Right at the end, the final parameter is telling cppFunction() what plugins to use.

Executing that line shunts a modified version of the function to disk, compiles it and lets us use the function in R (use cacheDir, showOutput and verbose parameters to control how many gory details lie undeneath this pristine shell).

After running the function, is_dmarc_valid() is available in the environment and ready to use.

domains <- c("bit.ly", "bizible.com", "blackmountainsystems.com", "blackspoke.com")
dmarc <-  c("v=DMARC1; p=none; pct=100; rua=mailto:dmarc@bit.ly; ruf=mailto:ruf@dmarc.bitly.net; fo=1;", 
            "v=DMARC1; p=reject; fo=1; rua=mailto:postmaster@bizible.com; ruf=mailto:forensics@bizible.com;", 
            "v=DMARC1; p=quarantine; pct=100; rua=mailto:demarcrecords@blkmtn.com, mailto:ttran@blkmtn.com", 
            "user.cechire.com.")

is_dmarc_valid(domains, dmarc)
## [1]  TRUE  TRUE  TRUE FALSE

Processing those 5 took just about 10 microseconds which meant I could process the ~1,000,000 domains+DMARCs in no time at all. And, I have something I can use in a DMARC utility package (coming “soon”).

Hopefully this was a useful reference for both hooking up external libraries to “inline” Rcpp functions and for how to go about doing this type of thing in general.