hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Homebrew 2.0.0 Released == homebrewanalytics package updated

A major new release of Homebrew has landed and now includes support for Linux as well as Windows! via the Windows Subsystem for Linux. There are overall stability and speed improvements baked in as well. The aforelinked notification has all the info you need to see the minutiae. Unless you’ve been super-lax in updating, brew update will get you the latest release.

There are extra formulae analytics endpoints and the homebrewanalytics? R package has been updated to handle them. A change worth noting in the package is that all the API calls are memoised to avoid hammering the Homebrew servers (though the “API” is really just file endpoints and they aren’t big files but bandwidth is bandwidth). Use the facilities in the memoise package to invalidate the cache if you have long running scripts.

Use your favorite social coding site to install it (If I don’t maintain mirrors on your open social coding platform of choice just drop a note in the comments and I’ll start mirroring there as well):

devtools::install_git("https://git.sr.ht/~hrbrmstr/homebrewanalytics")
# or
devtools::install_gitlab("hrbrmstr/homebrewanalytics")
# or
devtools::install_github("hrbrmstr/homebrewanalytics")

The README and in-package manual pages provide basic examples of retrieving data. But we can improve upon those here, such as finding out the dependency distribution of Homebrew formulae:

library(hrbrthemes)
library(homebrewanalytics) # git.sr.hr/~hrbrmstr ; git[la|hu]b/hrbrmstr
library(tidyverse)

f <- brew_formulae()

mutate(f, n_dependencies = lengths(build_dependencies)) %>% 
  count(n_dependencies) %>% 
  mutate(n_dependencies = factor(n_dependencies)) %>% 
  ggplot() +
  geom_col(aes(n_dependencies, n), fill = ft_cols$slate, width = 0.65) +
  scale_y_comma("# formulae") +
  labs(
    x = "# Dependencies",
    title = "Dependency distribution for Homebrew formulae"
  ) +
  theme_ft_rc(grid="Y")

Given how long it sometimes takes to upgrade my various Homebrew installations I was surprised to see 0 be so prevalent, but one of the major changes in 2.0.0 is going to be more binary installs (unless you really need custom builds) so that is likely part of my experience, especially with the formulae I need to support cybersecurity and spatial operations.

We can also see which formuale are in the top 50%:

unlist(f$dependencies) %>% 
  table(dnn = "library") %>% 
  broom::tidy() %>% 
  arrange(desc(n)) %>% 
  mutate(pct = n/sum(n), cpct = cumsum(pct)) %>% 
  filter(cpct <= 0.5) %>% 
  mutate(pct = scales::percent(pct)) %>% 
  mutate(library = factor(library, levels = rev(library))) %>% 
  ggplot(aes(n, library)) +
  geom_segment(aes(xend=0, yend=library), color = ft_cols$slate, size=3.5) +
  geom_text(
    aes(x = (n+max(n)*0.005), label = sprintf("%s (%s)", n, pct)), 
    hjust = 0, size = 3, family = font_rc, color = ft_cols$gray
  ) +
  scale_x_comma(position = "top", limits=c(0, 500)) +
  labs(
    x = "# package using the library", y = NULL,
    title = "Top 50% of libraries used across Homebrew formulae"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_text(family = "mono"))

It seems openssl is pretty popular (not surprising but always good to see cybersecurity things at the top of good lists for a change)! macOS ships with an even more dreadful (I know that’s hard to imagine) default Python setup than usual so it being number 2 is not unexpected.

And, finally, we can also check on how frequently formulae are installed. Let’s look back on the last 90 days:

ggplot() +
  geom_density(
    aes(x = installs$count, y = stat(count)),
    color = ft_cols$slate, fill = alpha(ft_cols$slate, 1/2)
  ) +
  scale_x_comma("# install events", trans = "log10") +
  scale_y_comma("# formulae") +
  labs(
    title = "Homebrew Formulate 'Install Events' Distribution (Past 90 days)"
  ) +
  theme_ft_rc(grid="XY")

I’ll let you play with the package to find out who the heavy hitters are and explore more about the Homebrew ecosystem.

FIN

Kick the tyres. File issues & PRs and a hearty “Welcome!” to the Homebrew ecosystem for Linux and Windows users. My hope is that the WSL availability will eventually make it easier to develop for Windows systems and avoid the “download the kinda sketchy compiled windows libraries from github on package install” practice we have today.

If you crank out some analytics using the packages don’t forget to blog about it and drop a link in the comments!

Fast Static Maps Built with R

Luke Whyte posted an article (apologies for a Medium link) over on Towards Data Science showing how to use a command line workflow involving curl, node and various D3 libraries and javascript source files to build a series of SVG static maps. It’s well written and you should give it a read especially since he provides the code and data.

We can do all of that in R with the help of a couple packages and by using a free geocoding service which will also allow us to put more data on the map, albeit with some extra work due to it returning weird values for Hawaii locations.

library(albersusa) # git.sr.ht/~hrbrmstr/albersusa | git[la|hu]b/hrbrmstr/albersusa
library(rgeocodio) # git.sr.ht/~hrbrmstr/rgeocodio | git[la|hu]b/hrbrmstr/rgeocodio
library(tidyverse)

# the data url from the original blog
fil <- "https://query.data.world/s/awyahzfiikyoqi5ygpvauqslwmqltr"

read_csv(fil, col_types = "cd") %>% 
  select(area=1, pct=2) %>% 
  mutate(pct = pct/100) -> xdf # make percents proper percents

gc <- gio_batch_geocode(xdf$area)

The result of the geocoding is a data frame that has various confidences associated with the result. We’ll pick the top one for each and then correct for the errant Hawaii longitude it gives back:

map2_df(gc$query, gc$response_results, ~{
  out <- .y[1,,]
  out$area <- .x
  out
}) %>% 
  filter(!is.na(location.lat)) %>% 
  select(area, state = address_components.state, lat=location.lat, lon=location.lng) %>% 
  mutate(
    lat = ifelse(grepl("Honolu", area), 21.3069, lat),
    lon = ifelse(grepl("Honolu", area), -157.8583, lon)
  ) %>% 
  left_join(xdf) %>% 
  as_tibble() -> area_pct

area_pct
## # A tibble: 47 x 5
##    area                                        state   lat    lon   pct
##    <chr>                                       <chr> <dbl>  <dbl> <dbl>
##  1 McAllen-Edinburg-Mission, TX                TX     26.2  -98.1 0.102
##  2 Houston-The Woodlands-Sugar Land, TX        TX     29.6  -95.8 0.087
##  3 Santa Maria-Santa Barbara, CA               CA     34.4 -120.  0.081
##  4 Las Vegas-Henderson-Paradise, NV            NV     36.1 -115.  0.08 
##  5 Los Angeles-Long Beach-Anaheim, CA          CA     33.9 -118.  0.075
##  6 Miami-Fort Lauderdale-West Palm Beach, FL   FL     26.6  -80.1 0.073
##  7 Dallas-Fort Worth-Arlington, TX             TX     33.3  -98.4 0.069
##  8 Washington-Arlington-Alexandria, DC-VA-MD-… WV     39.2  -81.7 0.068
##  9 Bridgeport-Stamford-Norwalk, CT             CT     41.3  -73.1 0.067
## 10 San Jose-Sunnyvale-Santa Clara, CA          CA     37.4 -122.  0.065
## # … with 37 more rows

The albersusa package provides base maps with Alaska & Hawaii elided into a composite U.S. map. As such, we need to elide any points that are in Alaska & Hawaii:

us <- usa_composite()
us_map <- fortify(us, region="name")

hi <- select(filter(area_pct, state == "HI"), lon, lat)
(hi <- points_elided(hi))
area_pct[area_pct$state == "HI", c("lon", "lat")] <- hi

Then, it’s just a matter of using ggplot2:

ggplot() +
  geom_map(
    data = us@data, map=us_map,
    aes(map_id=name), 
    fill = "white", color = "#2b2b2b", size = 0.1
  ) +
  geom_point(
    data = area_pct, aes(lon, lat, size = pct), 
    fill = alpha("#b30000", 1/2), color = "#b30000", shape=21
  ) +
  ggalt::coord_proj(us_laea_proj) +
  scale_y_continuous(expand=c(0, 3)) +
  scale_radius(
    name = NULL, label = scales::percent_format(1)
  ) +
  labs(x = "Estimated percent of undocumented residents in U.S. metro areas. Source: Pew Research Center") +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  theme(axis.title.x = element_text(hjust=0.5, size = 8)) +
  theme(axis.title.y = element_blank()) +
  theme(panel.grid = element_blank()) +
  theme(legend.position = c(0.9, 0.3)) -> gg

ggsave(filename = "map.svg", device = "svg", plot = gg, height = 5, width = 7)

Unlike the post’s featured image (which has to be a bitmap…grrr) the resultant SVG is below:

FIN

There is absolutely nothing wrong with working where you’re most comfortable and capable and Luke definitely wields the command line and javascript incredibly well. This alternate way of doing things in R may help other data journalists who are more comfortable in R or want to increase their R knowledge replicate and expand upon Luke’s process.

If you’ve got alternate ways of doing this in R or even (gasp) Python, drop a note in the comments with a link to your blog so folks who are comfortable neither in R nor the command line can see even more ways of producing this type of content.

Quick Hit: Using seymour to Subscribe to your Git[la|hu]b Repo Issues in Feedly

The seymour? Feedly API package has been updated to support subscribing to RSS/Atom feeds. Previously the package was intended to just treat your Feedly as a data source, but there was a compelling use case for enabling subscription support: subscribing to code repository issues. Sure, there’s already email notice integration for repository issues on most social coding platforms but if you always have Feedly up (like I usually do) having issues aggregated into a Feedly category may be a better way to keep tabs on what’s going on.

If you use GitLab, that platform already has RSS feeds for public repositories. GitHub users have to use either RSSHub or gh-feed to do the same (note that you can host your own instance of either of those tools).

If you have more than a few repos and want to have their issues shunted to Feedly you could go manually enter them into the Feedly UI but that could be a pain, especially if “more than a few” is in the dozens or hundreds. But, we have R and can automate this. I’m providing an example for GitHub (since most readers art still stuck on that legacy platform) via the gh package but the same methods can be done for GitLab using the gitlabr? package.

First, we need to get the list of public GitHub issues you own:

library(gh)
library(purrr)
library(seymour) # git.sr.ht/~hrbrmstr/seymour, git[la|hu]b/hrbrmstr/seymour

gh::gh(
  "/user/repos", 
  visibility = "public",
  affiliation = "owner",
  sort = "created", direction = "desc", 
  .token = Sys.getenv("GITHUB_PAT") # see ?devtools::github_pat
) -> gh_repos

If you have more than 30 the gh package has a gh_next() function which will enable you to paginate through all your repos.

Now you need to choose between gh-feed or RSSHub and prepend a special URL prefix to a user/repo string to create a usable RSS URL. These are the prefixes:

We’ll use RSSHub for the remainder of the example.

Let’s do this first in base R. feedly_subscribe() takes an RSS/Atom URL as a parameter and optionally supports supplying a title and a vector of Feedly category names to help organize your new addition. The title will be automagically intuited by Feedly if not supplied. We’ll iterate over the return value of our call to the GitHub API and add subscribe to each repo.

do.call(
  rbind.data.frame,
  lapply(sapply(gh_repos, "[[", "name"), function(.x) {
    feedly_subscribe(
      feed_url = sprintf("https://rsshub.app/github/%s/%s", "hrbrmstr", .x),
      category = "github issues"
    )
  }) 
) -> res

The res value is a data frame that just has the resultant metadata about feed ids and where they’re located.

Here’s the same thing tidyverse-style:

map_chr(repos, "name") %>% 
  sprintf("https://rsshub.app/github/%s/%s", "hrbrmstr", .) %>% 
  map_df(feedly_subscribe, category = "github issues") -> res

FIN

There are two other addition to the seymour package: feedly_subscriptions(). This convenience function just pulls a data frame of feeds you subscribe to. The same data could be retrieved via the existing “stream” functions but this new function is faster and more targeted. The other one is feedly_categories() which you can use to identify the categories you have. The same data could be retrieved via the “collections” functions, but this function — again — is faster and more targeted.

As usual, kick the tyres and file issues/PRs as needed.

Acquisitions and Supply Chains: The Achilles’ heel of Product/Organizational Security

2019-01-29 – 09:50
Posted in Cybersecurity, Information Security
Leave a Comment

(A reminder to folks expecting “R”/”data science” content: the feed for that is at https://rud.is/b/category/r/feed/ if you don’t want to see the occasional non-R/datasci posts.)

Over at the $WORK blog we posted some research into the fairly horrible Cisco RV320/RV325 router vulnerability. The work blog is the work blog and this blog is my blog (i.e. opinions are my own, yada, yada, yada) and I felt compelled to post a cautionary take to vendors and organizations in general on how security issues can creep into your environment as a result of acquisitions and supply chains.

Looking purely at the evidence gathered from internet scans — which include SSL certificate info — and following the trail in the historical web archive one can make an informed, speculative claim that the weakness described in CVE-2019-1653 existed well before the final company logo ended up on the product.

It appears that NetKlass was at least producing the boards for this class of SMB VPN router and ultimately ended up supplying them to Linksys. A certain giant organization bought that company (and subsequently sold it off again) and it’s very likely this vulnerability ended up in said behemoth’s lap due to both poor supply chain management — in that Linksys seems to have done no security testing on the sourced parts — and, due to the acquisition, which caused those security issues to end up in a major brand’s product inventory.

This can happen to any organization involved in sourcing hardware/software from a third party and/or involved in acquiring another company. Receiving compliance-driven checkbox forms on the efficacy of the target security programs (or sw/hw) is not sufficient but is all too common a practice. Real due diligence involves kinda-trusting then verifying that the claims are accurate.

Rigorous product testing on the part of the original sourcing organization and follow-up assurance testing at the point of acquisition would have very likely caught this issue before it became a responsibly disclosed vulnerability.

Quick Hit: Automating Production Graphics Uploads in R Markdown Documents with googledrive

As someone who measures all kinds of things on the internet as part of his $DAYJOB, I can say with some authority that huge swaths of organizations are using cloud-services such as Google Apps, Dropbox and Office 365 as part of their business process workflows. For me, one regular component that touches the “cloud” is when I have to share R-generated charts with our spiffy production team for use in reports, presentations and other general communications.

These are typically project-based tasks and data science team members typically use git- and AWS-based workflows for gathering data, performing analyses and generating output. While git is great at sharing code and ensuring the historical integrity of our analyses, we don’t expect the production team members to be or become experts in git to use our output. They live in Google Drive and thanks to the googledrive? package we can bridge the gap between code and output with just a few lines of R code.

We use “R projects” to organize things and either use spinnable R scripts or R markdown documents inside those projects to gather, clean and analyze data.

For 2019, we’re using new, work-specific R markdown templates that have one new YAML header parameter:

params:
  gdrive_folder_url: "https://drive.google.com/drive/u/2/SOMEUSELESSHEXSTRING"

which just defines the Google Drive folder URL for the final output directory in the ☁️.

Next is a new pre-configured knitr chunk call at the start of these production chart-generating documents:

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE, dev = c("png", "cairo_pdf"),
  echo = FALSE,
  fig.retina = 2,
  fig.width = 10,
  fig.height = 6,
  fig.path = "prod/charts/"
)

since the production team wants PDF so they can work with it in their tools. Our testing showed that cairo_pdf produces the best/most consistent output, but PNGs show up better in the composite HTML documents so we use that order deliberately.

The real change is the consistent naming of the fig.path directory. By doing this, all we have to do is add a few lines (again, automatically generated) to the bottom of the document to have all the output automagically go to the proper Google Drive folder:

# Upload to production ----------------------------------------------------

googledrive::drive_auth()

# locate the folder
gdrive_prod_folder <- googledrive::as_id(params$gdrive_folder_url)

# clean it out
gdrls <- googledrive::drive_ls(gdrive_prod_folder)
if (nrow(gdrls) > 0) {
  dplyr::pull(gdrls, id) %>%
    purrr::walk(~googledrive::drive_rm(googledrive::as_id(.x)))
}

# upload new
list.files(here::here("prod/charts"), recursive = TRUE, full.names = TRUE) %>%
  purrr::walk(googledrive::drive_upload, path = gdrive_prod_folder)

Now, we never have to remember to drag documents into a browser and don’t have to load invasive Google applications onto our systems to ensure the right folks have the right files at the right time. We just have to use the new R markdown document type to generate a starter analysis document with all the necessary boilerplate baked in. Plus, .httr-oauth file is automatically ignored in .gitignore so there’s no information leakage to shared git repositories.

FIN

If you want to experiment with this, you can find a pre-configured template in the markdowntemplates package over at sr.ht, GitLab, or GitHub.

If you install the package you’ll be able to select this output type right from the new document dialog:

and new template will be ready to go with no copying, cutting or pasting.

Plus, since the Google Drive folder URL is an R markdown parameter, you can also use this in script automation (provided that you’ve wired up oauth correctly for those scripts).

hrbrthemes 0.6.0 on CRAN + Other In-Development Package News

Version 0.6.0 of the hrbrthemes? package should be hitting a CRAN mirror near you soon. Apart from some general documentation and code cleanup this release includes the dark theme folks have been seeing in blog posts and tweets over the past few months. It’s called theme_ft_rc() since it is an homage to the wonderful new chart theme developed by the @ft_data crew over at the Financial Times (you can see examples from their work here).

While there was nothing stopping folks from using the GitHub version, the CRAN release makes it more widely available. There are still intermittent issues with fonts for some folks which I’ll be working on for the next release.

Since you’ve already seen lots of examples of these charts I won’t just make a gratuitous example using the theme. I will, however, make some charts based on a new data package dubbed iceout?. The iceout package was originally conceived by Ben Tupper from the Bigelow Laboratory for Ocean Sciences. I keep an eye on fellow Mainer repositories and I did not realize (but should have known) that researches keep track of when inland bodies of water freeze and thaw. The package name is derived from the term used for the thaw measurements (“ice-out” or “ice-off”).

Before becoming obsessed with this data and getting the package to the current state it is in, the original codebase worked off of a USGS Lake Ice-Out Data for New England dataset that focused solely on New England and only went up to 2005. Some digging discovered that

Maine’s Department of Agriculture and Forestry maintains online records since 2003; and,
Minnesota’s Department of Natural Resources maintains a comprehensive database of records going back to the 1800’s.

But I hit the jackpot after discovering the U.S. National Snow & Ice Data Center’s Global Lake and River Ice Phenology dataset which:

… contains freeze and breakup dates and other ice cover descriptive data for 865 lakes and rivers. Of the 542 water bodies that have records longer than 19 years, 370 are in North America and 172 are in Eurasia; 249 have records longer than 50 years; and 66 longer than 100 years. A few have data prior to 1845. These data, from water bodies distributed around the Northern Hemisphere, allow analysis of broad spatial patterns as well as long-term temporal patterns.

So, I converted the original package to a data package containing all four of those datasets plus some interactive functions for pulling “live” data and a set of “builders” to regenerate the databases. Let’s take a quick look at what’s in the NSIDC data and the global coverage area:

library(iceout) # github/hrbrmstr/iceout
library(hrbrthemes) 
library(ggplot2)
library(dplyr)

data("nsidc_iceout")

glimpse(nsidc_iceout)
## Observations: 35,918
## Variables: 37
## $ lakecode                <chr> "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1…
## $ lakename                <chr> "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Su…
## $ lakeorriver             <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", …
## $ season                  <chr> "1443-44", "1444-45", "1445-46", "1446-47", "1447-48", "1448…
## $ iceon_year              <dbl> 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, …
## $ iceon_month             <dbl> 12, 11, 12, 12, 11, 12, 12, 12, 12, 11, 12, 12, 12, 12, 12, …
## $ iceon_day               <dbl> 8, 23, 1, 2, 30, 8, 13, 8, 23, 28, 3, 5, 1, 5, 6, 20, 10, 15…
## $ iceoff_year             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_month            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_day              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ duration                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ latitude                <dbl> 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.1…
## $ longitude               <dbl> 138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.…
## $ country                 <chr> "Japan", "Japan", "Japan", "Japan", "Japan", "Japan", "Japan…
## $ froze                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ obs_comments            <chr> "calendar correction for ice_on: -30 days of original data; …
## $ area_drained            <dbl> 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, …
## $ bow_comments            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ conductivity_us         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ elevation               <dbl> 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, …
## $ filename                <chr> "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ initials                <chr> "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ inlet_streams           <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
## $ landuse_code            <chr> "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAF…
## $ largest_city_population <dbl> 52000, 52000, 52000, 52000, 52000, 52000, 52000, 52000, 5200…
## $ max_depth               <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, …
## $ mean_depth              <dbl> 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, …
## $ median_depth            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ power_plant_discharge   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ secchi_depth            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ shoreline               <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ surface_area            <dbl> 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, …
## $ state                   <chr> "Nagano Prefecture", "Nagano Prefecture", "Nagano Prefecture…
## $ iceon_date              <date> 1443-12-08, 1444-11-23, 1445-12-01, 1446-12-02, 1447-11-30,…
## $ iceon_doy               <dbl> 342, 328, 335, 336, 334, 343, 347, 342, 357, 333, 337, 339, …
## $ iceout_date             <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ iceout_doy              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

maps::map("world", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>%
  fortify() -> wrld

ggplot() + 
  ggalt::geom_cartogram(
    data = wrld, map = wrld, aes(long, lat, map_id=region), 
    fill="#3B454A",  color = "white", size = 0.125
  ) +
  geom_point(
    data = distinct(nsidc_iceout, lakeorriver, longitude, latitude),
    aes(longitude, latitude, fill = lakeorriver), 
    size = 1.5, color = "#2b2b2b", stroke = 0.125, shape = 21
  ) +
  scale_fill_manual(
    name = NULL, values = c("L"="#fdbf6f", "R"="#1f78b4"), labels=c("L" = "Lake", "R" = "River")
  ) +
  ggalt::coord_proj("+proj=wintri", ylim = range(nsidc_iceout$latitude, na.rm = TRUE)) +
  labs(title = "NSIDC Dataset Coverage") +
  theme_ft_rc(grid="") +
  theme(legend.position = c(0.375, 0.1)) +
  theme(axis.text = element_blank(), axis.title = element_blank())

W00t! Lots of data (though not all of the extra features are populated for all readings/areas)!

I think the reason the ice-out data garnered my obsession was how it can be used as another indicator that we are indeed in the midst of a climate transformation. Let’s look at the historical ice-out information for Maine inland bodies of water:

filter(nsidc_iceout, country == "United States", state == "ME") %>% 
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>% # we want the Y axis formatted as month-day so we choose a leap year to ensure we get leap dates (if any)
  ggplot(aes(iceoff_year, iceout_date)) +
  geom_point(aes(color = lakename), size = 0.5, alpha=1/4) +
  geom_smooth(aes(color = lakename), se=FALSE, method = "loess", size=0.25) +
  scale_y_date(date_labels = "%b-%d") +
  labs(
    x = NULL, y = "Ice-out Month/Day", color = NULL,
    title = "Historical Ice-out Data/Trends for Maine Inland Bodies of Water"
  ) +
  theme_ft_rc(grid="XY")

You can follow that code-pattern to look at other states. It’s also fun to look at the ice-out date distributions by latitude grouping:

filter(nsidc_iceout, !is.na(latitude) & !is.na(longitude) & !is.na(iceout_date)) %>% 
  filter(country == "United States") %>% 
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>% 
  mutate(lat_grp = cut(latitude, scales::pretty_breaks(5)(latitude), ordered_result = TRUE)) %>% 
  arrange(desc(iceoff_year)) %>% 
  ggplot() +
  ggbeeswarm::geom_quasirandom(
    aes(lat_grp, iceout_date, fill = iceoff_year), groupOnX = TRUE, 
    shape = 21, size =1, color = "white", stroke = 0.125, alpha=1/2
  ) +
  scale_y_date(date_labels = "%b-%d") +
  viridis::scale_fill_viridis(name = "Year", option = "magma") +
  labs(
    x = "Latitude Grouping", y = "Ice-out Month/Day",
    title = "U.S. Ice-out Historical Day/Month Distributions by Latitude Grouping"
  ) +
  theme_ft_rc(grid="Y")

If you want to focus on individual lakes there’s a Shiny app for that (well one for the U.S. anyway).

After loading the package, just enter explore_us() at an R console and you’ll see something like this:

The leaflet view will zoom to each new lake selected and the graph will be updated as well.

Other Package News

The sergeant? package is reaching a stable point in the 0.8.0 branch (mostly due to David Severski’s tireless help finding bugs ?) and should be headed to CRAN soon. Get your issues or PRs in if you want them CRANdied.

I’ve finally updated the Java library dependencies in pdfboxjars? so pdfbox? will no longer cause GitHub to tell you or I that it is insecure.

There’s a new package dubbed reapr? that is aimed somewhere at the intersection of curl + httr + rvest. Fundamentally, it provides some coder-uplift when scraping data. The README has examples but here’s what you get on an initial scrape of this blog’s index page:

reapr::reap_url("http://rud.is/b")
##                Title: rud.is | "In God we trust. All others must bring data"
##         Original URL: http://rud.is/b
##            Final URL: https://rud.is/b/
##           Crawl-Date: 2019-01-17 19:51:09
##               Status: 200
##         Content-Type: text/html; charset=UTF-8
##                 Size: 50 kB
##           IP Address: 104.236.112.222
##                 Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
##                       label[1], noscript[1], section[1], title[1],
##                       aside[2], nav[2], ul[2], style[5], img[6],
##                       input[6], article[8], time[8], footer[9], h1[9],
##                       header[9], p[10], li[19], meta[20], div[31],
##                       script[40], span[49], link[53], a[94]
##           # Comments: 17
##   Total Request Time: 2.093s

The reap_url() function:

Uses httr::GET() to make web connections and retrieve content which enables it to behave more like an actual (non-javascript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible.
Returns a richer set of data. After the httr::response object is obtained many tasks are performed including:
- timestamping of the URL crawl
- extraction of the asked-for URL and the final URL (in the case
  of redirects)
- extraction of the IP address of the target server
- extraction of both plaintext and parsed (xml_document) HTML
- extraction of the plaintext webpage <title> (if any)
- generation of a dynamic list tags in the document which can be
  fed directly to HTML/XML search/retrieval function (which may
  speed up node discovery)
- extraction of the text of all comments in the HTML document
- inclusion of the full httr::response object with the returned
  object
- extraction of the time it took to make the complete request

I’m still wrestling with the API so definitely file issues with suggestions (wherever you’re most comfortable socially coding).

Speaking of IP addresses (bullet 3 above), I finally got some time to study the gdns? C library (a modern DNS API library) and created the clandnstine? package. The package name jeu de mots is due to the fact that the intent is to have it solely support DNS over TLS requests since regular DNS is plaintext, enables ISP spying/injection and generally fraught with peril. All forms of DNS lookups are supported. The catch is that you have to point it at a DNS over TLS-capable resolver. The package defaults to Quad9 (9.9.9.9) because I trust them more than Google or Cloudflare (btw: that’s not saying much as I trust used car salesfolks more than all three of them). Keep an eye (or RSS reader) peeled on $WORK blog over the coming weeks as I’ll have some analysis and data on a few hundred DNS over TLS endpoints you can use thanks to a new study developed by cow-orkers Jon Hart and Shan Skidar.

There also a toy package forecequotes? that is more “have fun with the cli & crayon packages” than anything else. But if you like Star Wars, random quote APIs and want to integrate richer command line interface output into your work, then definitely give it a peek.

Finally, I haven’t used R’s direct C interface in a while (since Rcpp is addictive and handy) and wanted to keep those skills fresh, so I made a wrapper to an old (in internet years) IP address trie C library. The underlying library is much slower than what we use in iptools but it works, does a bit more than its iptoos counterpart and covers data marshaling, external pointer handling, and attribute/class setting so it may be a half-decent reference package for using the R<->C bridge.

FIN

If you know of more/better ice-out data please drop an issue in the Bigelow Labs’ iceout repo and I’ll get it integrated. And, if you do your own ice-out exploration definitely blog about it, tell R Weekly and drop a note in the comments.

Here are links to all the mentioned packages grouped by social coding platform (so you can interact/collaborate wherever you feel most comfortable working):

sr.ht

GitLab

GitHub

splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration

2019-01-14 – 20:56
Posted in R, splash, splashr, web scraping
Comments (2)

The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days).

The major change from version 0.5.x (which never made it to CRAN) is a swap out of the reticulated docker package with the pure-R stevedore? package which will make it loads more compatible across the landscape of R installs as it removes a somewhat heavy dependency on a working Python environment (something quite challenging to consistently achieve in that fragmented language ecosystem).

Another addition is a set of new user agents for Android, Kindle, Apple TV & Chromecast as an increasing number of sites are changing what type of HTML (et. al.) they send to those and other alternative glowing rectangles. A more efficient/sane user agent system will also be introduced prior to the CRAN. Now’s the time to vote on existing issues or file new ones if there is a burning desire for new or modified functionality.

Since the Travis tests now work (they were failing miserably because of they Python dependency) I’ve integrated the changes from the 0.6.0 to the master branch but you can follow the machinations of the 0.6.0 branch up until CRAN release.

Roll Your Own Federal Government Shutdown-caused SSL Certificate Expiration Monitor in R

By now, even remote villages on uncharted islands in the Pacific know that the U.S. is in the midst of a protracted partial government shutdown. It’s having real impacts on the lives of Federal government workers but they aren’t the only ones. Much of the interaction Federal agencies have with the populace takes place online and the gateway to most of these services/information is a web site.

There are Federal standards that require U.S. government web sites to use SSL/TLS certificates and those certificates have something in common with, say, a loaf of bread you buy at the store: they expire. In all but the best of orgs — or we zany folks who use L e t ‘ s E n c r y p t and further propel internet denizens into a false sense of safety & privacy — renewing certificates involves manual labor/human intervention. For a good chunk of U.S. Federal agencies, those particular humans aren’t around. If a site’s SSL certificate expires and isn’t re-issued, it causes browsers to do funny things, like this:

Now, some of these sites are configured improperly in many ways, including them serving pages on both http and https (vs redirecting to https immediately upon receiving an http connection). But, browsers like Chrome will generally try https first and scare you into not viewing the site.

But, how big a problem could this really be? We can find out with a fairly diminutive R script that:

grabs a list of Federal agency domains (thanks to the GSA)
tries to make a SSL/TLS connection (via the openssl package) to the apex domain or www. prefixed apex domain
find the expiration date for the cert
do some simple date math

I’ve commented the script below pretty well so I’ll refrain from further blathering:

library(furrr)
library(openssl)
library(janitor)
library(memoise)
library(hrbrthemes)
library(tidyverse)

# fetch the GSA CSV:

read_csv(
  file = "https://raw.githubusercontent.com/GSA/data/master/dotgov-domains/current-federal.csv",
  col_types = "ccccccc"
) %>% 
  janitor::clean_names() -> xdf

# make openssl::download_ssl_cert calls safer in the even there
# are network/connection issues
.dl_cert <- possibly(openssl::download_ssl_cert, otherwise = NULL)

# memoise the downloader just in case we need to break the iterator
# below or another coding error causes it to break (the cached values
# will go away in a new R session or if you manually purge them)
dl_cert <- memoise::memoise(.dl_cert)

# we'll do this in parallel to save time (~1,200 domains)
plan(multiprocess)

# now follow the process described in the bullet points
future_map_dfr(xdf$domain_name, ~{

  who <- .x

  crt <- dl_cert(who)  

  if (!is.null(crt)) {
    # shld be the first cert and expires is second validity field
    expires <- crt[[1]]$validity[2] 
  } else {
    crt <- dl_cert(sprintf("www.%s", who)) # may be on www b/c "gov"
    if (!is.null(crt)) {
      expires <- crt[[1]]$validity[2]
    } else {
      expires <- NA_character_  
    }
  }

  # keep a copy of the apex domain, the expiration field and the cert
  # (in the event you want to see just how un-optimized the U.S. IT 
  # infrastructure is by how many stupid vendors they use for certs)
  tibble(
    who = who,
    expires = expires,
    cert = list(crt)
  )

}) -> cdf

Now, lets make strings into proper dates, count only the dates starting with the date of the shutdown to the end of 2019 (b/c the reckless human at the helm is borderline insane enough to do that) and plot the timeline:

filter(cdf, !is.na(expires)) %>% 
  mutate(
    expires = as.Date(
      as.POSIXct(expires, format="%b %d %H:%M:%S %Y")
    )
  ) %>% 
  arrange(expires) 
  count(expires) %>% 
  filter(
    expires >= as.Date("2018-12-22"), 
    expires <= as.Date("2019-12-31")
  ) %>% 
  ggplot(aes(expires, n)) +
  geom_vline(
    xintercept = Sys.Date(), linetype="dotted", size=0.25, color = "white"
  ) +
  geom_label(
    data = data.frame(), 
    aes(x = Sys.Date(), y = Inf, label = "Today"),
    color = "black", vjust = 1
  ) +
  geom_segment(aes(xend=expires, yend=0), color = ft_cols$peach) + 
  scale_x_date(name=NULL, date_breaks="1 month", date_labels="%b") +
  scale_y_comma("# Federal Agency Certs") +
  labs(title = "2019 Federal Agency ShutdownCertpoalypse") +
  theme_ft_rc(grid="Y")

Now, I’m unwarrantedly optimistic that this debacle could be over by the end of January. How many certs (by agency) could go bad by then?

left_join(cdf, xdf, by=c("who"="domain_name")) %>% 
  mutate(
    expires = as.Date(
      as.POSIXct(expires, format="%b %d %H:%M:%S %Y")
    )
  ) %>% 
  filter(
    expires >= as.Date("2018-12-22"),
    expires <= as.Date("2019-01-31")
  ) %>% 
  count(agency, sort = TRUE)
## # A tibble: 10 x 2
##    agency                                          n
##    <chr>                                       <int>
##  1 Government Publishing Office                    8
##  2 Department of Commerce                          4
##  3 Department of Defense                           3
##  4 Department of Housing and Urban Development     3
##  5 Department of Justice                           3
##  6 Department of Energy                            1
##  7 Department of Health and Human Services         1
##  8 Department of State                             1
##  9 Department of the Interior                      1
## 10 Department of the Treasury                      1

Ugh.

FIN

Not every agency is fully shutdown and not all workers in charge of cert renewals are furloughed (or being forced to work without pay). But, this one other area shows the possible unintended consequences of making rash, partisan decisions (something both Democrats & Republicans excel at).

You can find the contiguous R code at 2018-01-10-shutdown-certpocalypse.R and definitely try to explore the contents of those certificates.