rud.is

R Package Update: urlscan

The urlscan? package (an interface to the urlscan.io API) is now at version 0.2.0 and supports urlscan.io’s authentication requirement when submitting a link for analysis. The service is handy if you want to learn about the details — all the gory technical details — for a website.

For instance, say you wanted to check on r-project.org. You could manually go to the site, enter that into the request bar and wait for the result:

Or, you can use R!. First pick your preferred social coding site, read through the source (this is going to be new advice for every post starting with this one. Don’t blindly trust code from any social coding site) an then install the urlscan package:

devtools::install_git("https://git.sr.ht/~hrbrmstr/urlscan")
# or
devtools::install_gitlab("hrbrmstr/urlscan")
# or
devtools::install_github("hrbrmstr/urlscan")

Next, head on over back to urlscan.io and grab an API key (it’s free). Stick that in your ~/.Renviron under URLSCAN_API_KEY and then readRenviron("~/.Renviron") in your R console.

Now, let’s check out r-project.org.

library(urlscan)
library(tidyverse)

rproj <- urlscan_search("r-project.org")

rproj
##   URL Submitted: https://r-project.org/
##   Submission ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
## Submission Type: public
## Submission Note: Submission successful

There is more data in that rproj object but we have enough to get more detailed results back. Note that site will return an error when you use urlscan_result() if it hasn’t finished the analysis yet.

rproj_res <- urlscan_result("eb2a5da1-dc0d-43e9-8236-dbc340b53772", include_shot = TRUE)

rproj_res
##             URL: https://www.r-project.org/
##         Scan ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
##       Malicious: FALSE
##      Ad Blocked: FALSE
##     Total Links: 8
## Secure Requests: 19
##    Secure Req %: 100%

That rproj_res holds quite a bit of data and makes no assumptions about how you want to use it so you will need to do some wrangling with it to find out. The rproj_res$scan_result entry contains entries with the following information:

task: Information about the submission: Time, method, options, links to screenshot/DOM
page: High-level information about the page: Geolocation, IP, PTR
lists: Lists of domains, IPs, URLs, ASNs, servers, hashes
data: All of the requests/responses, links, cookies, messages
meta: Processor output: ASN, GeoIP, AdBlock, Google Safe Browsing
stats: Computed stats (by type, protocol, IP, etc.)

Let’s see how many domains the R Core folks are allowing to track you (if you’re not on legacy Windows OS you can find curlparse at https://git.sr.ht/~hrbrmstr/curlparse or git[la|hu]b/hrbrmstr/curlparse:

curlparse::domain(rproj_res$scan_result$lists$urls) %>% # you can use urltools::domain() instead of curlparse
  table(dnn = "domain") %>% 
  broom::tidy() %>% 
  arrange(desc(n))
## # A tibble: 7 x 2
##   domain                        n
##   <chr>                     <int>
## 1 platform.twitter.com          7
## 2 www.r-project.org             5
## 3 pbs.twimg.com                 3
## 4 syndication.twitter.com       2
## 5 ajax.googleapis.com           1
## 6 cdn.syndication.twimg.com     1
## 7 r-project.org                 1

Ironically, this is also how I learned that they allow Twitter to insecurely (no subresource integrity nor any content security policy) execute javascript in your browser (twitter javascript is blocked via multiple means at the hrbrmstr compound so I couldn’t see the widget).

Since I added the include_shot = TRUE option, we also get a page screenshot back (as a magick object):

rproj_res$screenshot

FIN

There’s tons of metadata to explore about web sites by using this package so jump in, kick the tyres, have fun! and file issues/PRs as needed.

Homebrew 2.0.0 Released == homebrewanalytics package updated

A major new release of Homebrew has landed and now includes support for Linux as well as Windows! via the Windows Subsystem for Linux. There are overall stability and speed improvements baked in as well. The aforelinked notification has all the info you need to see the minutiae. Unless you’ve been super-lax in updating, brew update will get you the latest release.

There are extra formulae analytics endpoints and the homebrewanalytics? R package has been updated to handle them. A change worth noting in the package is that all the API calls are memoised to avoid hammering the Homebrew servers (though the “API” is really just file endpoints and they aren’t big files but bandwidth is bandwidth). Use the facilities in the memoise package to invalidate the cache if you have long running scripts.

Use your favorite social coding site to install it (If I don’t maintain mirrors on your open social coding platform of choice just drop a note in the comments and I’ll start mirroring there as well):

devtools::install_git("https://git.sr.ht/~hrbrmstr/homebrewanalytics")
# or
devtools::install_gitlab("hrbrmstr/homebrewanalytics")
# or
devtools::install_github("hrbrmstr/homebrewanalytics")

The README and in-package manual pages provide basic examples of retrieving data. But we can improve upon those here, such as finding out the dependency distribution of Homebrew formulae:

library(hrbrthemes)
library(homebrewanalytics) # git.sr.hr/~hrbrmstr ; git[la|hu]b/hrbrmstr
library(tidyverse)

f <- brew_formulae()

mutate(f, n_dependencies = lengths(build_dependencies)) %>% 
  count(n_dependencies) %>% 
  mutate(n_dependencies = factor(n_dependencies)) %>% 
  ggplot() +
  geom_col(aes(n_dependencies, n), fill = ft_cols$slate, width = 0.65) +
  scale_y_comma("# formulae") +
  labs(
    x = "# Dependencies",
    title = "Dependency distribution for Homebrew formulae"
  ) +
  theme_ft_rc(grid="Y")

Given how long it sometimes takes to upgrade my various Homebrew installations I was surprised to see 0 be so prevalent, but one of the major changes in 2.0.0 is going to be more binary installs (unless you really need custom builds) so that is likely part of my experience, especially with the formulae I need to support cybersecurity and spatial operations.

We can also see which formuale are in the top 50%:

unlist(f$dependencies) %>% 
  table(dnn = "library") %>% 
  broom::tidy() %>% 
  arrange(desc(n)) %>% 
  mutate(pct = n/sum(n), cpct = cumsum(pct)) %>% 
  filter(cpct <= 0.5) %>% 
  mutate(pct = scales::percent(pct)) %>% 
  mutate(library = factor(library, levels = rev(library))) %>% 
  ggplot(aes(n, library)) +
  geom_segment(aes(xend=0, yend=library), color = ft_cols$slate, size=3.5) +
  geom_text(
    aes(x = (n+max(n)*0.005), label = sprintf("%s (%s)", n, pct)), 
    hjust = 0, size = 3, family = font_rc, color = ft_cols$gray
  ) +
  scale_x_comma(position = "top", limits=c(0, 500)) +
  labs(
    x = "# package using the library", y = NULL,
    title = "Top 50% of libraries used across Homebrew formulae"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_text(family = "mono"))

It seems openssl is pretty popular (not surprising but always good to see cybersecurity things at the top of good lists for a change)! macOS ships with an even more dreadful (I know that’s hard to imagine) default Python setup than usual so it being number 2 is not unexpected.

And, finally, we can also check on how frequently formulae are installed. Let’s look back on the last 90 days:

ggplot() +
  geom_density(
    aes(x = installs$count, y = stat(count)),
    color = ft_cols$slate, fill = alpha(ft_cols$slate, 1/2)
  ) +
  scale_x_comma("# install events", trans = "log10") +
  scale_y_comma("# formulae") +
  labs(
    title = "Homebrew Formulate 'Install Events' Distribution (Past 90 days)"
  ) +
  theme_ft_rc(grid="XY")

I’ll let you play with the package to find out who the heavy hitters are and explore more about the Homebrew ecosystem.

FIN

Kick the tyres. File issues & PRs and a hearty “Welcome!” to the Homebrew ecosystem for Linux and Windows users. My hope is that the WSL availability will eventually make it easier to develop for Windows systems and avoid the “download the kinda sketchy compiled windows libraries from github on package install” practice we have today.

If you crank out some analytics using the packages don’t forget to blog about it and drop a link in the comments!

Fast Static Maps Built with R

Luke Whyte posted an article (apologies for a Medium link) over on Towards Data Science showing how to use a command line workflow involving curl, node and various D3 libraries and javascript source files to build a series of SVG static maps. It’s well written and you should give it a read especially since he provides the code and data.

We can do all of that in R with the help of a couple packages and by using a free geocoding service which will also allow us to put more data on the map, albeit with some extra work due to it returning weird values for Hawaii locations.

library(albersusa) # git.sr.ht/~hrbrmstr/albersusa | git[la|hu]b/hrbrmstr/albersusa
library(rgeocodio) # git.sr.ht/~hrbrmstr/rgeocodio | git[la|hu]b/hrbrmstr/rgeocodio
library(tidyverse)

# the data url from the original blog
fil <- "https://query.data.world/s/awyahzfiikyoqi5ygpvauqslwmqltr"

read_csv(fil, col_types = "cd") %>% 
  select(area=1, pct=2) %>% 
  mutate(pct = pct/100) -> xdf # make percents proper percents

gc <- gio_batch_geocode(xdf$area)

The result of the geocoding is a data frame that has various confidences associated with the result. We’ll pick the top one for each and then correct for the errant Hawaii longitude it gives back:

map2_df(gc$query, gc$response_results, ~{
  out <- .y[1,,]
  out$area <- .x
  out
}) %>% 
  filter(!is.na(location.lat)) %>% 
  select(area, state = address_components.state, lat=location.lat, lon=location.lng) %>% 
  mutate(
    lat = ifelse(grepl("Honolu", area), 21.3069, lat),
    lon = ifelse(grepl("Honolu", area), -157.8583, lon)
  ) %>% 
  left_join(xdf) %>% 
  as_tibble() -> area_pct

area_pct
## # A tibble: 47 x 5
##    area                                        state   lat    lon   pct
##    <chr>                                       <chr> <dbl>  <dbl> <dbl>
##  1 McAllen-Edinburg-Mission, TX                TX     26.2  -98.1 0.102
##  2 Houston-The Woodlands-Sugar Land, TX        TX     29.6  -95.8 0.087
##  3 Santa Maria-Santa Barbara, CA               CA     34.4 -120.  0.081
##  4 Las Vegas-Henderson-Paradise, NV            NV     36.1 -115.  0.08 
##  5 Los Angeles-Long Beach-Anaheim, CA          CA     33.9 -118.  0.075
##  6 Miami-Fort Lauderdale-West Palm Beach, FL   FL     26.6  -80.1 0.073
##  7 Dallas-Fort Worth-Arlington, TX             TX     33.3  -98.4 0.069
##  8 Washington-Arlington-Alexandria, DC-VA-MD-… WV     39.2  -81.7 0.068
##  9 Bridgeport-Stamford-Norwalk, CT             CT     41.3  -73.1 0.067
## 10 San Jose-Sunnyvale-Santa Clara, CA          CA     37.4 -122.  0.065
## # … with 37 more rows

The albersusa package provides base maps with Alaska & Hawaii elided into a composite U.S. map. As such, we need to elide any points that are in Alaska & Hawaii:

us <- usa_composite()
us_map <- fortify(us, region="name")

hi <- select(filter(area_pct, state == "HI"), lon, lat)
(hi <- points_elided(hi))
area_pct[area_pct$state == "HI", c("lon", "lat")] <- hi

Then, it’s just a matter of using ggplot2:

ggplot() +
  geom_map(
    data = us@data, map=us_map,
    aes(map_id=name), 
    fill = "white", color = "#2b2b2b", size = 0.1
  ) +
  geom_point(
    data = area_pct, aes(lon, lat, size = pct), 
    fill = alpha("#b30000", 1/2), color = "#b30000", shape=21
  ) +
  ggalt::coord_proj(us_laea_proj) +
  scale_y_continuous(expand=c(0, 3)) +
  scale_radius(
    name = NULL, label = scales::percent_format(1)
  ) +
  labs(x = "Estimated percent of undocumented residents in U.S. metro areas. Source: Pew Research Center") +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  theme(axis.title.x = element_text(hjust=0.5, size = 8)) +
  theme(axis.title.y = element_blank()) +
  theme(panel.grid = element_blank()) +
  theme(legend.position = c(0.9, 0.3)) -> gg

ggsave(filename = "map.svg", device = "svg", plot = gg, height = 5, width = 7)

Unlike the post’s featured image (which has to be a bitmap…grrr) the resultant SVG is below:

FIN

There is absolutely nothing wrong with working where you’re most comfortable and capable and Luke definitely wields the command line and javascript incredibly well. This alternate way of doing things in R may help other data journalists who are more comfortable in R or want to increase their R knowledge replicate and expand upon Luke’s process.

If you’ve got alternate ways of doing this in R or even (gasp) Python, drop a note in the comments with a link to your blog so folks who are comfortable neither in R nor the command line can see even more ways of producing this type of content.

Quick Hit: Using seymour to Subscribe to your Git[la|hu]b Repo Issues in Feedly

The seymour? Feedly API package has been updated to support subscribing to RSS/Atom feeds. Previously the package was intended to just treat your Feedly as a data source, but there was a compelling use case for enabling subscription support: subscribing to code repository issues. Sure, there’s already email notice integration for repository issues on most social coding platforms but if you always have Feedly up (like I usually do) having issues aggregated into a Feedly category may be a better way to keep tabs on what’s going on.

If you use GitLab, that platform already has RSS feeds for public repositories. GitHub users have to use either RSSHub or gh-feed to do the same (note that you can host your own instance of either of those tools).

If you have more than a few repos and want to have their issues shunted to Feedly you could go manually enter them into the Feedly UI but that could be a pain, especially if “more than a few” is in the dozens or hundreds. But, we have R and can automate this. I’m providing an example for GitHub (since most readers art still stuck on that legacy platform) via the gh package but the same methods can be done for GitLab using the gitlabr? package.

First, we need to get the list of public GitHub issues you own:

library(gh)
library(purrr)
library(seymour) # git.sr.ht/~hrbrmstr/seymour, git[la|hu]b/hrbrmstr/seymour

gh::gh(
  "/user/repos", 
  visibility = "public",
  affiliation = "owner",
  sort = "created", direction = "desc", 
  .token = Sys.getenv("GITHUB_PAT") # see ?devtools::github_pat
) -> gh_repos

If you have more than 30 the gh package has a gh_next() function which will enable you to paginate through all your repos.

Now you need to choose between gh-feed or RSSHub and prepend a special URL prefix to a user/repo string to create a usable RSS URL. These are the prefixes:

We’ll use RSSHub for the remainder of the example.

Let’s do this first in base R. feedly_subscribe() takes an RSS/Atom URL as a parameter and optionally supports supplying a title and a vector of Feedly category names to help organize your new addition. The title will be automagically intuited by Feedly if not supplied. We’ll iterate over the return value of our call to the GitHub API and add subscribe to each repo.

do.call(
  rbind.data.frame,
  lapply(sapply(gh_repos, "[[", "name"), function(.x) {
    feedly_subscribe(
      feed_url = sprintf("https://rsshub.app/github/%s/%s", "hrbrmstr", .x),
      category = "github issues"
    )
  }) 
) -> res

The res value is a data frame that just has the resultant metadata about feed ids and where they’re located.

Here’s the same thing tidyverse-style:

map_chr(repos, "name") %>% 
  sprintf("https://rsshub.app/github/%s/%s", "hrbrmstr", .) %>% 
  map_df(feedly_subscribe, category = "github issues") -> res

FIN

There are two other addition to the seymour package: feedly_subscriptions(). This convenience function just pulls a data frame of feeds you subscribe to. The same data could be retrieved via the existing “stream” functions but this new function is faster and more targeted. The other one is feedly_categories() which you can use to identify the categories you have. The same data could be retrieved via the “collections” functions, but this function — again — is faster and more targeted.

As usual, kick the tyres and file issues/PRs as needed.

Acquisitions and Supply Chains: The Achilles’ heel of Product/Organizational Security

2019-01-29 – 09:50
Posted in Cybersecurity, Information Security
Leave a Comment

(A reminder to folks expecting “R”/”data science” content: the feed for that is at https://rud.is/b/category/r/feed/ if you don’t want to see the occasional non-R/datasci posts.)

Over at the $WORK blog we posted some research into the fairly horrible Cisco RV320/RV325 router vulnerability. The work blog is the work blog and this blog is my blog (i.e. opinions are my own, yada, yada, yada) and I felt compelled to post a cautionary take to vendors and organizations in general on how security issues can creep into your environment as a result of acquisitions and supply chains.

Looking purely at the evidence gathered from internet scans — which include SSL certificate info — and following the trail in the historical web archive one can make an informed, speculative claim that the weakness described in CVE-2019-1653 existed well before the final company logo ended up on the product.

It appears that NetKlass was at least producing the boards for this class of SMB VPN router and ultimately ended up supplying them to Linksys. A certain giant organization bought that company (and subsequently sold it off again) and it’s very likely this vulnerability ended up in said behemoth’s lap due to both poor supply chain management — in that Linksys seems to have done no security testing on the sourced parts — and, due to the acquisition, which caused those security issues to end up in a major brand’s product inventory.

This can happen to any organization involved in sourcing hardware/software from a third party and/or involved in acquiring another company. Receiving compliance-driven checkbox forms on the efficacy of the target security programs (or sw/hw) is not sufficient but is all too common a practice. Real due diligence involves kinda-trusting then verifying that the claims are accurate.

Rigorous product testing on the part of the original sourcing organization and follow-up assurance testing at the point of acquisition would have very likely caught this issue before it became a responsibly disclosed vulnerability.

Quick Hit: Automating Production Graphics Uploads in R Markdown Documents with googledrive

As someone who measures all kinds of things on the internet as part of his $DAYJOB, I can say with some authority that huge swaths of organizations are using cloud-services such as Google Apps, Dropbox and Office 365 as part of their business process workflows. For me, one regular component that touches the “cloud” is when I have to share R-generated charts with our spiffy production team for use in reports, presentations and other general communications.

These are typically project-based tasks and data science team members typically use git- and AWS-based workflows for gathering data, performing analyses and generating output. While git is great at sharing code and ensuring the historical integrity of our analyses, we don’t expect the production team members to be or become experts in git to use our output. They live in Google Drive and thanks to the googledrive? package we can bridge the gap between code and output with just a few lines of R code.

We use “R projects” to organize things and either use spinnable R scripts or R markdown documents inside those projects to gather, clean and analyze data.

For 2019, we’re using new, work-specific R markdown templates that have one new YAML header parameter:

params:
  gdrive_folder_url: "https://drive.google.com/drive/u/2/SOMEUSELESSHEXSTRING"

which just defines the Google Drive folder URL for the final output directory in the ☁️.

Next is a new pre-configured knitr chunk call at the start of these production chart-generating documents:

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE, dev = c("png", "cairo_pdf"),
  echo = FALSE,
  fig.retina = 2,
  fig.width = 10,
  fig.height = 6,
  fig.path = "prod/charts/"
)

since the production team wants PDF so they can work with it in their tools. Our testing showed that cairo_pdf produces the best/most consistent output, but PNGs show up better in the composite HTML documents so we use that order deliberately.

The real change is the consistent naming of the fig.path directory. By doing this, all we have to do is add a few lines (again, automatically generated) to the bottom of the document to have all the output automagically go to the proper Google Drive folder:

# Upload to production ----------------------------------------------------

googledrive::drive_auth()

# locate the folder
gdrive_prod_folder <- googledrive::as_id(params$gdrive_folder_url)

# clean it out
gdrls <- googledrive::drive_ls(gdrive_prod_folder)
if (nrow(gdrls) > 0) {
  dplyr::pull(gdrls, id) %>%
    purrr::walk(~googledrive::drive_rm(googledrive::as_id(.x)))
}

# upload new
list.files(here::here("prod/charts"), recursive = TRUE, full.names = TRUE) %>%
  purrr::walk(googledrive::drive_upload, path = gdrive_prod_folder)

Now, we never have to remember to drag documents into a browser and don’t have to load invasive Google applications onto our systems to ensure the right folks have the right files at the right time. We just have to use the new R markdown document type to generate a starter analysis document with all the necessary boilerplate baked in. Plus, .httr-oauth file is automatically ignored in .gitignore so there’s no information leakage to shared git repositories.

FIN

If you want to experiment with this, you can find a pre-configured template in the markdowntemplates package over at sr.ht, GitLab, or GitHub.

If you install the package you’ll be able to select this output type right from the new document dialog:

and new template will be ready to go with no copying, cutting or pasting.

Plus, since the Google Drive folder URL is an R markdown parameter, you can also use this in script automation (provided that you’ve wired up oauth correctly for those scripts).

hrbrthemes 0.6.0 on CRAN + Other In-Development Package News

Version 0.6.0 of the hrbrthemes? package should be hitting a CRAN mirror near you soon. Apart from some general documentation and code cleanup this release includes the dark theme folks have been seeing in blog posts and tweets over the past few months. It’s called theme_ft_rc() since it is an homage to the wonderful new chart theme developed by the @ft_data crew over at the Financial Times (you can see examples from their work here).

While there was nothing stopping folks from using the GitHub version, the CRAN release makes it more widely available. There are still intermittent issues with fonts for some folks which I’ll be working on for the next release.

Since you’ve already seen lots of examples of these charts I won’t just make a gratuitous example using the theme. I will, however, make some charts based on a new data package dubbed iceout?. The iceout package was originally conceived by Ben Tupper from the Bigelow Laboratory for Ocean Sciences. I keep an eye on fellow Mainer repositories and I did not realize (but should have known) that researches keep track of when inland bodies of water freeze and thaw. The package name is derived from the term used for the thaw measurements (“ice-out” or “ice-off”).

Before becoming obsessed with this data and getting the package to the current state it is in, the original codebase worked off of a USGS Lake Ice-Out Data for New England dataset that focused solely on New England and only went up to 2005. Some digging discovered that

Maine’s Department of Agriculture and Forestry maintains online records since 2003; and,
Minnesota’s Department of Natural Resources maintains a comprehensive database of records going back to the 1800’s.

But I hit the jackpot after discovering the U.S. National Snow & Ice Data Center’s Global Lake and River Ice Phenology dataset which:

… contains freeze and breakup dates and other ice cover descriptive data for 865 lakes and rivers. Of the 542 water bodies that have records longer than 19 years, 370 are in North America and 172 are in Eurasia; 249 have records longer than 50 years; and 66 longer than 100 years. A few have data prior to 1845. These data, from water bodies distributed around the Northern Hemisphere, allow analysis of broad spatial patterns as well as long-term temporal patterns.

So, I converted the original package to a data package containing all four of those datasets plus some interactive functions for pulling “live” data and a set of “builders” to regenerate the databases. Let’s take a quick look at what’s in the NSIDC data and the global coverage area:

library(iceout) # github/hrbrmstr/iceout
library(hrbrthemes) 
library(ggplot2)
library(dplyr)

data("nsidc_iceout")

glimpse(nsidc_iceout)
## Observations: 35,918
## Variables: 37
## $ lakecode                <chr> "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1", "ARAI1…
## $ lakename                <chr> "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Suwa", "Lake Su…
## $ lakeorriver             <chr> "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", "L", …
## $ season                  <chr> "1443-44", "1444-45", "1445-46", "1446-47", "1447-48", "1448…
## $ iceon_year              <dbl> 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, …
## $ iceon_month             <dbl> 12, 11, 12, 12, 11, 12, 12, 12, 12, 11, 12, 12, 12, 12, 12, …
## $ iceon_day               <dbl> 8, 23, 1, 2, 30, 8, 13, 8, 23, 28, 3, 5, 1, 5, 6, 20, 10, 15…
## $ iceoff_year             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_month            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ iceoff_day              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ duration                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ latitude                <dbl> 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.15, 36.1…
## $ longitude               <dbl> 138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.08, 138.…
## $ country                 <chr> "Japan", "Japan", "Japan", "Japan", "Japan", "Japan", "Japan…
## $ froze                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ obs_comments            <chr> "calendar correction for ice_on: -30 days of original data; …
## $ area_drained            <dbl> 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, 531, …
## $ bow_comments            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ conductivity_us         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ elevation               <dbl> 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, 759, …
## $ filename                <chr> "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ initials                <chr> "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARAI", "ARA…
## $ inlet_streams           <chr> "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", …
## $ landuse_code            <chr> "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAFO", "UAF…
## $ largest_city_population <dbl> 52000, 52000, 52000, 52000, 52000, 52000, 52000, 52000, 5200…
## $ max_depth               <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, …
## $ mean_depth              <dbl> 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, 4.7, …
## $ median_depth            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ power_plant_discharge   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ secchi_depth            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ shoreline               <dbl> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ surface_area            <dbl> 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, 12.9, …
## $ state                   <chr> "Nagano Prefecture", "Nagano Prefecture", "Nagano Prefecture…
## $ iceon_date              <date> 1443-12-08, 1444-11-23, 1445-12-01, 1446-12-02, 1447-11-30,…
## $ iceon_doy               <dbl> 342, 328, 335, 336, 334, 343, 347, 342, 357, 333, 337, 339, …
## $ iceout_date             <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ iceout_doy              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

maps::map("world", ".", exact = FALSE, plot = FALSE, fill = TRUE) %>%
  fortify() -> wrld

ggplot() + 
  ggalt::geom_cartogram(
    data = wrld, map = wrld, aes(long, lat, map_id=region), 
    fill="#3B454A",  color = "white", size = 0.125
  ) +
  geom_point(
    data = distinct(nsidc_iceout, lakeorriver, longitude, latitude),
    aes(longitude, latitude, fill = lakeorriver), 
    size = 1.5, color = "#2b2b2b", stroke = 0.125, shape = 21
  ) +
  scale_fill_manual(
    name = NULL, values = c("L"="#fdbf6f", "R"="#1f78b4"), labels=c("L" = "Lake", "R" = "River")
  ) +
  ggalt::coord_proj("+proj=wintri", ylim = range(nsidc_iceout$latitude, na.rm = TRUE)) +
  labs(title = "NSIDC Dataset Coverage") +
  theme_ft_rc(grid="") +
  theme(legend.position = c(0.375, 0.1)) +
  theme(axis.text = element_blank(), axis.title = element_blank())

W00t! Lots of data (though not all of the extra features are populated for all readings/areas)!

I think the reason the ice-out data garnered my obsession was how it can be used as another indicator that we are indeed in the midst of a climate transformation. Let’s look at the historical ice-out information for Maine inland bodies of water:

filter(nsidc_iceout, country == "United States", state == "ME") %>% 
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>% # we want the Y axis formatted as month-day so we choose a leap year to ensure we get leap dates (if any)
  ggplot(aes(iceoff_year, iceout_date)) +
  geom_point(aes(color = lakename), size = 0.5, alpha=1/4) +
  geom_smooth(aes(color = lakename), se=FALSE, method = "loess", size=0.25) +
  scale_y_date(date_labels = "%b-%d") +
  labs(
    x = NULL, y = "Ice-out Month/Day", color = NULL,
    title = "Historical Ice-out Data/Trends for Maine Inland Bodies of Water"
  ) +
  theme_ft_rc(grid="XY")

You can follow that code-pattern to look at other states. It’s also fun to look at the ice-out date distributions by latitude grouping:

filter(nsidc_iceout, !is.na(latitude) & !is.na(longitude) & !is.na(iceout_date)) %>% 
  filter(country == "United States") %>% 
  mutate(iceout_date = as.Date(format(iceout_date, "2020-%m-%d"))) %>% 
  mutate(lat_grp = cut(latitude, scales::pretty_breaks(5)(latitude), ordered_result = TRUE)) %>% 
  arrange(desc(iceoff_year)) %>% 
  ggplot() +
  ggbeeswarm::geom_quasirandom(
    aes(lat_grp, iceout_date, fill = iceoff_year), groupOnX = TRUE, 
    shape = 21, size =1, color = "white", stroke = 0.125, alpha=1/2
  ) +
  scale_y_date(date_labels = "%b-%d") +
  viridis::scale_fill_viridis(name = "Year", option = "magma") +
  labs(
    x = "Latitude Grouping", y = "Ice-out Month/Day",
    title = "U.S. Ice-out Historical Day/Month Distributions by Latitude Grouping"
  ) +
  theme_ft_rc(grid="Y")

If you want to focus on individual lakes there’s a Shiny app for that (well one for the U.S. anyway).

After loading the package, just enter explore_us() at an R console and you’ll see something like this:

The leaflet view will zoom to each new lake selected and the graph will be updated as well.

Other Package News

The sergeant? package is reaching a stable point in the 0.8.0 branch (mostly due to David Severski’s tireless help finding bugs ?) and should be headed to CRAN soon. Get your issues or PRs in if you want them CRANdied.

I’ve finally updated the Java library dependencies in pdfboxjars? so pdfbox? will no longer cause GitHub to tell you or I that it is insecure.

There’s a new package dubbed reapr? that is aimed somewhere at the intersection of curl + httr + rvest. Fundamentally, it provides some coder-uplift when scraping data. The README has examples but here’s what you get on an initial scrape of this blog’s index page:

reapr::reap_url("http://rud.is/b")
##                Title: rud.is | "In God we trust. All others must bring data"
##         Original URL: http://rud.is/b
##            Final URL: https://rud.is/b/
##           Crawl-Date: 2019-01-17 19:51:09
##               Status: 200
##         Content-Type: text/html; charset=UTF-8
##                 Size: 50 kB
##           IP Address: 104.236.112.222
##                 Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
##                       label[1], noscript[1], section[1], title[1],
##                       aside[2], nav[2], ul[2], style[5], img[6],
##                       input[6], article[8], time[8], footer[9], h1[9],
##                       header[9], p[10], li[19], meta[20], div[31],
##                       script[40], span[49], link[53], a[94]
##           # Comments: 17
##   Total Request Time: 2.093s

The reap_url() function:

Uses httr::GET() to make web connections and retrieve content which enables it to behave more like an actual (non-javascript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible.
Returns a richer set of data. After the httr::response object is obtained many tasks are performed including:
- timestamping of the URL crawl
- extraction of the asked-for URL and the final URL (in the case
  of redirects)
- extraction of the IP address of the target server
- extraction of both plaintext and parsed (xml_document) HTML
- extraction of the plaintext webpage <title> (if any)
- generation of a dynamic list tags in the document which can be
  fed directly to HTML/XML search/retrieval function (which may
  speed up node discovery)
- extraction of the text of all comments in the HTML document
- inclusion of the full httr::response object with the returned
  object
- extraction of the time it took to make the complete request

I’m still wrestling with the API so definitely file issues with suggestions (wherever you’re most comfortable socially coding).

Speaking of IP addresses (bullet 3 above), I finally got some time to study the gdns? C library (a modern DNS API library) and created the clandnstine? package. The package name jeu de mots is due to the fact that the intent is to have it solely support DNS over TLS requests since regular DNS is plaintext, enables ISP spying/injection and generally fraught with peril. All forms of DNS lookups are supported. The catch is that you have to point it at a DNS over TLS-capable resolver. The package defaults to Quad9 (9.9.9.9) because I trust them more than Google or Cloudflare (btw: that’s not saying much as I trust used car salesfolks more than all three of them). Keep an eye (or RSS reader) peeled on $WORK blog over the coming weeks as I’ll have some analysis and data on a few hundred DNS over TLS endpoints you can use thanks to a new study developed by cow-orkers Jon Hart and Shan Skidar.

There also a toy package forecequotes? that is more “have fun with the cli & crayon packages” than anything else. But if you like Star Wars, random quote APIs and want to integrate richer command line interface output into your work, then definitely give it a peek.

Finally, I haven’t used R’s direct C interface in a while (since Rcpp is addictive and handy) and wanted to keep those skills fresh, so I made a wrapper to an old (in internet years) IP address trie C library. The underlying library is much slower than what we use in iptools but it works, does a bit more than its iptoos counterpart and covers data marshaling, external pointer handling, and attribute/class setting so it may be a half-decent reference package for using the R<->C bridge.

FIN

If you know of more/better ice-out data please drop an issue in the Bigelow Labs’ iceout repo and I’ll get it integrated. And, if you do your own ice-out exploration definitely blog about it, tell R Weekly and drop a note in the comments.

Here are links to all the mentioned packages grouped by social coding platform (so you can interact/collaborate wherever you feel most comfortable working):

sr.ht

GitLab

GitHub

splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration

2019-01-14 – 20:56
Posted in R, splash, splashr, web scraping
Comments (2)

The splashr package [srht|GL|GH] — an alternative to Selenium for javascript-enabled/browser-emulated web scraping — is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days).

The major change from version 0.5.x (which never made it to CRAN) is a swap out of the reticulated docker package with the pure-R stevedore? package which will make it loads more compatible across the landscape of R installs as it removes a somewhat heavy dependency on a working Python environment (something quite challenging to consistently achieve in that fragmented language ecosystem).

Another addition is a set of new user agents for Android, Kindle, Apple TV & Chromecast as an increasing number of sites are changing what type of HTML (et. al.) they send to those and other alternative glowing rectangles. A more efficient/sane user agent system will also be introduced prior to the CRAN. Now’s the time to vote on existing issues or file new ones if there is a burning desire for new or modified functionality.

Since the Travis tests now work (they were failing miserably because of they Python dependency) I’ve integrated the changes from the 0.6.0 to the master branch but you can follow the machinations of the 0.6.0 branch up until CRAN release.