Skip navigation

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what they are using, most of these stack components leave a fingerprint — bits and pieces that we can piece together to identify them.

Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now and enter in the URL of your own blog or company and come back to see why I’ve mentioned it.

Back? Good.

If you explored that site a bit, you likely saw a link to their GitHub repo. There you’ll find JavaScript source code and even some browser plugins along with the core “fingerprint database”, apps.json?. If you poke around on their repo, you’ll find the repo’s wiki and eventually see that folks have made various unofficial ports to other languages.

By now, you’ve likely guessed that this blog post is introducing an R port of “wappalyzer” and if you have made such a guess, you’re correct! The rappalyzer? package is a “work-in-progress” but it’s usable enough to get some feedback on the package API, solicit contributors and show some things you can do with it.

Just rappalyze something already!

For the moment, one real function is exposed: rappalyze(). Feed it a URL string or an httr response object and — if the analysis was fruitful — you’ll get back a data frame with all the identified tech stack components. Let’s see what jquery.com? is made of:

devtools::install_github("hrbrmstr/rappalyzer")

library(rappalyzer)

rappalyze("https://jquery.com/")
## # A tibble: 8 x 5
##         tech              category match_type version confidence
##        <chr>                 <chr>      <chr>   <chr>      <dbl>
## 1 CloudFlare                   CDN    headers    <NA>        100
## 2     Debian     Operating Systems    headers    <NA>        100
## 3  Modernizr JavaScript Frameworks     script    <NA>        100
## 4      Nginx           Web Servers    headers    <NA>        100
## 5        PHP Programming Languages    headers  5.4.45        100
## 6  WordPress                   CMS       meta   4.5.2        100
## 7  WordPress                 Blogs       meta   4.5.2        100
## 8     jQuery JavaScript Frameworks     script  1.11.3        100

If you played around on the Wappalyzer site you saw that it “spiders” many URLs to try to get a very complete picture of components that a given site may use. I just used the main jQuery site in the above example and we managed to get quite a bit of information from just that interaction.

Wappalyzer (and, hence rappalyzer) works by comparing sets of regular expressions against different parts of a site’s content. For now, rappalyzer checks:

  • the site URL (the one after a site’s been contacted & content retrieved)
  • HTTP headers (an invisible set of metadata browsers/crawlers & servers share)
  • HTML content
  • scripts
  • meta tags

Wappalyzer-proper runs in a browser-context, so it can pick up DOM elements and JavaScript environment variables which can provide some extra information not present in the static content.

As you can see from the returned data frame, each tech component has one or more associated categories as well as a version number (if it could be guessed) and a value indicating how confident the guess was. It will also show where the guess came from (headers, scripts, etc).

A peek under the hood

I started making a pure R port of Wappalyzer but there are some frustrating “gotchas” with the JavaScript regular expressions that would have meant spending time identifying all the edge cases. Since the main Wappalyzer source is in JavaScript it was trivial to down-port it to a V8?-compatible version and expose it via an R function. A longer-term plan is to deal with the regular expression conversion, but I kinda needed the functionality sooner than later (well, wanted it in R, anyway).

On the other hand, keeping it in JavaScript has some advantages, especially with the advent of chimera? (a phantomjs alternative that can put a full headless browser into a V8 engine context). Using that would mean getting the V8 package ported to more modern version of the V8 source code which isn’t trivial, but doable. Using chimera would make it possible to identify even more tech stack components.

Note also that these tech stack analyses can be dreadfully slow. That’s not due to using V8. The ones that slow are slow in all the language ports. This is due to the way the regular expressions are checked against the site content and for just using regular expressions to begin with. I’ve got some ideas to speed things up and may also introduce some “guide rails” to prevent lengthy operations and avoiding some checks if page characteristics meet certain criteria. Drop an issue if you have ideas as well.

Why is this useful?

Well, you can interactively see what a site uses like we did above. But, we can also look for commonalities across a number of sites. We’ll do this in a small example now. You can skip the expository and just work with the RStudio project if that’s how you’d rather roll.

Crawling 1,000 orgs

I have a recent-ish copy of a list of Fortune 1000 companies and their industry sectors along with their website URLs. Let’s see how many of those sites we can still access and what we identify.

I’m going to pause for just a minute to revisit some “rules” in the context of this operation.

Specifically, I’m not:

  • reading each site’s terms of service/terms & conditions
  • checking each site’s robots.txt file
  • using a uniquely identifying non-browser string crawler user-agent (since we need the site to present content like it would to a browser)

However, I’m also not:

  • using the site content in any way except to identify tech stack components (something dozens of organizations do)
  • scraping past the index page HTML content, so there’s no measurable resource usage

I believe I’ve threaded the ethical needle properly (and, I do this for a living), but if you’re uncomfortable with this you should use a different target URL list.

Back to the code. We’ll need some packages:

library(hrbrthemes)
library(tidyverse)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(urltools)
library(rappalyzer) # devtools::install_github("hrbrmstr/rappalyzer")
library(rprojroot)
# I'm also using a user agent shortcut from the splashr package which is on CRAN

Now, I’ve included data in the RStudio project GH repo since it can be used later on to test out changes to rappalyzer. That’s the reason for the if/file.exists tests. As I’ve said before, be kind to sites you scrape and cache content whenever possible to avoid consuming resources that you don’t own.

Let’s take a look at our crawler code:

rt <- find_rstudio_root_file()

if (!file.exists(file.path(rt, "data", "f1k_gets.rds"))) {

  f1k <- read_csv(file.path(rt, "data", "f1k.csv"))

  targets <- pull(f1k, website)

  results <- list()
  errors <- list()

  OK <- function(res) {
    cat(".", sep="")
    results <<- c(results, list(res))
  }

  BAD <- function(err_msg) {
    cat("X", sep="")
    errors <<- c(errors, list(err_msg))
  }

  pool <- multi_set(total_con = 20)

  walk(targets, ~{
    multi_add(
      new_handle(url = .x, useragent = splashr::ua_macos_chrome, followlocation = TRUE, timeout = 60),
      OK, BAD
    )
  })

  multi_run(pool = pool)

  write_rds(results, file.path(rt, "data", "f1k_gets.rds"), compress = "xz")

} else {
  results <- read_rds(file.path(rt, "data", "f1k_gets.rds"))
}

If you were expecting read_html() and/or GET() calls you’re likely surprised at what you see. We’re trying to pull content from a thousand web sites, some of which may not be there anymore (for good or just temporarily). Sequential calls would need many minutes with error handling. We’re using the asynchronous/parallel capabilities in the curl? package to setup all 1,000 requests with the ability to capture the responses. There’s a hack-ish “progress” bar that uses . for “good” requests and X for ones that didn’t work. It still takes a little time (and you may need to tweak the total_con value on older systems) but way less than sequential GETs would have, and any errors are implicitly handled (ignored).

The next block does the tech stack identification:

if (!file.exists(file.path(rt, "data", "rapp_results.rds"))) {

  results <- keep(results, ~.x$status_code < 300)
  map(results, ~{
    list(
      url = .x$url,
      status_code = .x$status_code,
      content = .x$content,
      headers = httr:::parse_headers(.x$headers)
    ) -> res
    class(res) <- "response"
    res
  }) -> results

  # this takes a *while*
  pb <- progress_estimated(length(results))
  map_df(results, ~{
    pb$tick()$print()
    rap_df <- rappalyze(.x)
    if (nrow(rap_df) > 0) rap_df <- mutate(rap_df, url = .x$url)
  }) -> rapp_results

  write_rds(rapp_results, file.path(rt, "data", "rapp_results.rds"))

} else {
  rapp_results <- read_rds(file.path(rt, "data", "rapp_results.rds"))
}

We filter out non-“OK” (HTTP 200-ish response) curl results and turn them into just enough of an httr request object to be able to work with them.

Then we let rappalyzer work. I had time to kill so I let it run sequentially. In a production or time-critical context, I’d use the future? package.

We’re almost down to data we can play with! Let’s join it back up with the original metadata:

left_join(
  mutate(rapp_results, host = domain(rapp_results$url)) %>%
    bind_cols(suffix_extract(.$host))
  ,
  mutate(f1k, host = domain(website)) %>%
    bind_cols(suffix_extract(.$host)),
  by = c("domain", "suffix")
) %>%
  filter(!is.na(name)) -> rapp_results

length(unique(rapp_results$name))
## [1] 754

Note that some orgs seem to have changed hands/domains so we’re left with ~750 sites to play with (we could have tried to match more, but this is just an exercise). If you do the same curl-fetching exercise on your own, you may get fewer or more total sites. Internet crawling is fraught with peril and it’s not really possible to conveniently convey 100% production code in a blog post.

Comparing “Fortune 1000” org web stacks

Let’s see how many categories we picked up:

xdf <- distinct(rapp_results, name, sector, category, tech)

sort(unique(xdf$category))
##  [1] "Advertising Networks"  "Analytics"             "Blogs"
##  [4] "Cache Tools"           "Captchas"              "CMS"
##  [7] "Ecommerce"             "Editors"               "Font Scripts"
## [10] "JavaScript Frameworks" "JavaScript Graphics"   "Maps"
## [13] "Marketing Automation"  "Miscellaneous"         "Mobile Frameworks"
## [16] "Programming Languages" "Tag Managers"          "Video Players"
## [19] "Web Frameworks"        "Widgets"

Plenty to play with!

Now, what do you think the most common tech component is across all these diverse sites?

count(xdf, tech, sort=TRUE)
## # A tibble: 115 x 2
##                        tech     n
##                       <chr> <int>
##  1                   jQuery   572
##  2       Google Tag Manager   220
##  3                Modernizr   197
##  4        Microsoft ASP.NET   193
##  5          Google Font API   175
##  6        Twitter Bootstrap   162
##  7                WordPress   150
##  8                jQuery UI   143
##  9             Font Awesome   118
## 10 Adobe Experience Manager    69
## # ... with 105 more rows

I had suspected it was jQuery (and hinted at that when I used it as the initial rappalyzer example). In a future blog post we’ll look at just how vulnerable orgs are based on which CDNs they use (many use similar jQuery and other resource CDNs, and use them insecurely).

Let’s see how broadly each tech stack category is used across the sectors:

group_by(xdf, sector) %>%
  count(category) %>%
  ungroup() %>%
  arrange(category) %>%
  mutate(category = factor(category, levels=rev(unique(category)))) %>%
  ggplot(aes(category, n)) +
  geom_boxplot() +
  scale_y_comma() +
  coord_flip() +
  labs(x=NULL, y="Tech/Services Detected across Sectors",
       title="Usage of Tech Stack Categories by Sector") +
  theme_ipsum_rc(grid="X")

That’s alot of JavaScript. But, there’s also large-ish use of CMS, Web Frameworks and other components.

I wonder what the CMS usage looks like:

filter(xdf, category=="CMS") %>%
  count(tech, sort=TRUE)
## # A tibble: 15 x 2
##                        tech     n
##                       <chr> <int>
##  1                WordPress    75
##  2 Adobe Experience Manager    69
##  3                   Drupal    68
##  4               Sitefinity    26
##  5     Microsoft SharePoint    19
##  6                 Sitecore    14
##  7                      DNN    11
##  8                Concrete5     6
##  9     IBM WebSphere Portal     4
## 10        Business Catalyst     2
## 11                    Hippo     2
## 12                TYPO3 CMS     2
## 13               Contentful     1
## 14              Orchard CMS     1
## 15             SilverStripe     1

Drupal and WordPress weren’t shocking to me (again, I do this for a living) but they may be to others when you consider this is the Fortune 1000 we’re talking about.

What about Web Frameworks?

filter(xdf, category=="Web Frameworks") %>%
  count(tech, sort=TRUE)
## # A tibble: 7 x 2
##                          tech     n
##                         <chr> <int>
## 1           Microsoft ASP.NET   193
## 2           Twitter Bootstrap   162
## 3             ZURB Foundation    61
## 4               Ruby on Rails     2
## 5            Adobe ColdFusion     1
## 6                    Kendo UI     1
## 7 Woltlab Community Framework     1

Yikes! Quite a bit of ASP.NET. But, many enterprises haven’t migrated to more modern, useful platforms (yet).

FIN

As noted earlier, the data & code are on GitHub. There are many questions you can ask and answer with this data set and definitely make sure to share any findings you discover!

We’ll continue exploring different aspects of what you can glean from looking at web sites in a different way in future posts.

I’ll leave you with one interactive visualization that lets you explore the tech stacks by sector.

devtools::install_github("jeromefroe/circlepackeR")
library(circlepackeR)
library(data.tree)
library(treemap)

cpdf <- count(xdf, sector, tech)

cpdf$pathString <- paste("rapp", cpdf$sector, cpdf$tech, sep = "/")
stacks <- as.Node(cpdf)

circlepackeR(stacks, size = "n", color_min = "hsl(56,80%,80%)",
             color_max = "hsl(341,30%,40%)", width = "100%", height="800px")

I occasionally hang out on StackOverflow and often use an answer as an opportunity to fill a package void for a particular need. docxtractr and qrencoder are two (of many) packages that were birthed from SO answers. I usually try to answer with inline code first then expand the functionality into a package (if warranted). Some make it to CRAN (like those two), others stay on GitHub.

This (short) post is about two new ones: webhose? and pigeon?.

The webhose package is an API interface package to https://webhose.io/, which is an interesting service that scrapes the web & “dark web” and provides a short but handy API to retrieving the content using a fairly intuitive query language.

The pigeon package is a hastily-hacked-together wrapper around pgn-extract, a cross-platform utility written in C for working with chess game data in PGN format.

I’m not going to have any time to round out the corners on either of those packages but will gladly make time to help anyone who wants to jump on board to either (or both!) of them.

Working on either package will let you get your feet wet (or, wetter) with R package development. They both need:

  • more functions + docs!
  • test harness setup
  • Travis-CI, code coverage & AppVeyor configs

Working on webhose will give you experience dealing with HTTP APIs (and their API is super clean to work with) and possibly introduce you to an area of research you’re not already in.

Working on pigeon will give you experience integrating C[++] & R code (and the C-library I’ve hack-wrapped definitely needs some care & feeding to ensure no memory leaks are present).

Neither is “mission critical”. The world will gladly go on w/o either of them. But, if you wanted a judgement-free place to hone your R package skills or try/learn new things (and ask questions along the way), file an issue, drop a note in the comments or hit me up on Twitter. If you’re an experienced R package-r and want to “own” either of these, that’s ? as well!

I’d encourage all nascent R coders to adopt “SODD” and use SO as a place to hone your skills while you help others (and you don’t need to write a package for every answer :-).

spiderbar, spiderbar 
Reads robots rules from afar.
Crawls the web, any size; 
Fetches with respect, never lies.
Look Out! 
Here comes the spiderbar.

Is it fast? 
Listen bud, 
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there 
There goes the spiderbar. 

(Check the end of the post if you don’t recognize the lyrical riff.)

Face front, true believer!

I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.

But (there’s always a “but“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my “crawl delay” forthcoming post).

I needed something faster for my bulk Crawl-Delay analysis which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package spiderbar which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name rep).

How much faster?

I’m glad you asked!

Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?

library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

rob <- get_robotstxt("imdb.com")

microbenchmark(

  robotstxt = {
    x <- parse_robotstxt(rob)
    x$crawl_delay
  },

  spiderbar = {
    y <- robxp(rob)
    crawl_delays(y)
  }

) -> mb1

update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

As you can see, it’s just a tad bit faster ?.

Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.

But, I don’t care about Crawl-Delay!

OK, fine. Do you care about fetchability? We can speed that up, too!

rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(

  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },

  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }

) -> mb2

autoplot(mb2) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Vectorized or it didn’t happen.

(Gosh, even Spider-Man got more respect!)

OK, this is a tough crowd, but we’ve got vectorization covered as well:

microbenchmark(

  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },

  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }

) -> mb3

autoplot(mb3) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Excelsior!

Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

You can check out the code and leave package questions or comments on GitHub.

(Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)

International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way on how to use the web scraping powers of R responsibly to collect data on and explore modern-day pirate encounters.

Scouring The Seas Web For Pirate Data

Interestingly enough, there are many of sources for pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

(site png snapshot taken with splashr)

I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

Time to pillage some details!

But…Can We Really Do It?

I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/

Ahoy! We’ve got a license to pillage!

But, we don’t have a license to abuse their site.

While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still pirate crawl responsibly and use a 5 second delay between requests:

s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")

There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.

I also like to save off the crawl results so I can go back to the raw file (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here, first using raw httr response objects (including any “broken” ones) and then filtering out the “complete” responses and saving them in WARC format so it’s in a more common format for sharing with others who may not use R.

Digging For Treasure

Did I mention that while the site looks like it’s easy to scrape it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary sailors crawlers in a pit of despair. The UX is built dynamically from on-page javascript content, a portion of which is below:

Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:

# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines, do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turing the ugly returned list content into something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")

The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years so this will help make the data wrangling idiom more generic.

Next, I setup a cols object because we’re going to be extracting data from text as text and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code (for small mumnging). You’ll note at the end of the munging pipeline I still need to do some manual conversions.

Now we can iterate over the good (complete) responses.

The purrr::safely function shoves the real httr response in result so we focus on that then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from said evaluation.

Because ICC used the same Joomla plugin over the years, the data is uniform, but also can contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.

That whole process goes really quickly, but why not save off the clean data at the end for good measure?

Gotta Have A Pirate Map

Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data oh github), but here are a few views. First, just some simple counts per month:

mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")

And, finally, a map showing pirate encounters but colored by year:

world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")

Taking Up The Mantle of the Dread Pirate Hrbrmstr

Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on Github and don’t hesitate to ask questions in the comments or file an issue. If you make something blog it! Share your ideas and creations with the rest of the R (or other language) communities!

I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet:

Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick.

I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments!

(If folks want some exposition here, let me know in the comments and I’ll rig up an R Markdown document with more step-by-step details.)

Full RStudio project file (with pre-cached data) is on GitHub.

library(rprojroot)
library(sf)
library(magick)
library(tidyverse) # NOTE: Needs github version of ggplot2

root <- find_rstudio_root_file()

# "borrow" the files from SmokyMountains.com, but be nice and cache them to
# avoid hitting their web server for every iteration

c("https://smokymountains.com/wp-content/themes/smcom-2015/to-delete/js/us.json",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage2.tsv",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage-2017.csv") %>%
  walk(~{
    sav_tmp <- file.path(root, "data", basename(.x))
    if (!file.exists(sav_tmp)) download.file(.x, sav_tmp)
  })

# next, we read in the GeoJSON file twice. first, to get the counties
states_sf <- read_sf(file.path(root, "data", "us.json"), "states", stringsAsFactors = FALSE)

# we only want the continental US
states_sf <- filter(states_sf, !(id %in% c("2", "15", "72", "78")))

# it doesn't have a CRS so we give it one
st_crs(states_sf) <- 4326

# I ran into hiccups using coord_sf() to do this, so we convert it to Albers here
states_sf <- st_transform(states_sf, 5070)


# next we read in the states
counties_sf <- read_sf(file.path(root, "data", "us.json"), "counties", stringsAsFactors = FALSE)
st_crs(counties_sf) <- 4326
counties_sf <- st_transform(counties_sf, 5070)

# now, we read in the foliage data
foliage <- read_tsv(file.path(root, "data", "foliage-2017.csv"),
                    col_types = cols(.default=col_double(), id=col_character()))

# and, since we have a lovely sf tidy data frame, bind it together
left_join(counties_sf, foliage, "id") %>%
  filter(!is.na(rate1)) -> foliage_sf

# now, we do some munging so we have better labels and so we can
# iterate over the weeks
gather(foliage_sf, week, value, -id, -geometry) %>%
  mutate(value = factor(value)) %>%
  filter(week != "rate1") %>%
  mutate(week = factor(week,
                       levels=unique(week),
                       labels=format(seq(as.Date("2017-08-26"),
                                         as.Date("2017-11-11"), "1 week"),
                                     "%b %d"))) -> foliage_sf

# now we make a ggplot object for each week and save it out to a png
pb <- progress_estimated(nlevels(foliage_sf$week))
walk(1:nlevels(foliage_sf$week), ~{

  pb$tick()$print()

  xdf <- filter(foliage_sf, week == levels(week)[.x])

  ggplot() +
    geom_sf(data=xdf, aes(fill=value), size=0.05, color="#2b2b2b") +
    geom_sf(data=states_sf, color="white", size=0.125, fill=NA) +
    viridis::scale_fill_viridis(
      name=NULL,
      discrete = TRUE,
      labels=c("No Change", "Minimal", "Patchy", "Partial", "Near Peak", "Peak", "Past Peak"),
      drop=FALSE
    ) +
    labs(title=sprintf("Foliage: %s ", unique(xdf$week))) +
    ggthemes::theme_map() +
    theme(panel.grid=element_line(color="#00000000")) +
    theme(panel.grid.major=element_line(color="#00000000")) +
    theme(legend.position="right") -> gg

  ggsave(sprintf("%02d.png", .x), gg, width=5, height=3)

})

# we read them all back in and animate the foliage
sprintf("%02d.png", 1:nlevels(foliage_sf$week)) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(1)
insert(post, "{ 'standard_disclaimer' : 'My opinion, not my employer\'s' }")

This is a post about the fictional company FredCo. If the context or details presented by the post seem familiar, it’s purely coincidental. This is, again, a fictional story.

Let’s say FredCo had a pretty big breach that (fictionally) garnered media, Twitterverse, tech-world and Government-level attention and that we have some spurious details that let us sit back in our armchairs to opine about. What might have helped create the debacle at FredCo?

Despite (fictional) endless mainstream media coverage and a good chunk of ‘on background’ infosec-media clandestine blatherings we know very little about the breach itself (though it’s been fictionally, officially blamed on failure to patch Apache Struts). We know even less (fictionally officially) about the internal reach of the breach (apart from the limited consumer impact official disclosures). We know even less than that (fictionally officially) about how FredCo operates internally (process-wise).

But, I’ve (fictionally) seen:

  • a detailed breakdown of the number of domains, subdomains, and hosts FredCo “manages”.
  • the open port/service configurations of the public components of those domains
  • public information from individuals who are more willing to (fictionally) violate the CFAA than I am to get more than just port configuration information
  • a 2012/3 SAS 1 Type II report about FredCo controls
  • testimonies from FredCo execs regarding efficacy of $SECURITY_TECHOLOGY and 3 videos purporting to be indicative of expert opine on how to use BIIGG DATERZ to achieve cybersecurity success
  • the board & management structure + senior management bonus structures, complete with incentive-based objectives they were graded on

so, I’m going to blather a bit about how this fictional event should finally tear down the Potemkin village that is the combination of the Regulatory+Audit Industrial Complex and the Cybersecurity Industrial Complex.

“Tear down” with respect to the goal being to help individuals understand that a significant portion of organizations you entrust with your data are not incentivized or equipped to protect your data and that these same conditions exist in more critical areas — such as transportation, health care, and critical infrastructure — and you should expect a failure on the scale of FredCo — only with real, harmful impact — if nothing ends up changing soon.

From the top

There is boilerplate mention of “security” in the objectives of the senior executives between 2015 & 2016 14A filings:

  • CEO: “Employing advanced analytics and technology to help drive client growth, security, efficiency and profitability.”
  • CFO: “Continuing to advance and execute global enterprise risk management processes, including directing increased investment in data security, disaster recovery and regulatory compliance capabilities.”
  • CLO: “Continuing to refine and build out the Company’s global security organization.”
  • President, Workforce Solutions: None
  • CHRO: None
  • President – US Information Services: None

You’ll be happy to know that they all received either “Distinguished” or “Exceeds” on their appraisals and received a multiplier of their bonus & compensation targets as a result.

Furthermore, there is no one in the make-up of FredCo’s board of directors who has shown an interest or specialization in cybersecurity.

From the camera-positioned 50-yard line on instant replay, the board and shareholders of FredCo did not think protection of your identity and extremely personal information was important enough to include on three top executive directives and performance measure and was given little more than boilerplate mention for others. Investigators who look into FredCo’s breach should dig deep into the last decade of the detailed measures for these objectives. I have first-hand experience how these types of HR processes are managed in large orgs, which is why I’m encouraging this area for investigation.

“Security” is a terrible term, but it only works when it is an emergent property of the business processes of an organization. That means it must be contextual for every worker. Some colleagues suggest individual workers should not have to care about cybersecurity when making decisions or doing work, but even minimum-wage retail and grocery store clerks are educated about shoplifting risks and are given tools, tips and techniques to prevent loss. When your HR organizations is not incentivized to help create and maintain a cybersecurity-aware culture from the top you’re going to have problems, and when there are no cyberscurity-oriented targets for the CIO or even business process owners, don’t expect your holey screen door to keep out predators.

Awwwdit, Part I

NOTE: I’m not calling out any particular audit organization as I’ve only seen one fictional official report.

The Regulatory+Audit Industrial Complex is a lucrative business cabal. Governments and large business meta-agencies create structures where processes can be measured, verified and given a big green ✅. This validation exercise is generally done in one or more ways:

  • simple questionnaire, very high level questions, no veracity validation
  • more detailed questionnaire, mid-level questions, usually some in-person lightweight checking
  • detailed questionnaire, but with topics that can be sliced-and-diced by the legal+technical professions to mean literally anything, measured in-person by (usually) extremely junior reviewers with little-to-no domain expertise who follow review playbooks, get overwhelmed with log entries and scope-refinement+reduction and who end up being steered towards “important” but non-material findings

Sure, there are good audits and good auditors, but I will posit they are the rare diamonds in a bucket of zirconia.

We need to cover some technical ground before covering this further, though.

Shocking Struts

We’ll take the stated breach cause at face-value: failure to patch an remote-accessible vulnerability with Apache Struts. This was presented as the singular issue enabling attackers to walk (with crutches) away with scads of identify-theft-enabling personal data, administrator passwords, database passwords, and the recipe for the winning entry in the macaroni salad competition at last year’s HR annual picnic. Who knew one Java library had so much power!

We don’t know the architecture of all the web apps at FredCo. However, your security posture should not be a Jenga game tower, easily destroyed by removing one peg. These are all (generally) components of externally-facing applications at the scale of FredCo:

  • routers
  • switches
  • firewalls
  • load balancers
  • operating systems
  • application servers
  • middleware servers
  • database servers
  • customized code

These are mimicked (to varying levels of efficacy) across:

  • development
  • test
  • staging
  • production

environments.

They may coexist (in various layers of the network) with:

  • HR systems
  • Finance systems
  • Intranet servers
  • Active Directory
  • General user workstations
  • Executive workstations
  • Developer workstations
  • Mobile devices
  • Remote access infrastructure (i.e. VPNs)

A properly incentivized organization ensures there are logical and physical separation between/isolation of “stuff that matters” and that varying levels of authentication & authorization are applied to ensure access is restricted.

Keeping all that “secure” requires:

  • managing thousands of devices (servers, network components, laptops, desktops, mobile devices)
  • managing thousands of identities
  • managing thousands of configurations across systems, networks and devices
  • managing hundreds to thousands of connections between internal and external networks
  • managing thousands of rules
  • managing thousands of vulnerabilities (as they become known)
  • managing a secure development life cycle across hundreds or thousands of applications

Remember, though, that FredCo ostensibly managed all of that well and the data loss was solely due to one Java library.

If your executives (all of them) and workers (all of them) are not incentivized with that list in mind, you will have problems, but let’s talk about the security challenges back in the context of the audit role.

Awwwdit, Part II

The post is already long, so we’ll make this quick.

If I dropped you off — yes, you, because you’re likely as capable as the auditors mentioned in the previous section on audit — into that environment once a year, do you think you’d be able to ferret out issues based on convoluted network diagrams, poorly documented firewall rules and source code, non-standard checklists of user access management processes?

Let’s say I dropped you in months before the known Struts vulnerability and re-answer the question.

The burden placed on internal and — especially — external auditors is great and they are pretty much set up for failure from engagement number one.

Couple IT complexity with the fact that many orgs like FredCo aren’t required to do more than ensure financial reporting processes are ?.

But, even if there were more technical, security-oriented audits performed, you’d likely have ten different report findings by as many firms or auditors, especially if they were point-in-time audits. Furthermore, FredCo has has decades of point-in-time audits but hundreds of auditors and dozens of firms. The conditions of the breach were likely not net-new, so how did decades of systemic IT failures go unnoticed by this cabal?

IT audit functions are a multi-billion dollar business. FredCo is partially the result of the built-in cracks in the way verification is performed in orgs. In other words, I posit the Regulatory+Audit Industrial Complex bears some of the responsibility for FredCo’s breach.

Divisive Devices

From the (now removed) testimonials & videos, it was clear there may have been a “blinky light” problem in the mindset of those responsible for cybersecurity at FredCo. Relying solely on the capabilities of one or more devices (they are usually appliances with blinky lights) and thinking that storing petabytes of log data are going to stop “bad guys’ is a great recipe for a breach parfait.

But, the Cybersecurity Industrial Complex continues to dole out LED-laden boxes with the fervor of a U.S. doctor handing out opioids. Sure, they are just giving orgs what they want, but it doesn’t make it responsible behaviour. Just like the opioid problem, the “device” issue is likely causing cyber-sickness in more organizations that you’d like to admit. You may even know someone who works at an org with a box-addition.

I posit the Cybersecurity Industrial Complex bears some of the responsibility for FredCo’s breach, especially when you consider the hundreds of marketing e-mails I’ve seen post-FredCo breach telling me how CyberBox XJ9-11 would have stopped FredCo’s attackers cold.

A Matter of Trust

If removing a Struts peg from FredCo’s IT Jenga board caused the fictional tower to crash:

  • What do you think the B2B infrastructure looks like?
  • How do you think endpoints are managed?
  • What isolation, segmentation and access controls really exist?
  • How effective do you think their security awareness program is?
  • How many apps are architected & managed as poorly as the breached one?
  • How many shadow IT deployments exist in the ☁️ with your data in it?
  • How can you trust FredCo with anything of importance?

Fictional FIN

In this fictional world I’ve created one ending is:

  • all B2B connections to FredCo have been severed
  • lawyers at a thousand firms are working on language for filings to cancel all B2B contracts with FredCo
  • FredCo was de-listed from exchanges
  • FredCo executives are defending against a slew of criminal and civil charges
  • The U.S. Congress and U.K. Parliament have come together to undertake a joint review of regulatory and audit practices spanning both countries (since it impacted both countries and the Reg+Audit cabal spans both countries they decided to save time and money) resulting in sweeping changes
  • The SEC has mandated detailed cybersecurity objectives be placed on all senior management executives at all public companies and have forced results of those objectives assessments to be part of a new filing requirement.
  • The SEC has also mandated that at least one voting board member of public companies must have demonstrated experience with cybersecurity
  • The FTC creates and enforces standards on cybersecurity product advertising practices
  • You have understood that nobody has your back when it comes to managing your sensitive, personal data and that you must become an active participant in helping to ensure your elected representatives hold all organizations accountable when it comes to taking their responsibilities seriously.

but, another is:

  • FredCo’s stock bounces back
  • FredCo loses no business partners
  • FredCo’s current & former execs faced no civil or criminal charges
  • Congress makes a bit of opportunistic, temporary bluster for the sake of 2018 elections but doesn’t do anything more than berate FredCo publicly
  • You’re so tired of all these breaches and data loss that you go back to playing “Clash of Clans” on your mobile phone and do nothing.

I’ve blathered about trust before 1 2, but said blatherings were in a “what if” context. Unfortunately, the if has turned into a when, which begged for further blathering on a recent FOSS ecosystem cybersecurity incident.

The gg_spiffy @thomasp85 linked to a post by the SK-CSIRT detailing the discovery and take-down of a series of malicious Python packages. Here’s their high-level incident summary:

SK-CSIRT identified malicious software libraries in the official Python package
repository, PyPI, posing as well known libraries. A prominent example is a fake
package urllib-1.21.1.tar.gz, based upon a well known package
urllib3-1.21.1.tar.gz.
Such packages may have been downloaded by unwitting developer or administrator
by various means, including the popular “pip” utility (pip install urllib).
There is evidence that the fake packages have indeed been downloaded and
incorporated into software multiple times between June 2017 and September 2017.

Words are great but, unlike some other FOSS projects (*cough* R *cough*) the PyPI folks have authoritative log data regarding package downloads from PyPI. This means we can begin to quantify the exposure. The Google BigQuery SQL was pretty straightforward:

SELECT timestamp, file.project as package, country_code, file.version AS version
FROM (
  (TABLE_DATE_RANGE([the-psf:pypi.downloads], TIMESTAMP('2016-01-22'), TIMESTAMP('2017-09-15')))
)
WHERE file.project IN ('acqusition', apidev-coop', 'bzip', 'crypt', 'django-server',
                       'pwd', 'setup-tools', 'telnet', 'urlib3', 'urllib')

Let’s see what the daily downloads of the malicious package look like:

Thanks to Curtis Doty (@dotysan on GH) I learned that the BigQuery table can be further filtered to exclude mirror-to-mirror traffic. The data for that is now in the GH repository and the chart in this callout shows that the exposure was very, very (very) limited:

But, we need counts of the mal-package dopplegangers (i.e. the good packages) to truly understand scope of exposure:

Thankfully, the SK-CSIRT folks caught this in time and the exposure was limited. But those are some popular tools that were targeted and it’s super-easy to sneak these into requirements.txt and scripts since the names are similar and the functionality is duplicated.

I’ll further note that the crypto package was “good” at some point in time then went away and was replaced with the nefarious one. That seems like a pretty big PyPI oversight (vis-a-vis package retirement & name re-use), but I’m not casting stones. R’s devtools::install_github() and wanton source()ing are just as bad, and the non-CRAN ecosystem is an even more varmint-prone “wild west” environment.

Furthermore, this is a potential exposure issue in many FOSS package repository ecosystems. On the one hand, these are open environments with tons of room for experimentation, creativity and collaboration. On the other hand, they’re all-too-easy targets for malicious hackers to prey upon.

I, unfortunately, have no quick-fix solutions to offer. “Review your code and dependencies” is about the best I can suggest until individual ecosystems work on better integrity & authenticity controls or there is a cross-ecosystem effort to establish “best practices” and perhaps even staffed, verified, audited, free services that work like a sheriff+notary to help ensure the safety of projects relying on open source components.

Python folks: double check that you weren’t a victim here (it’s super easy to type some of those package names wrong, and hopefully you’ve noticed builds failing if you had done so).

R folks: don’t be smug, watch your GitHub dependencies and double check your projects.

You can find the data and the scripts used to generate the charts (ironically enough) on GitHub.

Finally: I just want to close with a “thank you!” to PyPI’s Donald Stufft who (quickly!) pointed me to a blog post detailing the BigQuery setup.

I’ve blogged about my in-development R package hgr a before and it’s slowly getting to a CRAN release. There are two new features to it that are more useful in an interactive session than in a programmatic context. Since they build on each other, we’ll take them in order.

New S3 print() Method

Objects created with hgr::just_the_facts() used to be just list objects. Technically they are still list objects but they are also classed as hgr objects. The main reason for this was to support the new default print() method.

When you print() and hgr object, the new S3 method will extract the $content part of the object and pass it through some htmltools functions to display the readability-enhanced content in a browser (whatever R is configured to use for that on your system…you likely use RStudio if you read this blog and it will be in the Viewer pane for most users). This enables visual inspection of the content and (as previously stated) is pretty much only useful in an interactive context.

Rather than show you that now, we’ll see it in the context of the next feature.

‘Just The Facts’ RStudio Addin

I’m going to break with tradition and use a Medium post as an example, mostly because one of the things I detest about that content platform is just how little reading I can do on it (ironic since they touted things like typography when it launched). We’ll use this site as the example (an interesting JS article @timelyportfolio RT’d):

Now, that’s full-screen on a 13″ 2016 MacBook Pro and is pretty much useless. Now, I could (and, would normally) just use the Mercury Reader extension to strip away the cruft:

But, what if I wanted the content itself in R (for further processing) or didn’t feel like using (or policy disable the ability to use) a Chrome extension? Enter: “Just The Facts” RStudio Addin edition.

Just copy a text URL to the clipboard and choose the ‘Just The Facts’ addin:

and, you’ll get a few items as a result:

  • code executed in your R console that will…
  • add a new, unique object in your global environment with the results of a call to hgr::just_the_facts() on the copied URL, and…
  • the object auto-print()ed in the console, which…
  • displays the content in your RStudio Viewer

An image (or two) is definitely worth a 1,000 (ok, 48) words:

FIN

I’m likely going to crank out another addin that works more like a browser (with a URL bar in-Viewer) but also keep an in-environment log of requests so you can use or archive them afterwards.

Take it for a spin and don’t be sy in the GH issues page!