Over at $DAYJOB’s blog I’ve queued up a post that shows how to use our new ropendata package to work with our Open Data portal’s API. I’m not super-sure when it’s going to be posted so keep an RSS reader fixed on https://blog.rapid7.com/ if you’re interested in seeing it (I may make a small note of it here so it can wind its way into R Weekly & R-bloggers).

The example data used in the post is the public version of what I talked about in a recent post here, namely the devices discovered exposing the Ubiquiti Discovery Protocol.

I’m quite blessed at work since we have virtually all of our icky payload data pre-processed and in parquet map columns in Athena so I don’t really have to do much data wrangling once we’ve fully baked a new study.

The format of the public data for the Ubiquiti discovery protocol scan results is a bit different than the base64 encoded data in the previous post in that the payload response is a hex-encoded character string; e.g.

0100009302000a002722bccf9db126fa9a02000a002722bdcf9dc0a80101010006002722bccf9d0a000400006ae40b000c626a732e6572656e696c646f0c00064147352d48500d00104d6f72726f5f446f757261646f5f30330e000102030022584d2e6172373234302e76352e362e332e32383539312e3135313133302e31373439100002e24514000d41697247726964204d35204850

So, every two characters is a byte (e.g. "01" is 0x01).

R has a nice strtoi() function for converting a hex-encoded string into an integer, but it converts each string into a single value, so to get one byte per element we first need to split the payload into two-character chunks. We can split a string (like the one above) into a character vector of length-2 hex strings in many ways, one of which is using helper functions from the stringi package:

library(stringi)
library(magrittr) # for %>%

x <- "0100009302000a002722bccf9db126fa9a02000a002722bdcf9dc0a80101010006002722bccf9d0a000400006ae40b000c626a732e6572656e696c646f0c00064147352d48500d00104d6f72726f5f446f757261646f5f30330e000102030022584d2e6172373234302e76352e362e332e32383539312e3135313133302e31373439100002e24514000d41697247726964204d35204850"

stri_sub(x, seq(1, stri_length(x), by = 2), length = 2)
##   [1] "01" "00" "00" "93" "02" "00" "0a" "00" "27" "22" "bc" "cf" "9d" "b1" "26" "fa" "9a"
##  [18] "02" "00" "0a" "00" "27" "22" "bd" "cf" "9d" "c0" "a8" "01" "01" "01" "00" "06" "00"
##  [35] "27" "22" "bc" "cf" "9d" "0a" "00" "04" "00" "00" "6a" "e4" "0b" "00" "0c" "62" "6a"
##  [52] "73" "2e" "65" "72" "65" "6e" "69" "6c" "64" "6f" "0c" "00" "06" "41" "47" "35" "2d"
##  [69] "48" "50" "0d" "00" "10" "4d" "6f" "72" "72" "6f" "5f" "44" "6f" "75" "72" "61" "64"
##  [86] "6f" "5f" "30" "33" "0e" "00" "01" "02" "03" "00" "22" "58" "4d" "2e" "61" "72" "37"
## [103] "32" "34" "30" "2e" "76" "35" "2e" "36" "2e" "33" "2e" "32" "38" "35" "39" "31" "2e"
## [120] "31" "35" "31" "31" "33" "30" "2e" "31" "37" "34" "39" "10" "00" "02" "e2" "45" "14"
## [137] "00" "0d" "41" "69" "72" "47" "72" "69" "64" "20" "4d" "35" "20" "48" "50"

We still need to run that through strtoi() and turn it into a raw vector (at least for this use-case):

stri_sub(x, seq(1, stri_length(x), by = 2), length = 2) %>%
  strtoi(base = 16) %>%
  as.raw()
##   [1] 01 00 00 93 02 00 0a 00 27 22 bc cf 9d b1 26 fa 9a 02 00 0a 00 27 22 bd cf 9d c0 a8 01
##  [30] 01 01 00 06 00 27 22 bc cf 9d 0a 00 04 00 00 6a e4 0b 00 0c 62 6a 73 2e 65 72 65 6e 69
##  [59] 6c 64 6f 0c 00 06 41 47 35 2d 48 50 0d 00 10 4d 6f 72 72 6f 5f 44 6f 75 72 61 64 6f 5f
##  [88] 30 33 0e 00 01 02 03 00 22 58 4d 2e 61 72 37 32 34 30 2e 76 35 2e 36 2e 33 2e 32 38 35
## [117] 39 31 2e 31 35 31 31 33 30 2e 31 37 34 39 10 00 02 e2 45 14 00 0d 41 69 72 47 72 69 64
## [146] 20 4d 35 20 48 50

On one of my systems, an individual use of that full processing pipeline with the sample string takes about 170μs, which is not bad. But what if we have half a million of them (as was the case with the blog post for work)? Sure, it’s only about a minute and a half of processing time (with some variance since each bit of input is a different length), but that’s a painful interactive 1.5 minutes, and we still need to wrap that bit of code in a vectorized function so it can be used easily.
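Here’s a minimal sketch of what that R-only wrapper might look like (the function name and the NA handling are just for illustration):

library(stringi)
library(magrittr)

# illustrative, vectorized R-only wrapper: takes a character vector of
# hex-encoded payloads and returns a list of raw vectors (NA for bad input)
dehexify_r <- function(input) {
  lapply(input, function(.x) {
    if (is.na(.x) || (nchar(.x) == 0) || (nchar(.x) %% 2 != 0)) return(NA)
    stri_sub(.x, seq(1, stri_length(.x), by = 2), length = 2) %>%
      strtoi(base = 16L) %>%
      as.raw()
  })
}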

This is a good example of where the complexity introduced by using a little C++/Rcpp may be warranted, especially since the BH package—which brings us a ton of capabilities from the Boost C++ library—has some handy string utilities, including a boost::algorithm::unhex() function.

Here’s one way to attack the problem in C++/Rcpp within a plain ol’ R session:

library(Rcpp)

cppFunction(depends = "BH", '
  List dehexify_cpp(StringVector input) {

    List out(input.size()); // make room for our return value

    for (unsigned int i=0; i<input.size(); i++) { // iterate over the input 

      if (StringVector::is_na(input[i]) || (input[i].size() == 0)) {
        out[i] = StringVector::create(NA_STRING); // bad input
      } else if (input[i].size() % 2 == 0) { // likely to be OK input

        RawVector tmp(input[i].size() / 2); // only need half the space
        std::string h = boost::algorithm::unhex(Rcpp::as<std::string>(input[i])); // do the work
        std::copy(h.begin(), h.end(), tmp.begin()); // copy it to our raw vector

        out[i] = tmp; // save it to the List

      } else {
        out[i] =  StringVector::create(NA_STRING); // bad input
      }

    }

    return(out);

  }
', includes = c('#include <boost/algorithm/hex.hpp>')
)

Now, we have a dehexify_cpp() function in our environment, so we can use it on any valid R data. Let’s see if we get the same results as the stringi R version:

dehexify_cpp(x)
## [[1]]
##   [1] 01 00 00 93 02 00 0a 00 27 22 bc cf 9d b1 26 fa 9a 02 00 0a 00 27 22 bd cf 9d c0 a8 01
##  [30] 01 01 00 06 00 27 22 bc cf 9d 0a 00 04 00 00 6a e4 0b 00 0c 62 6a 73 2e 65 72 65 6e 69
##  [59] 6c 64 6f 0c 00 06 41 47 35 2d 48 50 0d 00 10 4d 6f 72 72 6f 5f 44 6f 75 72 61 64 6f 5f
##  [88] 30 33 0e 00 01 02 03 00 22 58 4d 2e 61 72 37 32 34 30 2e 76 35 2e 36 2e 33 2e 32 38 35
## [117] 39 31 2e 31 35 31 31 33 30 2e 31 37 34 39 10 00 02 e2 45 14 00 0d 41 69 72 47 72 69 64
## [146] 20 4d 35 20 48 50

Apart from it being a list (since we took care of vectorization at the same time) it is, indeed, the same data.

With that tiny bit of fairly straightforward Rcpp/C++ code we get a substantially faster execution time of around 4μs. Yep, that’s not a typo: four microseconds.
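If you want to reproduce that comparison, something along these lines (assuming the microbenchmark package is installed and the dehexify_r() sketch from earlier is defined in your session) will do it:

microbenchmark::microbenchmark(
  stringi_strtoi = dehexify_r(x),
  rcpp_boost     = dehexify_cpp(x),
  times = 1000
)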

We’ll give it a real world test with the payload data from work:

# This assumes you have a "~/Data" directory. Put it somewhere
# else if you don't have a "~/Data" directory.

if (!file.exists("~/Data/dehexify-sample-data.txt.gz")) {
  download.file(
    url = "https://rud.is/dl/dehexify-sample-data.txt.gz", 
    destfile = "~/Data/dehexify-sample-data.txt.gz"
  )
}

char_hex_lines <- readr::read_lines("~/Data/dehexify-sample-data.txt.gz")

length(char_hex_lines)
## [1] 501926

res <- dehexify_cpp(char_hex_lines)

That took just over a second to run on my main development system. But, did it really work? I chose index 998 at random so let’s poke at it with the tool from the other blog post:

udpprobe::parse_ubnt_discovery_response(res[[998]])
## [Model: N5N; Firmware: XW.ar934x.v5.5.9.21734.140403.1801; Uptime: 13.1 (hrs)

Aye, it did, indeed, work.

FIN

It’s still early in 2019 and if you haven’t settled on any resolutions yet or want to swap out one that isn’t working so well (who wants to drive to the gym anyway?) for another, perhaps add “experiment with Rcpp” to the list, since a tiny dose of it can go a very long way toward speeding up some tasks.

We worked pretty hard over at $DAYJOB on helping to quantify and remediate a fairly significant configuration weakness in Ubiquiti network gear attached to the internet.

Ubiquiti network gear — routers, switches, wireless access points, etc. — is enterprise-grade and a joy to work with. Our home network is liberally populated with this gear. Ubiquiti nodes have a “discovery service” so they can be identified and brought under the management of a controller. About half-a-million folks either neglected to join a controller or just left the discovery service running and accessible on the internet-facing interface.

We use a myriad of tech to perform discovery and content scans on the internet, everything from zmap to Python and other bits in-between. It’s super easy to craft a quick UDP client in Ruby or Python to talk to nodes that speak UDP and get a response. Unfortunately, R only has built-in connection support for TCP communications. This makes a ton of sense given how R came to be and the primary uses of it for the bulk of its existence. I wanted to be able to test my own, non-internet-exposed Ubiquiti gear and do it from R, thus begat the udpprobe package.

You can install it via your social coding platform of choice (after checking out the source code since you shouldn’t blindly trust any internet git repo):

devtools::install_git("https://git.sr.ht/~hrbrmstr/udpprobe")
# or
devtools::install_gitlab("hrbrmstr/udpprobe")
# or
devtools::install_github("hrbrmstr/udpprobe")

Some good news for you Windows folks is that it actually works on your legacy OS as well!

What can one do with this package?

I’m glad you asked.

We’ll get to the Ubiquiti portion in a bit. First, we’ll hand craft the payload for a DNS lookup for the address of example.com. You make thousands of DNS lookups every day but have likely never poked at what they really look like. These lookups are still generally performed over UDP and the protocol and record formats are thoroughly documented. Here’s what that A record request for example.com looks like:

library(udpprobe)

c(
  0xaa, 0xaa, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 
  0x00, 0x00, 0x07, 0x65, 0x78, 0x61, 0x6d, 0x70, 0x6c, 0x65, 
  0x03, 0x63, 0x6f, 0x6d, 0x00, 0x00, 0x01, 0x00, 0x01
) -> dns_req

In the DNS wire format, names are encoded as length-prefixed labels, so example.com shows up as 0x07 followed by “example” (0x65 0x78 0x61 0x6d 0x70 0x6c 0x65) and then 0x03 followed by “com” (0x63 0x6f 0x6d); the “example” bytes start at position 14 in the raw vector (technically it’s still a numeric vector but I made it in hex so it’s easier to refer to it that way). Right after the last label is a 0x00 terminator, then a two-byte type field (0x00 0x01, an A record) and a two-byte class field (0x00 0x01, the IN class).
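If you want to convince yourself of that layout, you can pull the labels straight back out (a quick sanity check, not part of the request itself):

rawToChar(as.raw(dns_req[14:20]))
## [1] "example"
rawToChar(as.raw(dns_req[22:24]))
## [1] "com"

Now, let’s make Google do some work: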

(resp <- udp_send_payload("8.8.8.8", 53, dns_req))
##  [1] aa aa 81 80 00 01 00 01 00 00 00 00 07 65 78 61 6d 70
## [19] 6c 65 03 63 6f 6d 00 00 01 00 01 c0 0c 00 01 00 01 00
## [37] 00 0c 47 00 04 5d b8 d8 22

You should be able to find example.com in there again if you look closely. We’ll assume the response was OK and yank out the IP address it sent back:

paste0(as.integer(tail(resp, 4)), collapse = ".")
## [1] "93.184.216.34"

and, verify it with Jeroen’s spiffy curl package:

curl::nslookup("example.com")
## [1] "93.184.216.34"

Having some fun with Ubiquiti kit

If you read the aforelinked blog post you’d know that to talk to Ubiquiti gear we need to send 0x01 0x00 0x00 0x00 to UDP port 10001. Since I plan on expanding the udpprobe package to include helper functions for standard probes, we’ll use the singular one provided so far to talk to a known exposed system. Rather than give you the IP address I’ve stored it in an environment variable:

(x <- ubnt_discovery_probe(Sys.getenv("UBNT_TEST_HOST")))
##   [1] 01 00 00 a0 02 00 0a dc 9f db 3a 5f 09 8a ff bd a9 02
##  [19] 00 0a dc 9f db 3b 5f 09 c0 a8 02 01 01 00 06 dc 9f db
##  [37] 3a 5f 09 0a 00 04 00 01 cb 0d 0b 00 15 39 36 39 20 2d
##  [55] 20 4a 75 76 65 6e 61 6c 20 52 69 62 65 69 72 6f 0c 00
##  [73] 03 4c 4d 35 0d 00 11 4e 45 54 53 55 50 45 52 2d 53 49
##  [91] 51 55 45 49 52 41 0e 00 01 02 03 00 22 58 4d 2e 61 72
## [109] 37 32 34 30 2e 76 35 2e 36 2e 35 2e 32 39 30 33 33 2e
## [127] 31 36 30 35 31 35 2e 32 31 31 39 10 00 02 e8 a5 14 00
## [145] 13 4e 61 6e 6f 53 74 61 74 69 6f 6e 20 4c 6f 63 6f 20
## [163] 4d 35

That shortcut function just calls:

udp_send_payload(Sys.getenv("UBNT_TEST_HOST"), 10001L, c(0x01, 0x00, 0x00, 0x00))

Unlike DNS, the Ubiquiti response payload is not formally documented, but folks on the Ubiquiti forums figured most of it out and we added some additional coverage for more “unknown” fields. We can use the built-in parser for these payload responses to see what kind of device it is and what firmware it’s running:

(y <- parse_ubnt_discovery_response(x))
## [Model: LM5; Firmware: XM.ar7240.v5.6.5.29033.160515.2119; Uptime: 1.4 (hrs)

Yep, it even has a pretty-printer. Here’s some of what’s under the hood (again, I’ve redacted things you shouldn’t know about since it could harm the target):

str(y, 1)
List of 10
 $ name       : chr "969 - Juvenal Ribeiro"
 $ model_long : chr "NanoStation Loco M5"
 $ model_short: chr "LM5"
 $ firmware   : chr "XM.ar7240.v5.6.5.29033.160515.2119"
 $ essid      : chr "NETSUPER-SIQUEIRA"
 $ uptime     : int 118619
 $ macs       : chr [1:2] "dc:9f:db:3a:xx:xx" "dc:9f:db:3b:xx:xx"
 $ ips        : chr [1:2] "REDACTED", "REDACTED"
 $ ux0e       : raw 02
 $ ux10       : raw [1:2] e8 a5
 - attr(*, "class")= chr "ubnt_d"

That’s way too much info to be leaking to the internet and 500,000 nodes were gleefully giving it away for anyone that asked.

FIN

Originally the package had a temporary hard-coded 4K response limit and no timeout handling, but there is now support for UDP timeouts and dynamic response sizes (provided you give a good enough buffer hint). I tested it on a Windows VM and it does work, but more testing would be appreciated by those of you on that platform.

Kick the tyres, file issues & PRs and welcome to the world of UDP in R!

The urlscan package (an interface to the urlscan.io API) is now at version 0.2.0 and supports urlscan.io’s authentication requirement when submitting a link for analysis. The service is handy if you want to learn about the details — all the gory technical details — of a website.

For instance, say you wanted to check on r-project.org. You could manually go to the site, enter that into the request bar and wait for the result:

Or, you can use R! First, pick your preferred social coding site, read through the source (this is going to be standing advice for every post starting with this one: don’t blindly trust code from any social coding site) and then install the urlscan package:

devtools::install_git("https://git.sr.ht/~hrbrmstr/urlscan")
# or
devtools::install_gitlab("hrbrmstr/urlscan")
# or
devtools::install_github("hrbrmstr/urlscan")

Next, head on over back to urlscan.io and grab an API key (it’s free). Stick that in your ~/.Renviron under URLSCAN_API_KEY and then readRenviron("~/.Renviron") in your R console.
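A quick sketch of that setup (the key value is obviously a placeholder):

# ~/.Renviron should contain a line like:
#   URLSCAN_API_KEY=your-key-goes-here

readRenviron("~/.Renviron")
nzchar(Sys.getenv("URLSCAN_API_KEY")) # should be TRUE if the key was picked up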

Now, let’s check out r-project.org.

library(urlscan)
library(tidyverse)

rproj <- urlscan_search("r-project.org")

rproj
##   URL Submitted: https://r-project.org/
##   Submission ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
## Submission Type: public
## Submission Note: Submission successful

There is more data in that rproj object but we have enough to get more detailed results back. Note that the site will return an error from urlscan_result() if it hasn’t finished the analysis yet.
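If you’re scripting this, one way to cope is to poll until the scan is ready; here’s a rough sketch (the ten-second interval and the assumption that any error just means “not ready yet” are mine):

# rough polling helper (illustrative): retry urlscan_result() until it succeeds
result_when_ready <- function(scan_id, ...) {
  repeat {
    res <- tryCatch(urlscan_result(scan_id, ...), error = function(e) NULL)
    if (!is.null(res)) return(res)
    Sys.sleep(10)
  }
}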

rproj_res <- urlscan_result("eb2a5da1-dc0d-43e9-8236-dbc340b53772", include_shot = TRUE)

rproj_res
##             URL: https://www.r-project.org/
##         Scan ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
##       Malicious: FALSE
##      Ad Blocked: FALSE
##     Total Links: 8
## Secure Requests: 19
##    Secure Req %: 100%

That rproj_res object holds quite a bit of data and makes no assumptions about how you want to use it, so you will need to do some wrangling to pull out what you need (a quick way to take a peek is sketched just after this list). The rproj_res$scan_result entry contains the following components:

  • task: Information about the submission: Time, method, options, links to screenshot/DOM
  • page: High-level information about the page: Geolocation, IP, PTR
  • lists: Lists of domains, IPs, URLs, ASNs, servers, hashes
  • data: All of the requests/responses, links, cookies, messages
  • meta: Processor output: ASN, GeoIP, AdBlock, Google Safe Browsing
  • stats: Computed stats (by type, protocol, IP, etc.)
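If you just want to see what’s in there before wrangling, a one-liner like this (purely illustrative) gives a top-level view:

# top-level view of the scan result components listed above
str(rproj_res$scan_result, max.level = 1)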

Let’s see how many domains the R Core folks are allowing to track you (if you’re not on a legacy Windows OS you can find curlparse at https://git.sr.ht/~hrbrmstr/curlparse or git[la|hu]b/hrbrmstr/curlparse):

curlparse::domain(rproj_res$scan_result$lists$urls) %>% # you can use urltools::domain() instead of curlparse
  table(dnn = "domain") %>% 
  broom::tidy() %>% 
  arrange(desc(n))
## # A tibble: 7 x 2
##   domain                        n
##   <chr>                     <int>
## 1 platform.twitter.com          7
## 2 www.r-project.org             5
## 3 pbs.twimg.com                 3
## 4 syndication.twitter.com       2
## 5 ajax.googleapis.com           1
## 6 cdn.syndication.twimg.com     1
## 7 r-project.org                 1

Ironically, this is also how I learned that they allow Twitter to insecurely (no subresource integrity nor any content security policy) execute javascript in your browser (twitter javascript is blocked via multiple means at the hrbrmstr compound so I couldn’t see the widget).

Since I added the include_shot = TRUE option, we also get a page screenshot back (as a magick object):

rproj_res$screenshot

FIN

There’s tons of metadata about web sites to explore with this package, so jump in, kick the tyres, have fun, and file issues/PRs as needed.

A major new release of Homebrew has landed and now includes support for Linux as well as Windows (via the Windows Subsystem for Linux)! There are overall stability and speed improvements baked in as well. The aforelinked notification has all the info you need to see the minutiae. Unless you’ve been super-lax in updating, brew update will get you the latest release.

There are extra formulae analytics endpoints and the homebrewanalytics R package has been updated to handle them. A change worth noting in the package is that all the API calls are memoised to avoid hammering the Homebrew servers (though the “API” is really just file endpoints and they aren’t big files but bandwidth is bandwidth). Use the facilities in the memoise package to invalidate the cache if you have long running scripts.
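Once the package is installed and loaded, clearing that cache looks roughly like this (assuming, purely for illustration, that brew_formulae() is one of the memoised endpoints):

# drop the cached result so the next call re-fetches from the Homebrew endpoints
# (assumes brew_formulae() is one of the package's memoised functions)
memoise::forget(brew_formulae)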

Use your favorite social coding site to install it (if I don’t maintain mirrors on your open social coding platform of choice, just drop a note in the comments and I’ll start mirroring there as well):

devtools::install_git("https://git.sr.ht/~hrbrmstr/homebrewanalytics")
# or
devtools::install_gitlab("hrbrmstr/homebrewanalytics")
# or
devtools::install_github("hrbrmstr/homebrewanalytics")

The README and in-package manual pages provide basic examples of retrieving data. But we can improve upon those here, such as finding out the dependency distribution of Homebrew formulae:

library(hrbrthemes)
library(homebrewanalytics) # git.sr.ht/~hrbrmstr ; git[la|hu]b/hrbrmstr
library(tidyverse)

f <- brew_formulae()

mutate(f, n_dependencies = lengths(build_dependencies)) %>% 
  count(n_dependencies) %>% 
  mutate(n_dependencies = factor(n_dependencies)) %>% 
  ggplot() +
  geom_col(aes(n_dependencies, n), fill = ft_cols$slate, width = 0.65) +
  scale_y_comma("# formulae") +
  labs(
    x = "# Dependencies",
    title = "Dependency distribution for Homebrew formulae"
  ) +
  theme_ft_rc(grid="Y")

Given how long it sometimes takes to upgrade my various Homebrew installations I was surprised to see 0 be so prevalent, but one of the major changes in 2.0.0 is going to be more binary installs (unless you really need custom builds) so that is likely part of my experience, especially with the formulae I need to support cybersecurity and spatial operations.

We can also see which dependencies make up the top 50% of library usage across formulae:

unlist(f$dependencies) %>% 
  table(dnn = "library") %>% 
  broom::tidy() %>% 
  arrange(desc(n)) %>% 
  mutate(pct = n/sum(n), cpct = cumsum(pct)) %>% 
  filter(cpct <= 0.5) %>% 
  mutate(pct = scales::percent(pct)) %>% 
  mutate(library = factor(library, levels = rev(library))) %>% 
  ggplot(aes(n, library)) +
  geom_segment(aes(xend=0, yend=library), color = ft_cols$slate, size=3.5) +
  geom_text(
    aes(x = (n+max(n)*0.005), label = sprintf("%s (%s)", n, pct)), 
    hjust = 0, size = 3, family = font_rc, color = ft_cols$gray
  ) +
  scale_x_comma(position = "top", limits=c(0, 500)) +
  labs(
    x = "# package using the library", y = NULL,
    title = "Top 50% of libraries used across Homebrew formulae"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_text(family = "mono"))

It seems openssl is pretty popular (not surprising but always good to see cybersecurity things at the top of good lists for a change)! macOS ships with an even more dreadful (I know that’s hard to imagine) default Python setup than usual so it being number 2 is not unexpected.

And, finally, we can also check on how frequently formulae are installed. Let’s look back over the last 90 days:

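# NOTE: `installs` is assumed to be a data frame of 90-day install-event counts
# (one row per formula, with a `count` column) retrieved via the
# homebrewanalytics package; that retrieval step isn't shown in this excerpt.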
ggplot() +
  geom_density(
    aes(x = installs$count, y = stat(count)),
    color = ft_cols$slate, fill = alpha(ft_cols$slate, 1/2)
  ) +
  scale_x_comma("# install events", trans = "log10") +
  scale_y_comma("# formulae") +
  labs(
    title = "Homebrew Formulate 'Install Events' Distribution (Past 90 days)"
  ) +
  theme_ft_rc(grid="XY")

I’ll let you play with the package to find out who the heavy hitters are and explore more about the Homebrew ecosystem.

FIN

Kick the tyres. File issues & PRs and a hearty “Welcome!” to the Homebrew ecosystem for Linux and Windows users. My hope is that the WSL availability will eventually make it easier to develop for Windows systems and avoid the “download the kinda sketchy compiled windows libraries from github on package install” practice we have today.

If you crank out some analytics using the packages don’t forget to blog about it and drop a link in the comments!

Luke Whyte posted an article (apologies for a Medium link) over on Towards Data Science showing how to use a command line workflow involving curl, node and various D3 libraries and javascript source files to build a series of SVG static maps. It’s well written and you should give it a read especially since he provides the code and data.

We can do all of that in R with the help of a couple of packages and by using a free geocoding service, which will also allow us to put more data on the map, albeit with some extra work since it returns weird values for Hawaii locations.

library(albersusa) # git.sr.ht/~hrbrmstr/albersusa | git[la|hu]b/hrbrmstr/albersusa
library(rgeocodio) # git.sr.ht/~hrbrmstr/rgeocodio | git[la|hu]b/hrbrmstr/rgeocodio
library(tidyverse)

# the data url from the original blog
fil <- "https://query.data.world/s/awyahzfiikyoqi5ygpvauqslwmqltr"

read_csv(fil, col_types = "cd") %>% 
  select(area=1, pct=2) %>% 
  mutate(pct = pct/100) -> xdf # make percents proper percents

gc <- gio_batch_geocode(xdf$area)

The result of the geocoding is a data frame that has various confidence levels associated with each result. We’ll pick the top match for each area and then correct for the errant Hawaii coordinates it gives back:

map2_df(gc$query, gc$response_results, ~{
  out <- .y[1,,]
  out$area <- .x
  out
}) %>% 
  filter(!is.na(location.lat)) %>% 
  select(area, state = address_components.state, lat=location.lat, lon=location.lng) %>% 
  mutate(
    lat = ifelse(grepl("Honolu", area), 21.3069, lat),
    lon = ifelse(grepl("Honolu", area), -157.8583, lon)
  ) %>% 
  left_join(xdf) %>% 
  as_tibble() -> area_pct

area_pct
## # A tibble: 47 x 5
##    area                                        state   lat    lon   pct
##    <chr>                                       <chr> <dbl>  <dbl> <dbl>
##  1 McAllen-Edinburg-Mission, TX                TX     26.2  -98.1 0.102
##  2 Houston-The Woodlands-Sugar Land, TX        TX     29.6  -95.8 0.087
##  3 Santa Maria-Santa Barbara, CA               CA     34.4 -120.  0.081
##  4 Las Vegas-Henderson-Paradise, NV            NV     36.1 -115.  0.08 
##  5 Los Angeles-Long Beach-Anaheim, CA          CA     33.9 -118.  0.075
##  6 Miami-Fort Lauderdale-West Palm Beach, FL   FL     26.6  -80.1 0.073
##  7 Dallas-Fort Worth-Arlington, TX             TX     33.3  -98.4 0.069
##  8 Washington-Arlington-Alexandria, DC-VA-MD-… WV     39.2  -81.7 0.068
##  9 Bridgeport-Stamford-Norwalk, CT             CT     41.3  -73.1 0.067
## 10 San Jose-Sunnyvale-Santa Clara, CA          CA     37.4 -122.  0.065
## # … with 37 more rows

The albersusa package provides base maps with Alaska & Hawaii elided into a composite U.S. map. As such, we need to elide any points that are in Alaska & Hawaii:

us <- usa_composite()
us_map <- fortify(us, region="name")

hi <- select(filter(area_pct, state == "HI"), lon, lat)
(hi <- points_elided(hi))
area_pct[area_pct$state == "HI", c("lon", "lat")] <- hi

Then, it’s just a matter of using ggplot2:

ggplot() +
  geom_map(
    data = us@data, map=us_map,
    aes(map_id=name), 
    fill = "white", color = "#2b2b2b", size = 0.1
  ) +
  geom_point(
    data = area_pct, aes(lon, lat, size = pct), 
    fill = alpha("#b30000", 1/2), color = "#b30000", shape=21
  ) +
  ggalt::coord_proj(us_laea_proj) +
  scale_y_continuous(expand=c(0, 3)) +
  scale_radius(
    name = NULL, label = scales::percent_format(1)
  ) +
  labs(x = "Estimated percent of undocumented residents in U.S. metro areas. Source: Pew Research Center") +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  theme(axis.title.x = element_text(hjust=0.5, size = 8)) +
  theme(axis.title.y = element_blank()) +
  theme(panel.grid = element_blank()) +
  theme(legend.position = c(0.9, 0.3)) -> gg

ggsave(filename = "map.svg", device = "svg", plot = gg, height = 5, width = 7)

Unlike the post’s featured image (which has to be a bitmap…grrr) the resultant SVG is below:

FIN

There is absolutely nothing wrong with working where you’re most comfortable and capable, and Luke definitely wields the command line and javascript incredibly well. This alternate way of doing things in R may help other data journalists (those more comfortable in R, or looking to increase their R knowledge) replicate and expand upon Luke’s process.

If you’ve got alternate ways of doing this in R or even (gasp) Python, drop a note in the comments with a link to your blog so folks who are comfortable neither in R nor the command line can see even more ways of producing this type of content.

The seymour Feedly API package has been updated to support subscribing to RSS/Atom feeds. Previously the package was intended to just treat your Feedly as a data source, but there was a compelling use case for enabling subscription support: subscribing to code repository issues. Sure, there’s already email notice integration for repository issues on most social coding platforms but if you always have Feedly up (like I usually do) having issues aggregated into a Feedly category may be a better way to keep tabs on what’s going on.

If you use GitLab, that platform already has RSS feeds for public repositories. GitHub users have to use either RSSHub or gh-feed to do the same (note that you can host your own instance of either of those tools).

If you have more than a few repos and want to have their issues shunted to Feedly you could manually enter them into the Feedly UI, but that could be a pain, especially if “more than a few” is in the dozens or hundreds. But, we have R and can automate this. I’m providing an example for GitHub (since most readers are still stuck on that legacy platform) via the gh package, but the same approach works for GitLab using the gitlabr package.

First, we need to get the list of public GitHub repositories you own:

library(gh)
library(purrr)
library(seymour) # git.sr.ht/~hrbrmstr/seymour, git[la|hu]b/hrbrmstr/seymour

gh::gh(
  "/user/repos", 
  visibility = "public",
  affiliation = "owner",
  sort = "created", direction = "desc", 
  .token = Sys.getenv("GITHUB_PAT") # see ?devtools::github_pat
) -> gh_repos

If you have more than 30 repositories, the gh package has a gh_next() function that will enable you to paginate through all of them.
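A rough pagination sketch (it assumes gh_next() errors once there are no more pages, and treats that as the stop condition):

# accumulate every page of repos into one list (illustrative only)
all_repos <- gh_repos
page <- gh_repos
repeat {
  page <- tryCatch(gh::gh_next(page), error = function(e) NULL)
  if (is.null(page) || length(page) == 0) break
  all_repos <- c(all_repos, page)
}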

Now you need to choose between gh-feed and RSSHub and prepend that service’s URL prefix to a user/repo string to create a usable RSS URL.

We’ll use RSSHub for the remainder of the example (its prefix, https://rsshub.app/github/, shows up in the code below).

Let’s do this first in base R. feedly_subscribe() takes an RSS/Atom URL as a parameter and optionally supports supplying a title and a vector of Feedly category names to help organize your new addition. The title will be automagically intuited by Feedly if not supplied. We’ll iterate over the return value of our call to the GitHub API and subscribe to each repo’s issue feed.

do.call(
  rbind.data.frame,
  lapply(sapply(gh_repos, "[[", "name"), function(.x) {
    feedly_subscribe(
      feed_url = sprintf("https://rsshub.app/github/%s/%s", "hrbrmstr", .x),
      category = "github issues"
    )
  }) 
) -> res

The res value is a data frame that just has the resultant metadata about feed ids and where they’re located.

Here’s the same thing tidyverse-style:

map_chr(gh_repos, "name") %>% 
  sprintf("https://rsshub.app/github/%s/%s", "hrbrmstr", .) %>% 
  map_df(feedly_subscribe, category = "github issues") -> res

FIN

There are two other additions to the seymour package. The first is feedly_subscriptions(), a convenience function that just pulls a data frame of the feeds you subscribe to; the same data could be retrieved via the existing “stream” functions, but this new function is faster and more targeted. The other is feedly_categories(), which you can use to identify the categories you have; again, the same data could be retrieved via the “collections” functions, but this one is faster and more targeted.
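A quick, illustrative look at both (the exact columns will depend on the package version and your account):

subs <- feedly_subscriptions() # feeds you subscribe to
cats <- feedly_categories()    # categories you've set up

str(subs, max.level = 1)
str(cats, max.level = 1)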

As usual, kick the tyres and file issues/PRs as needed.

(A reminder to folks expecting “R”/”data science” content: the feed for that is at https://rud.is/b/category/r/feed/ if you don’t want to see the occasional non-R/datasci posts.)


Over at the $WORK blog we posted some research into the fairly horrible Cisco RV320/RV325 router vulnerability. The work blog is the work blog and this blog is my blog (i.e. opinions are my own, yada, yada, yada) and I felt compelled to post a cautionary tale for vendors and organizations in general on how security issues can creep into your environment as a result of acquisitions and supply chains.

Looking purely at the evidence gathered from internet scans — which include SSL certificate info — and following the trail in the historical web archive, one can make an informed, speculative claim that the weakness described in CVE-2019-1653 existed well before the final company logo ended up on the product.

It appears that NetKlass was at least producing the boards for this class of SMB VPN router and ultimately ended up supplying them to Linksys. A certain giant organization bought that company (and subsequently sold it off again) and it’s very likely this vulnerability ended up in said behemoth’s lap due both to poor supply chain management — in that Linksys seems to have done no security testing on the sourced parts — and to the acquisition itself, which caused those security issues to end up in a major brand’s product inventory.

This can happen to any organization involved in sourcing hardware/software from a third party and/or involved in acquiring another company. Receiving compliance-driven checkbox forms on the efficacy of the target’s security programs (or software/hardware) is not sufficient, but it is all too common a practice. Real due diligence involves kinda-trusting, then verifying that the claims are accurate.

Rigorous product testing on the part of the original sourcing organization and follow-up assurance testing at the point of acquisition would have very likely caught this issue before it became a responsibly disclosed vulnerability.

As someone who measures all kinds of things on the internet as part of his $DAYJOB, I can say with some authority that huge swaths of organizations are using cloud-services such as Google Apps, Dropbox and Office 365 as part of their business process workflows. For me, one regular component that touches the “cloud” is when I have to share R-generated charts with our spiffy production team for use in reports, presentations and other general communications.

These are typically project-based tasks, and data science team members use git- and AWS-based workflows for gathering data, performing analyses and generating output. While git is great at sharing code and ensuring the historical integrity of our analyses, we don’t expect the production team members to be or become experts in git to use our output. They live in Google Drive, and thanks to the googledrive package we can bridge the gap between code and output with just a few lines of R code.

We use “R projects” to organize things and either use spinnable R scripts or R markdown documents inside those projects to gather, clean and analyze data.

For 2019, we’re using new, work-specific R markdown templates that have one new YAML header parameter:

params:
  gdrive_folder_url: "https://drive.google.com/drive/u/2/SOMEUSELESSHEXSTRING"

which just defines the Google Drive folder URL for the final output directory in the ☁️.

Next is a new pre-configured knitr chunk call at the start of these production chart-generating documents:

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE, dev = c("png", "cairo_pdf"),
  echo = FALSE,
  fig.retina = 2,
  fig.width = 10,
  fig.height = 6,
  fig.path = "prod/charts/"
)

We generate both formats since the production team wants PDFs they can work with in their tools. Our testing showed that cairo_pdf produces the best/most consistent output, but PNGs show up better in the composite HTML documents, so we use that device order deliberately.

The real change is the consistent naming of the fig.path directory. By doing this, all we have to do is add a few lines (again, automatically generated) to the bottom of the document to have all the output automagically go to the proper Google Drive folder:

# Upload to production ----------------------------------------------------

googledrive::drive_auth()

# locate the folder
gdrive_prod_folder <- googledrive::as_id(params$gdrive_folder_url)

# clean it out
gdrls <- googledrive::drive_ls(gdrive_prod_folder)
if (nrow(gdrls) > 0) {
  dplyr::pull(gdrls, id) %>%
    purrr::walk(~googledrive::drive_rm(googledrive::as_id(.x)))
}

# upload new
list.files(here::here("prod/charts"), recursive = TRUE, full.names = TRUE) %>%
  purrr::walk(googledrive::drive_upload, path = gdrive_prod_folder)

Now, we never have to remember to drag documents into a browser and don’t have to load invasive Google applications onto our systems to ensure the right folks have the right files at the right time. We just have to use the new R markdown document type to generate a starter analysis document with all the necessary boilerplate baked in. Plus, the .httr-oauth file is automatically ignored via .gitignore so there’s no information leakage to shared git repositories.

FIN

If you want to experiment with this, you can find a pre-configured template in the markdowntemplates package over at sr.ht, GitLab, or GitHub.

If you install the package you’ll be able to select this output type right from the new document dialog:

and the new template will be ready to go with no copying, cutting or pasting.

Plus, since the Google Drive folder URL is an R markdown parameter, you can also use this in script automation (provided that you’ve wired up oauth correctly for those scripts).
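For example, a scheduled script could render the same document into a specific Drive folder just by overriding that parameter (the document name here is hypothetical and the folder URL mirrors the placeholder in the YAML above):

rmarkdown::render(
  "quarterly-charts.Rmd", # hypothetical production document
  params = list(
    gdrive_folder_url = "https://drive.google.com/drive/u/2/SOMEUSELESSHEXSTRING"
  )
)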