hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

In Dev: WiGLE Your Way Into A Hotspot with wiglr

WiGLE has been around a while and is a great site to explore the pervasiveness or sparsity of Wi-Fi (and cellular) networks around the globe. While interactive use is fun, WiGLE also has a free API (so long as you obey the EULA and aren’t abusive) that lets you explore a little deeper if you register for an account, get a key, and use the WIP wiglr🔗 | GL| GH package.

Install from your least offensive social coding site:

devtools::install_git("https://sr.ht.com/~hrbrmstr/wiglr")
# or
devtools::install_gitlab("hrbrmstr/wiglr")
# or (if you must)
devtools::install_github("hrbrmstr/wiglr")

read the code (to ensure R package developers are pwning or tracking you),and …after putting your “Encoded for use” token in the WIGLE_API_KEY environment variable…start exploring!

Let’s pull all the country stats and plot the top 20:

library(wiglr)
library(hrbrthemes)
library(tidyverse) # for show

cc <- wigle_country_stats()

top_n(cc, 20) %>%  # show top 20 
  mutate(country = factor(country, levels = rev(country))) %>% 
  ggplot(aes(count, country)) +
  geom_segment(aes(xend=0, yend=country), size = 6, color = ft_cols$blue) +
  scale_x_comma(position = "top") +
  labs(
    x = "# Networks", y = NULL,
    title = "WiGLE Top 20 Countries"
  ) +
  theme_ipsum_rc(grid = "X")

We can also supply a bounding box and find all the Wi-Fi access points in the vicinity and use leaflet to make a handy interactive map (you can add other columns, too, to find fully open ones):

library(leaflet)

wifi_box <- wigle_bbox_search(43.2468, 43.2806, -70.9282, -70.8025)

wifi_box$results %>% 
  mutate(labs = sprintf("SSID: <code>%s</code><br/>Encryption: %s", ssid, encryption)) %>% 
  leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(~trilong, ~trilat, radius = 1, popup = ~labs)

There’s good but not complete WiGLE API coverage:

wigle_about_me: Get WiGLE user object for the current logged-in user
wigle_api_key: Get or set WIGLE_API_KEY value
wigle_bbox_search: Get WiGLE named map of general statistics
wigle_country_stats: Get WiGLE statistics organized by country
wigle_region_stats: Get WiGLE statistics for a specified country, organized by region
wigle_site_stats: Get WiGLE named map of general statistics

FIN

The package is a WIP but the API is really straightforward, so if you’re looking to contribute to a (dare I say “fun!”) open source project in 2019 now’s your chance! Just drop an issue on whatever social coding site you prefer (ping me in a comment if you use something besides SourceHut, GitLab or GitHub and I’ll get the package up there) and spec out what you’d like to do or just file a well-articulated PR. Potential areas to enhance are:

support for more search parameters
a function for automatic pagination
cover the remaining search/retrieval API endpoints
investigate how to use R to submit readings
create an internal package Shiny app for exploring WiGLE straight from R
develop standard base visualizations and add them as package functions
write a vignette
improve package docs
develop a statistical model for the likelihood on finding free/open Wi-Fi in an area or the impact of Wi-Fi/cellular deserts on communities by layering in Census data

Working with or developing for the package may also help shed some light on yet-another-way we’re exposed in the digital world.

As always you’ll get judgement-free help/suggestions. full credit in the DESCRIPTION and the fame and glory of CRAN if it ever makes its way through the process. The package does just enough for me right now so it will be a while before I get to the above TODO list if others don’t jump in first.

Conquering Caffeinated Amazon Athena with the metis Trio of Packages

I must preface this post with the posit that if you’re doing anything interactive() with Amazon Athena you should seriously consider just using their free ODBC drivers as it’s the easiest way to wire them up to R DBI- and tidyverse-wise. I’ve said as much in previous posts. Drop a note in the comments if you don’t know the incantations for repackaging the provided Linux ODBC drivers to work on your flavor of Linux.

However…

There are times—say, when you’re trying to stand up an R service in your kubernetes cluster which bridges data in Athena to analyses & visualizations in R—when ODBC drivers can be more of a hindrance than help and JDBC is the path of least resistance.

Sure, there’s the in-CRAN AWR.Athena package but it’s a fairly constrained and low-feature RJDBC shim which gets the basic job done but not much more.

Enter:

a trio of packages which aims to make it super-straightforward to wire up R to Amazon Athena when ODBC is not available.

Why Three Packages?

For starters, there are CRAN hopes for the metis-trio and one key component of that is separating out the JARs into one package (metis.jars) and actual functionality into others (metis and metis.tidy). We’ll see how the CRAN attempt goes since the JAR package weighs in at sufficient weight to warrant a NOTE. The packaging of the driver reduces the need for you to pre-load the JAR (locally or into, say, a Docker image) or perform a package-initiated download-dance like AWR.Athena does (which I still don’t understand why that hasn’t kicked it out of CRAN the way it does it but ¯\_(ツ)_/¯).

metis.jars also has three helper functions which do some (basic) fun things:

library(metis.jars)

simba_driver_version()
## [1] "02.00.06.1008"

athena_supported_types()
##  [1] "BOOLEAN"   "TINYINT"   "SMALLINT"  "INT"       "INTEGER"  
##  [6] "BIGINT"    "REAL"      "FLOAT"     "DOUBLE"    "DECIMAL"  
## [11] "DATE"      "TIMESTAMP" "BINARY"    "VARBINARY" "CHAR"     
## [16] "VARCHAR"   "STRING"    "ARRAY"     "MAP"       "ROW"      
## [21] "STRUCT"   

metis_jar_path()
## [1] "/Library/Frameworks/R.framework/Versions/3.5/Resources/library/metis.jars/java/AthenaJDBC42_2.0.6.jar"

The first uses the rJava interface to directly query the version (since Amazon seems to update the Simba JAR twice a year). By separating out the JAR into a separate package, updates can be made to the other two sibling packages more frequently without crushing CRAN’s disk space. metis.jars is also versioned to the included JAR so configuration management will be easier for folks.

The reason for the second type-lister function is that there’s hope Amazon will add support for all Presto data types, especially IPADDRESS. It, again, performs JDBC driver introspection to collect the supported types.

Finally, the third function abstracts the JAR location from the metis package or even your own interface package should you choose to depend on it.

OK, But Why Not Just Two?

The metis package is a more functional RJDBC superclass of a DBI wrapper than AWR.Athena. One thing it does that its CRAN cousin cannot is handle BIGINTs properly:

library(metis)

dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

dbGetQuery(con, "
SELECT
  CAST('chr' AS CHAR(4)) achar,
  CAST('varchr' AS VARCHAR) avarchr,
  CAST(SUBSTR(timestamp, 1, 10) AS DATE) AS tsday,
  CAST(100.1 AS DOUBLE) AS justadbl,
  CAST(127 AS TINYINT) AS asmallint,
  CAST(100 AS INTEGER) AS justanint,
  CAST(100000000000000000 AS BIGINT) AS abigint,
  CAST(('GET' = 'GET') AS BOOLEAN) AS is_get,
  ARRAY[1, 2, 3] AS arr1,
  ARRAY['1', '2, 3', '4'] AS arr2,
  MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]) AS mp,
  CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)) AS rw,
  CAST('{\"a\":1}' AS JSON) js
FROM elb_logs
LIMIT 1
") %>% 
  dplyr::glimpse()
## Observations: 1
## Variables: 13
## $ achar     <chr> "chr "
## $ avarchr   <chr> "varchr"
## $ tsday     <date> 2014-09-29
## $ justadbl  <dbl> 100.1
## $ asmallint <int> 127
## $ justanint <int> 100
## $ abigint   <S3: integer64> 100000000000000000
## $ is_get    <lgl> TRUE
## $ arr1      <chr> "1, 2, 3"
## $ arr2      <chr> "1, 2, 3, 4"
## $ mp        <chr> "{bar=2, foo=1}"
## $ rw        <chr> "{x=1, y=2.0}"
## $ js        <chr> "\"{\\\"a\\\":1}\""

~~Presto~~Athena arrays and maps and rows and JSON come across as characters from the Athena driver and they’re formatted so badly that there’s little hope of full R support for list columns for them. But, you do get real, big integers with metis along with full support for all other current Athena types.

R folk who may be users of the old, standalone metis package need to be aware of some things.

First, dbConnect() has breaking changes. The snake_case names that still exist in the higher-level athena_jdbc() function are gone. In exchange for this pain, you now have full naming-parity with all the Athena JDBC connection properties and can more easily use alternate credential providers which metis‘ cousin totally cannot do for you which is illustrated in the example above and in the package README.

The metis package also makes it easier to see documentation for all available Athena connection properties since it has a vignette with a descriptive table of all of them (rendered here).

There is also nascent support for the “streaming API” (TLDR: faster result set downloads) but that won’t be fully tested until some AWS policy tweaks happen this week.

Gotcha. But, Why Not Just Two?

As awesome as it is (including base Docker image support) the tidyverse is not without overhead in terms of compilation time and dependencies, both of which are especially painful on Linux systems and some Docker environments. You can absolutely get by with some well-crafted SQL and JDBC and the thinner the image the easier it is to deploy and scale.

But! The tidyverse is so helpful that ensuring smooth support for Athena is critical. On its own, metis wires up to dplyr/dbplyr fine, but by providing (in metis.tidy) some enhanced db_data_type() support (primarily for BIGINT) and some extra 💙 in sql_translate_env() )for those of us who continue to mindlessly use R-only verbs like grep() or as.POSIXct() in non-R contexts) we can level-up interactive() use and tidyverse-infused service use:

library(metis.tidy)
library(dbplyr)
library(dplyr)

metis::dbConnect(
  metis::Athena(),
  Schema = "sampledb",
  AwsCredentialsProviderClass = "com.simba.athena.amazonaws.auth.PropertiesFileCredentialsProvider",
  AwsCredentialsProviderArguments = path.expand("~/.aws/athenaCredentials.props")
) -> con

elb_logs <- tbl(con, "elb_logs")

filter(elb_logs, grepl("20", elbresponsecode)) %>%
  mutate(
    tsday = as.Date(substring(timestamp, 1L, 10L)),
    host = url_extract_host(url),
    proto_version = regexp_extract(protocol, "([[:digit:]\\.]+)"),
  ) %>%
  select(tsday, host, receivedbytes, requestprocessingtime, proto_version) %>%
  head(1) %>%
  glimpse()
## Observations: ??
## Variables: 5
## Database: AthenaConnection
## $ tsday                 <date> 2014-09-29
## $ host                  <chr> "www.abcxyz.com"
## $ receivedbytes         <S3: integer64> 0
## $ requestprocessingtime <dbl> 9.5e-05
## $ proto_version         <chr> "1.1"

FIN

A fairly big impetus for this radical refactoring was the need to use the Athena JDBC interface in R at $DAYJOB in a serverless context. So, if I/we needed it, others may as well. All three packages have tests (that work with my personal Athena setup which is easily replicated since it’s just the default schema & table you get when you enable Athena), pass CRAN checks and will be live in a real production environment by the time you read this.

Note that I do have CRAN plans for these three amigos, but all three packages will need to go in at the same time and I need to get tests into- and prove tests are live in Travis before submitting. Now’s the time for feature requests, problem reports or issues. Until SourceHut’s (sr.ht) API is finished, said contributions are best left to GitLab (preferably) or GitHub (if you must continue to fill the coffers of giant multional companies that undermine your freedom).

POSTSCRIPT

One other reason for re-visiting metis was this R-crashing rJava issue that is really a Simba Athena implementation issue (OS signals in a JDBC driver, rly?)

This Rprofile entry:

options(
  "java.parameters" = c(getOption("java.parameters", default = NULL), "-Xrs")
)

has been a solid workaround until rJava is updated. Note that metis.jars warns about this on load if it detects your setup is at risk.

Using the ropendata R Package to Access Petabytes of Free Internet Telemetry Data from Rapid7

I’ve got a post up over at $DAYJOB’s blog on using the ropendata🔗 package to access the ginormous and ever-increasing amount of internet telemetry (scan) data via the Rapid7 Open Data API.

It’s super-R-code-heavy but renders surprisingly well in Ghost (the blogging platform we use at work) and covers everything from where to sign up for an account, installing the package (it’s on CRAN!), enriching scan data with free geolocation databases and rgeolocate plus even shows how to cobble together some fairly intricate R + Rcpp code so you can decode gnarly, binary UDP responses from esoteric protocol exchanges.

Give it a look, grab the companion Rmd and drop any comments there or here if you have any questions, or reach out to research@rapid7.com.

Quick Hit: Speeding Up a Slow/Mundane Task with a Little Rcpp

Over at $DAYJOB’s blog I’ve queued up a post that shows how to use our new ropendata? package to work with our Open Data portal’s API. I’m not super-sure when it’s going to be posted so keep an RSS reader fixed on https://blog.rapid7.com/ if you’re interested in seeing it (I may make a small note of it here so it can wind its way into R Weekly & R-bloggers).

The example data used in the post is the public version of what I talked about in a recent post here, namely the devices discovered exposing the Ubiquity Discovery Protocol.

I’m quite blessed at work since we have virtually all of our icky payload data pre-processed and in parquet map columns in Athena so I don’t really have to do much data wrangling once we’ve fully baked a new study.

The format of the public data for the Ubiquiti discovery protocol scan results is a bit different than the base64 encoded data in the previous post in that the payload response is a hex-encoded character string; e.g.

0100009302000a002722bccf9db126fa9a02000a002722bdcf9dc0a80101010006002722bccf9d0a000400006ae40b000c626a732e6572656e696c646f0c00064147352d48500d00104d6f72726f5f446f757261646f5f30330e000102030022584d2e6172373234302e76352e362e332e32383539312e3135313133302e31373439100002e24514000d41697247726964204d35204850

So, every two characters is a byte (e.g. "01" is 0x01).

R has a nice strtoi() function for converting a hex-encoded byte into a raw value but it only works for one byte. We can split a string (like the one above) into a character vector of length 2 hex strings in many ways, one of which is using helper functions from the stringi package:

library(stringi)
library(magrittr) # for %>%

x <- "0100009302000a002722bccf9db126fa9a02000a002722bdcf9dc0a80101010006002722bccf9d0a000400006ae40b000c626a732e6572656e696c646f0c00064147352d48500d00104d6f72726f5f446f757261646f5f30330e000102030022584d2e6172373234302e76352e362e332e32383539312e3135313133302e31373439100002e24514000d41697247726964204d35204850"

stri_sub(x, seq(1, stri_length(x), by = 2), length = 2)
##   [1] "01" "00" "00" "93" "02" "00" "0a" "00" "27" "22" "bc" "cf" "9d" "b1" "26" "fa" "9a"
##  [18] "02" "00" "0a" "00" "27" "22" "bd" "cf" "9d" "c0" "a8" "01" "01" "01" "00" "06" "00"
##  [35] "27" "22" "bc" "cf" "9d" "0a" "00" "04" "00" "00" "6a" "e4" "0b" "00" "0c" "62" "6a"
##  [52] "73" "2e" "65" "72" "65" "6e" "69" "6c" "64" "6f" "0c" "00" "06" "41" "47" "35" "2d"
##  [69] "48" "50" "0d" "00" "10" "4d" "6f" "72" "72" "6f" "5f" "44" "6f" "75" "72" "61" "64"
##  [86] "6f" "5f" "30" "33" "0e" "00" "01" "02" "03" "00" "22" "58" "4d" "2e" "61" "72" "37"
## [103] "32" "34" "30" "2e" "76" "35" "2e" "36" "2e" "33" "2e" "32" "38" "35" "39" "31" "2e"
## [120] "31" "35" "31" "31" "33" "30" "2e" "31" "37" "34" "39" "10" "00" "02" "e2" "45" "14"
## [137] "00" "0d" "41" "69" "72" "47" "72" "69" "64" "20" "4d" "35" "20" "48" "50"

We still need to run that through strtoi() and turn it into a raw vector (at least for this use-case):

stri_sub(x, seq(1, stri_length(x), by = 2), length = 2) %>%
  strtoi(base = 16) %>%
  as.raw()
##   [1] 01 00 00 93 02 00 0a 00 27 22 bc cf 9d b1 26 fa 9a 02 00 0a 00 27 22 bd cf 9d c0 a8 01
##  [30] 01 01 00 06 00 27 22 bc cf 9d 0a 00 04 00 00 6a e4 0b 00 0c 62 6a 73 2e 65 72 65 6e 69
##  [59] 6c 64 6f 0c 00 06 41 47 35 2d 48 50 0d 00 10 4d 6f 72 72 6f 5f 44 6f 75 72 61 64 6f 5f
##  [88] 30 33 0e 00 01 02 03 00 22 58 4d 2e 61 72 37 32 34 30 2e 76 35 2e 36 2e 33 2e 32 38 35
## [117] 39 31 2e 31 35 31 31 33 30 2e 31 37 34 39 10 00 02 e2 45 14 00 0d 41 69 72 47 72 69 64
## [146] 20 4d 35 20 48 50

On one of my systems, an individual use of that full processing pipeline with the sample string takes about 170μs which is not bad. But, what if we have half a million of them (as was the case with the blog post for work)? I mean, sure, it’s only about a minute and a half of processing time (with some variance as each bit of input will be of different lengths), but that’s a painful interactive 1.5 minutes and we still need to wrap that bit of code in a function with some vectorization so it can be used easily.

This is a good example of where the complexity introduced by using a little C++/Rcpp may be warranted, especially since the BH package—which brings us a ton of capabilities from the Boost C++ library—has some handy string utilities, including an boost::algorithm::unhex() function.

Here’s one way to attack the problem in C++/Rcpp within a plain ol’ R session:

library(Rcpp)

cppFunction(depends = "BH", '
  List dehexify_cpp(StringVector input) {

    List out(input.size()); // make room for our return value

    for (unsigned int i=0; i<input.size(); i++) { // iterate over the input 

      if (StringVector::is_na(input[i]) || (input[i].size() == 0)) {
        out[i] = StringVector::create(NA_STRING); // bad input
      } else if (input[i].size() % 2 == 0) { // likey to be ok input

        RawVector tmp(input[i].size() / 2); // only need half the space
        std::string h = boost::algorithm::unhex(Rcpp::as<std::string>(input[i])); // do the work
        std::copy(h.begin(), h.end(), tmp.begin()); // copy it to our raw vector

        out[i] = tmp; // save it to the List

      } else {
        out[i] =  StringVector::create(NA_STRING); // bad input
      }

    }

    return(out);

  }
', includes = c('#include <boost/algorithm/hex.hpp>')
)

Now, we have a dehexify_cpp() function in our environment, so we can use it on any valid R data. Let’s see if we get the same results as the stringi R version:

dehexify_cpp(x)
## [[1]]
##   [1] 01 00 00 93 02 00 0a 00 27 22 bc cf 9d b1 26 fa 9a 02 00 0a 00 27 22 bd cf 9d c0 a8 01
##  [30] 01 01 00 06 00 27 22 bc cf 9d 0a 00 04 00 00 6a e4 0b 00 0c 62 6a 73 2e 65 72 65 6e 69
##  [59] 6c 64 6f 0c 00 06 41 47 35 2d 48 50 0d 00 10 4d 6f 72 72 6f 5f 44 6f 75 72 61 64 6f 5f
##  [88] 30 33 0e 00 01 02 03 00 22 58 4d 2e 61 72 37 32 34 30 2e 76 35 2e 36 2e 33 2e 32 38 35
## [117] 39 31 2e 31 35 31 31 33 30 2e 31 37 34 39 10 00 02 e2 45 14 00 0d 41 69 72 47 72 69 64
## [146] 20 4d 35 20 48 50

Apart from it being a list (since we took care of vectorization at the same time) it is, indeed, the same data.

With that tiny bit of fairly straightforward Rcpp/C++ code we get a substantially faster execution time of around 4μs. Yep, that’s not a typo: four microseconds.

We’ll give it a real world test with the payload data from work:

# This assumes you have a "~/Data" directory. Put it somewhere
# else if you don't have a "~/Data" directory.

if (!file.exists("~/Data/dehexify-sample-data.txt.gz")) {
  download.file(
    url = "https://rud.is/dl/dehexify-sample-data.txt.gz", 
    destfile = "~/Data/dehexify-sample-data.txt.gz"
  )
}

char_hex_lines <- readr::read_lines("~/Data/dehexify-sample-data.txt.gz")

length(char_hex_lines)
## [1] 501926

res <- dehexify_cpp(char_hex_lines)

That took just over a second to run on my main development system. But, did it really work? I chose index 998 at random so let’s poke at it with the tool from the other blog post:

udpprobe::parse_ubnt_discovery_response(res[[998]])
## [Model: N5N; Firmware: XW.ar934x.v5.5.9.21734.140403.1801; Uptime: 13.1 (hrs)

Aye, it did, indeed, work.

FIN

It’s still early in 2019 and if you haven’t settled on any resolutions yet or want to substitute out one that isn’t working so well (who wants to drive to the gym anyway?) with another, perhaps add “experiment with Rcpp” to the list since a tiny dose of it can go a very long way into speeding up some tasks.

Send UDP Probes (with payloads) and Receive/Process Responses in R

We worked pretty hard over at $DAYJOB on helping to quantify and remediate a fairly significant configuration weakness in Ubiquiti network work gear attached to the internet.

Ubiquiti network gear — routers, switches, wireless access points, etc. — are enterprise grade components and are a joy to work with. Our home network is liberally populated with this gear. Ubiquiti nodes have a “discovery service” so they can be identified and brought under management of a controller. About half-a-million folks neglected to either join a controller or just left the discovery service running and accessible via the internet interface. ?

We use a myriad of tech to perform discovery and content scans on the internet, everything from zmap to Python and other bits in-between. It’s super easy to craft a quick UDP client in Ruby or Python to talk to nodes that speak UDP and get a response. Unfortunately, R only has built-in connection support for TCP communications. This makes a ton of sense given how R came to be and the primary uses of it for the bulk of its existence. I wanted to be able to test my own, non-internet-exposed Ubiquiti gear and do it from R, thus begat the udpprobe? package.

You can install it via your social coding platform of choice (after checking out the source code since you shouldn’t blindly trust any internet git repo):

devtools::install_git("https://git.sr.ht/~hrbrmstr/udpprobe")
# or
devtools::install_gitlab("hrbrmstr/udpprobe")
# or
devtools::install_github("hrbrmstr/udpprobe")

Some good news for you Windows folks is that it actually works on your legacy OS as well!

What can one do with this package?

I’m glad you asked.

We’ll get to the Ubiquiti portion in a bit. First, we’ll hand craft the payload for a DNS lookup for the address of example.com. You make thousands of DNS lookups every day but likely never poked at what they really look like. These lookups are still generally performed over UDP and the protocol and record formats are [thoroughly documented]. Here’s what that A record request for example.com looks like:

library(udpprobe)

c(
  0xaa, 0xaa, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 
  0x00, 0x00, 0x07, 0x65, 0x78, 0x61, 0x6d, 0x70, 0x6c, 0x65, 
  0x03, 0x63, 0x6f, 0x6d, 0x00, 0x00, 0x01, 0x00, 0x01
) -> dns_req

The string example.com is 0x65 0x78 0x61 0x6d 0x70 0x6c 0x65 0x2e 0x63 0x6f 0x6d and you can find it at position 14 in the raw vector (technically it’s still a numeric vector but I made it in hex so it’s easier to refer to it that way). Right after it is a terminator then a 2 byte sequence to tell the DNS server we’re looking for an A record. Let’s make Google do some work:

(resp <- udp_send_payload("8.8.8.8", 53, dns_req))
##  [1] aa aa 81 80 00 01 00 01 00 00 00 00 07 65 78 61 6d 70
## [19] 6c 65 03 63 6f 6d 00 00 01 00 01 c0 0c 00 01 00 01 00
## [37] 00 0c 47 00 04 5d b8 d8 22

You should be able to find example.com in there again if you look closely. We’ll assume the response was OK and yank out the IP address it sent back:

paste0(as.integer(tail(resp, 4)), collapse = ".")
## [1] "93.184.216.34"

and, verify it with Jeroen’s spiffy curl package:

curl::nslookup("example.com")
## [1] "93.184.216.34"

Having some fun with Ubiquiti kit

If you read the aforelinked blog post you’d know that to talk to Ubiquiti gear we need to send 0x01 0x00 0x00 0x00 to UDP port 10001. Since I plan on expanding the ubpprobe package to include helper functions for standard probes, we’ll use the singular one provided so far to talk to a known exposed system. Rather than give you the IP address I’ve stored it in an environment variable:

(x <- ubnt_discovery_probe(Sys.getenv("UBNT_TEST_HOST")))
##   [1] 01 00 00 a0 02 00 0a dc 9f db 3a 5f 09 8a ff bd a9 02
##  [19] 00 0a dc 9f db 3b 5f 09 c0 a8 02 01 01 00 06 dc 9f db
##  [37] 3a 5f 09 0a 00 04 00 01 cb 0d 0b 00 15 39 36 39 20 2d
##  [55] 20 4a 75 76 65 6e 61 6c 20 52 69 62 65 69 72 6f 0c 00
##  [73] 03 4c 4d 35 0d 00 11 4e 45 54 53 55 50 45 52 2d 53 49
##  [91] 51 55 45 49 52 41 0e 00 01 02 03 00 22 58 4d 2e 61 72
## [109] 37 32 34 30 2e 76 35 2e 36 2e 35 2e 32 39 30 33 33 2e
## [127] 31 36 30 35 31 35 2e 32 31 31 39 10 00 02 e8 a5 14 00
## [145] 13 4e 61 6e 6f 53 74 61 74 69 6f 6e 20 4c 6f 63 6f 20
## [163] 4d 35

That shortcut function just calls:

udp_send_payload(Sys.getenv("UBNT_TEST_HOST"), 10001L, c(0x01, 0x00, 0x00, 0x00))

Unlike DNS, the Ubiquiti response payload is not formally documented, but folks on the Ubiquiti forums figured most of it out and we added some additional coverage for more “unknown” fields. We can use the built-in parser for these payload responses to see what kind of device it is and what firmware it’s running:

(y <- parse_ubnt_discovery_response(x))
## [Model: LM5; Firmware: XM.ar7240.v5.6.5.29033.160515.2119; Uptime: 1.4 (hrs)

Yep, it even has a pretty-printer. Here’s some of what’s under the hood (again, I’ve redacted things you shouldn’t know about since it could harm the target:

str(y, 1)
List of 10
 $ name       : chr "969 - Juvenal Ribeiro"
 $ model_long : chr "NanoStation Loco M5"
 $ model_short: chr "LM5"
 $ firmware   : chr "XM.ar7240.v5.6.5.29033.160515.2119"
 $ essid      : chr "NETSUPER-SIQUEIRA"
 $ uptime     : int 118619
 $ macs       : chr [1:2] "dc:9f:db:3a:xx:xx" "dc:9f:db:3b:xx:xx"
 $ ips        : chr [1:2] "REDACTED", "REDACTED"
 $ ux0e       : raw 02
 $ ux10       : raw [1:2] e8 a5
 - attr(*, "class")= chr "ubnt_d"

That’s way too much info to be leaking to the internet and 500,000 nodes were gleefully giving it away for anyone that asked.

FIN

~~I need to add support for UDP timeouts and dynamic response sizes (there’s a temporary hard-coded limit of 4K)~~. There is support for UDP timeouts and dynamic response sizes (provided you give a good enough buffer hint). I tested it on a Windows VM and it does work but more testing would be appreciated by those of you on that platform.

Kick the tyres, file issues & PRs and welcome to the world of UDP in R!

R Package Update: urlscan

The urlscan? package (an interface to the urlscan.io API) is now at version 0.2.0 and supports urlscan.io’s authentication requirement when submitting a link for analysis. The service is handy if you want to learn about the details — all the gory technical details — for a website.

For instance, say you wanted to check on r-project.org. You could manually go to the site, enter that into the request bar and wait for the result:

Or, you can use R!. First pick your preferred social coding site, read through the source (this is going to be new advice for every post starting with this one. Don’t blindly trust code from any social coding site) an then install the urlscan package:

devtools::install_git("https://git.sr.ht/~hrbrmstr/urlscan")
# or
devtools::install_gitlab("hrbrmstr/urlscan")
# or
devtools::install_github("hrbrmstr/urlscan")

Next, head on over back to urlscan.io and grab an API key (it’s free). Stick that in your ~/.Renviron under URLSCAN_API_KEY and then readRenviron("~/.Renviron") in your R console.

Now, let’s check out r-project.org.

library(urlscan)
library(tidyverse)

rproj <- urlscan_search("r-project.org")

rproj
##   URL Submitted: https://r-project.org/
##   Submission ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
## Submission Type: public
## Submission Note: Submission successful

There is more data in that rproj object but we have enough to get more detailed results back. Note that site will return an error when you use urlscan_result() if it hasn’t finished the analysis yet.

rproj_res <- urlscan_result("eb2a5da1-dc0d-43e9-8236-dbc340b53772", include_shot = TRUE)

rproj_res
##             URL: https://www.r-project.org/
##         Scan ID: eb2a5da1-dc0d-43e9-8236-dbc340b53772
##       Malicious: FALSE
##      Ad Blocked: FALSE
##     Total Links: 8
## Secure Requests: 19
##    Secure Req %: 100%

That rproj_res holds quite a bit of data and makes no assumptions about how you want to use it so you will need to do some wrangling with it to find out. The rproj_res$scan_result entry contains entries with the following information:

task: Information about the submission: Time, method, options, links to screenshot/DOM
page: High-level information about the page: Geolocation, IP, PTR
lists: Lists of domains, IPs, URLs, ASNs, servers, hashes
data: All of the requests/responses, links, cookies, messages
meta: Processor output: ASN, GeoIP, AdBlock, Google Safe Browsing
stats: Computed stats (by type, protocol, IP, etc.)

Let’s see how many domains the R Core folks are allowing to track you (if you’re not on legacy Windows OS you can find curlparse at https://git.sr.ht/~hrbrmstr/curlparse or git[la|hu]b/hrbrmstr/curlparse:

curlparse::domain(rproj_res$scan_result$lists$urls) %>% # you can use urltools::domain() instead of curlparse
  table(dnn = "domain") %>% 
  broom::tidy() %>% 
  arrange(desc(n))
## # A tibble: 7 x 2
##   domain                        n
##   <chr>                     <int>
## 1 platform.twitter.com          7
## 2 www.r-project.org             5
## 3 pbs.twimg.com                 3
## 4 syndication.twitter.com       2
## 5 ajax.googleapis.com           1
## 6 cdn.syndication.twimg.com     1
## 7 r-project.org                 1

Ironically, this is also how I learned that they allow Twitter to insecurely (no subresource integrity nor any content security policy) execute javascript in your browser (twitter javascript is blocked via multiple means at the hrbrmstr compound so I couldn’t see the widget).

Since I added the include_shot = TRUE option, we also get a page screenshot back (as a magick object):

rproj_res$screenshot

FIN

There’s tons of metadata to explore about web sites by using this package so jump in, kick the tyres, have fun! and file issues/PRs as needed.

Homebrew 2.0.0 Released == homebrewanalytics package updated

A major new release of Homebrew has landed and now includes support for Linux as well as Windows! via the Windows Subsystem for Linux. There are overall stability and speed improvements baked in as well. The aforelinked notification has all the info you need to see the minutiae. Unless you’ve been super-lax in updating, brew update will get you the latest release.

There are extra formulae analytics endpoints and the homebrewanalytics? R package has been updated to handle them. A change worth noting in the package is that all the API calls are memoised to avoid hammering the Homebrew servers (though the “API” is really just file endpoints and they aren’t big files but bandwidth is bandwidth). Use the facilities in the memoise package to invalidate the cache if you have long running scripts.

Use your favorite social coding site to install it (If I don’t maintain mirrors on your open social coding platform of choice just drop a note in the comments and I’ll start mirroring there as well):

devtools::install_git("https://git.sr.ht/~hrbrmstr/homebrewanalytics")
# or
devtools::install_gitlab("hrbrmstr/homebrewanalytics")
# or
devtools::install_github("hrbrmstr/homebrewanalytics")

The README and in-package manual pages provide basic examples of retrieving data. But we can improve upon those here, such as finding out the dependency distribution of Homebrew formulae:

library(hrbrthemes)
library(homebrewanalytics) # git.sr.hr/~hrbrmstr ; git[la|hu]b/hrbrmstr
library(tidyverse)

f <- brew_formulae()

mutate(f, n_dependencies = lengths(build_dependencies)) %>% 
  count(n_dependencies) %>% 
  mutate(n_dependencies = factor(n_dependencies)) %>% 
  ggplot() +
  geom_col(aes(n_dependencies, n), fill = ft_cols$slate, width = 0.65) +
  scale_y_comma("# formulae") +
  labs(
    x = "# Dependencies",
    title = "Dependency distribution for Homebrew formulae"
  ) +
  theme_ft_rc(grid="Y")

Given how long it sometimes takes to upgrade my various Homebrew installations I was surprised to see 0 be so prevalent, but one of the major changes in 2.0.0 is going to be more binary installs (unless you really need custom builds) so that is likely part of my experience, especially with the formulae I need to support cybersecurity and spatial operations.

We can also see which formuale are in the top 50%:

unlist(f$dependencies) %>% 
  table(dnn = "library") %>% 
  broom::tidy() %>% 
  arrange(desc(n)) %>% 
  mutate(pct = n/sum(n), cpct = cumsum(pct)) %>% 
  filter(cpct <= 0.5) %>% 
  mutate(pct = scales::percent(pct)) %>% 
  mutate(library = factor(library, levels = rev(library))) %>% 
  ggplot(aes(n, library)) +
  geom_segment(aes(xend=0, yend=library), color = ft_cols$slate, size=3.5) +
  geom_text(
    aes(x = (n+max(n)*0.005), label = sprintf("%s (%s)", n, pct)), 
    hjust = 0, size = 3, family = font_rc, color = ft_cols$gray
  ) +
  scale_x_comma(position = "top", limits=c(0, 500)) +
  labs(
    x = "# package using the library", y = NULL,
    title = "Top 50% of libraries used across Homebrew formulae"
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_text(family = "mono"))

It seems openssl is pretty popular (not surprising but always good to see cybersecurity things at the top of good lists for a change)! macOS ships with an even more dreadful (I know that’s hard to imagine) default Python setup than usual so it being number 2 is not unexpected.

And, finally, we can also check on how frequently formulae are installed. Let’s look back on the last 90 days:

ggplot() +
  geom_density(
    aes(x = installs$count, y = stat(count)),
    color = ft_cols$slate, fill = alpha(ft_cols$slate, 1/2)
  ) +
  scale_x_comma("# install events", trans = "log10") +
  scale_y_comma("# formulae") +
  labs(
    title = "Homebrew Formulate 'Install Events' Distribution (Past 90 days)"
  ) +
  theme_ft_rc(grid="XY")

I’ll let you play with the package to find out who the heavy hitters are and explore more about the Homebrew ecosystem.

FIN

Kick the tyres. File issues & PRs and a hearty “Welcome!” to the Homebrew ecosystem for Linux and Windows users. My hope is that the WSL availability will eventually make it easier to develop for Windows systems and avoid the “download the kinda sketchy compiled windows libraries from github on package install” practice we have today.

If you crank out some analytics using the packages don’t forget to blog about it and drop a link in the comments!

Fast Static Maps Built with R

Luke Whyte posted an article (apologies for a Medium link) over on Towards Data Science showing how to use a command line workflow involving curl, node and various D3 libraries and javascript source files to build a series of SVG static maps. It’s well written and you should give it a read especially since he provides the code and data.

We can do all of that in R with the help of a couple packages and by using a free geocoding service which will also allow us to put more data on the map, albeit with some extra work due to it returning weird values for Hawaii locations.

library(albersusa) # git.sr.ht/~hrbrmstr/albersusa | git[la|hu]b/hrbrmstr/albersusa
library(rgeocodio) # git.sr.ht/~hrbrmstr/rgeocodio | git[la|hu]b/hrbrmstr/rgeocodio
library(tidyverse)

# the data url from the original blog
fil <- "https://query.data.world/s/awyahzfiikyoqi5ygpvauqslwmqltr"

read_csv(fil, col_types = "cd") %>% 
  select(area=1, pct=2) %>% 
  mutate(pct = pct/100) -> xdf # make percents proper percents

gc <- gio_batch_geocode(xdf$area)

The result of the geocoding is a data frame that has various confidences associated with the result. We’ll pick the top one for each and then correct for the errant Hawaii longitude it gives back:

map2_df(gc$query, gc$response_results, ~{
  out <- .y[1,,]
  out$area <- .x
  out
}) %>% 
  filter(!is.na(location.lat)) %>% 
  select(area, state = address_components.state, lat=location.lat, lon=location.lng) %>% 
  mutate(
    lat = ifelse(grepl("Honolu", area), 21.3069, lat),
    lon = ifelse(grepl("Honolu", area), -157.8583, lon)
  ) %>% 
  left_join(xdf) %>% 
  as_tibble() -> area_pct

area_pct
## # A tibble: 47 x 5
##    area                                        state   lat    lon   pct
##    <chr>                                       <chr> <dbl>  <dbl> <dbl>
##  1 McAllen-Edinburg-Mission, TX                TX     26.2  -98.1 0.102
##  2 Houston-The Woodlands-Sugar Land, TX        TX     29.6  -95.8 0.087
##  3 Santa Maria-Santa Barbara, CA               CA     34.4 -120.  0.081
##  4 Las Vegas-Henderson-Paradise, NV            NV     36.1 -115.  0.08 
##  5 Los Angeles-Long Beach-Anaheim, CA          CA     33.9 -118.  0.075
##  6 Miami-Fort Lauderdale-West Palm Beach, FL   FL     26.6  -80.1 0.073
##  7 Dallas-Fort Worth-Arlington, TX             TX     33.3  -98.4 0.069
##  8 Washington-Arlington-Alexandria, DC-VA-MD-… WV     39.2  -81.7 0.068
##  9 Bridgeport-Stamford-Norwalk, CT             CT     41.3  -73.1 0.067
## 10 San Jose-Sunnyvale-Santa Clara, CA          CA     37.4 -122.  0.065
## # … with 37 more rows

The albersusa package provides base maps with Alaska & Hawaii elided into a composite U.S. map. As such, we need to elide any points that are in Alaska & Hawaii:

us <- usa_composite()
us_map <- fortify(us, region="name")

hi <- select(filter(area_pct, state == "HI"), lon, lat)
(hi <- points_elided(hi))
area_pct[area_pct$state == "HI", c("lon", "lat")] <- hi

Then, it’s just a matter of using ggplot2:

ggplot() +
  geom_map(
    data = us@data, map=us_map,
    aes(map_id=name), 
    fill = "white", color = "#2b2b2b", size = 0.1
  ) +
  geom_point(
    data = area_pct, aes(lon, lat, size = pct), 
    fill = alpha("#b30000", 1/2), color = "#b30000", shape=21
  ) +
  ggalt::coord_proj(us_laea_proj) +
  scale_y_continuous(expand=c(0, 3)) +
  scale_radius(
    name = NULL, label = scales::percent_format(1)
  ) +
  labs(x = "Estimated percent of undocumented residents in U.S. metro areas. Source: Pew Research Center") +
  theme_minimal() +
  theme(axis.text = element_blank()) +
  theme(axis.title.x = element_text(hjust=0.5, size = 8)) +
  theme(axis.title.y = element_blank()) +
  theme(panel.grid = element_blank()) +
  theme(legend.position = c(0.9, 0.3)) -> gg

ggsave(filename = "map.svg", device = "svg", plot = gg, height = 5, width = 7)

Unlike the post’s featured image (which has to be a bitmap…grrr) the resultant SVG is below:

FIN

There is absolutely nothing wrong with working where you’re most comfortable and capable and Luke definitely wields the command line and javascript incredibly well. This alternate way of doing things in R may help other data journalists who are more comfortable in R or want to increase their R knowledge replicate and expand upon Luke’s process.

If you’ve got alternate ways of doing this in R or even (gasp) Python, drop a note in the comments with a link to your blog so folks who are comfortable neither in R nor the command line can see even more ways of producing this type of content.