Author Archives: hrbrmstr

(A general reminder about “R⁶” posts: they are heavy on code examples and minimal on exposition. I try to design them with 2-3 “nuggets” embedded for those who take the time to walk through the code examples on their systems. I’ll always provide further explanation if requested in a comment, so don’t hesitate to ask if something is confusing.)

I had to check something on the macOS systems across the abode today and — on a lark — decided to do all the “shell” scripting in R vs bash for a change. After performing the tasks, it occurred to me that not all R users on macOS realize there are hidden gems of information spread across the “boring” parts of the filesystem in SQLite databases. So, I put together a small example that:

  • identifies all the top-level apps in /Applications
  • extracts code signing information from them into a tibble
  • grabs the Gatekeeper whitelist database
  • uses an internal SQLite transformation function inside a dplyr chain
  • merges the info and gives you a basis to explore your apps
  • shows off the wicked cool processx package by @gaborcsardi that you should be using in place of system() / system2()

A quick note about this Gatekeeper database: if an app is not already recognized by macOS and you allow it to run, your local system is updated to “whitelist” the app so you don’t get the notification in the future. Apple maintains a large signature list that gets updated on your system throughout the year, and your own list is merged with it.

A second quick note (I guess I’m doing more exposition than promised :-) about QUOTE(): in certain dplyr / dbplyr contexts, what look like local function calls are actually passed over to the SQL side rather than evaluated locally. Even though quote() is a base R function, the lowercase version would still be sent over to the SQL side rather than eval’d locally. However, to avoid confusing others, I try to uppercase conflicts like this when they occur.
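
If you ever want to confirm what actually gets shipped to the database, dbplyr’s show_query() will print the generated SQL. Here’s a tiny sketch against a throwaway in-memory SQLite table (the table and column names are purely illustrative):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "blobs", data.frame(current = "not really a blob"))

tbl(con, "blobs") %>% 
  mutate(current = QUOTE(current)) %>% 
  show_query() # QUOTE() has no dbplyr translation, so it is passed through to SQLite verbatim

DBI::dbDisconnect(con)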

library(processx)
library(stringi)
library(docxtractr) # install_github("hrbrmstr/docxtractr")
library(tidyverse)

list.files("/Applications", pattern = "app$", full.names = TRUE) %>% 
  map_df(~{
    message(.x)
    res <- run("codesign", args = c("-dvvv", .x), error_on_status = FALSE)
    if (any(grepl("not signed at all", res$stderr))) {
      list(Executable = .x)
    } else {
      stri_split_lines(res$stderr)[[1]] %>% 
        keep(~grepl("=", .)) %>% 
        stri_split_fixed("=", 2, simplify = TRUE) -> res
      as.list(set_names(res[,2], res[,1]))
    }
  }) %>% 
  mcga() %>% 
  mutate(short_name = stri_replace_last_fixed(basename(executable), ".app", "")) -> my_apps

glimpse(my_apps)
## Observations: 102
## Variables: 20
## $ executable                  <chr> "/Applications/1Password 6.app/Con...
## $ identifier                  <chr> "com.agilebits.onepassword4", "com...
## $ format                      <chr> "app bundle with Mach-O thin (x86_...
## $ codedirectory_v             <chr> "20200 size=29594 flags=0x0(none) ...
## $ hash_type                   <chr> "sha256 size=32", "sha256 size=32"...
## $ candidatecdhash_sha1        <chr> "e21ac2a66473feec6276b448fc518678d...
## $ candidatecdhash_sha256      <chr> "8cf4cb4bdbea3b4d4f9e293e1aee1edb7...
## $ hash_choices                <chr> "sha1,sha256", "sha1,sha256", "sha...
## $ cdhash                      <chr> "8cf4cb4bdbea3b4d4f9e293e1aee1edb7...
## $ signature_size              <chr> "8915", "8936", "4610", "8528", "4...
## $ authority                   <chr> "Apple Root CA", "Apple Root CA", ...
## $ timestamp                   <chr> "Jul 17, 2017, 10:18:00 AM", "Jul ...
## $ info_plist_entries          <chr> "32", "31", "30", "17", "30", "18"...
## $ teamidentifier              <chr> "2BUA8C4S2C", "XZZXE9SED4", "94KV3...
## $ sealed_resources_version    <chr> "2 rules=12 files=2440", "2 rules=...
## $ internal_requirements_count <chr> "1 size=220", "1 size=192", "1 siz...
## $ signed_time                 <chr> NA, NA, "Jul 13, 2017, 6:40:13 PM"...
## $ library_validation_warning  <chr> NA, NA, NA, "OS X SDK version befo...
## $ platform_identifier         <chr> NA, NA, NA, NA, "2", NA, NA, "2", ...
## $ short_name                  <chr> "1Password 6", "Alfred 3", "Amazon...

# this is the macOS Gatekeeper whitelist db
db <- src_sqlite("/private/var/db/gkopaque.bundle/Contents/Resources/gkopaque.db")

db
## src:  sqlite 3.19.3 [/private/var/db/gkopaque.bundle/Contents/Resources/gkopaque.db]
## tbls: conditions, merged, whitelist
whitelist <- tbl(db, "whitelist")

# they are binary blobs, but we need them as text, so use SQLite's "QUOTE()" function
mutate(whitelist, current = QUOTE(current)) %>% 
  select(current) %>% 
  collect() %>% 
  mutate(
    cdhash = stri_replace_first_fixed(current, "X'", "") %>% 
      stri_replace_last_fixed("'", "") %>% 
      stri_trans_tolower()
  ) -> gatekeeper_whitelist_hashes

my_apps <- left_join(my_apps, gatekeeper_whitelist_hashes)

# Unsigned apps
filter(my_apps, is.na(cdhash)) %>% 
  select(short_name)

# you can see your own output here ;-)

# App organization
select(my_apps, identifier) %>% 
  mutate(
    identifier = stri_split_fixed(identifier, ".", simplify = TRUE)[,2],
    identifier = ifelse(identifier == "", "UNSPECIFIED", identifier)
  ) %>% 
  count(identifier, sort=TRUE)
## # A tibble: 47 x 2
##        identifier     n
##             <chr> <int>
##  1          apple    37
##  2    UNSPECIFIED    10
##  3      microsoft     6
##  4  eclecticlight     3
##  5         google     3
##  6         amazon     2
##  7      agilebits     1
##  8 antonycourtney     1
##  9       appscape     1
## 10   audacityteam     1
## # ... with 37 more rows


# Which apps have code-signing validation warnings
filter(my_apps, !is.na(library_validation_warning)) %>% 
  select(short_name, library_validation_warning)
## # A tibble: 7 x 2
##        short_name
##             <chr>
## 1    Amazon Music
## 2        Audacity
## 3         firefox
## 4         RStudio
## 5             VLC
## 6 Webcam Settings
## 7     WordService
## # ... with 1 more variables: library_validation_warning <chr>

NOTE: One could easily extend this example to look for apps across the filesystem.
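
A minimal sketch of that extension, using Spotlight via mdfind so you don’t have to walk the filesystem yourself (the query string is the standard UTI for application bundles — treat it as an assumption and verify on your system):

library(processx)
library(stringi)
library(purrr)

# ask Spotlight for every application bundle it knows about, not just /Applications
res <- run("mdfind", args = "kMDItemContentType == 'com.apple.application-bundle'")

all_apps <- keep(stri_split_lines(res$stdout)[[1]], nzchar)

# feed all_apps into the same map_df() pipeline used above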

The reticulate package provides a very clean & concise interface bridge between R and Python, which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to create parquet files directly from R, using reticulate as a bridge to the pyarrow module, which can natively write parquet files.

Now, you can create parquet files through R with Apache Drill — and I’ll provide another example of that here — but you may need to generate such files without the ability to run Drill.

The Python parquet process is pretty simple since you can convert a pandas DataFrame directly to a pyarrow Table which can be written out in parquet format with pyarrow.parquet. We just need to follow this process through reticulate in R:

library(reticulate)

pd <- import("pandas", "pd")
pa <- import("pyarrow", "pa")
pq <- import("pyarrow.parquet", "pq")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)
mtcars_tab <- pa$Table$from_pandas(mtcars_df)

pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))

I wouldn’t want to do that for ginormous data frames, but it should work pretty well for modest use cases (you’re likely using Spark, Drill, Presto or other “big data” platforms for creation of larger parquet structures). Here’s how we’d do that with Drill via the sergeant package:

readr::write_csv(mtcars, "~/Data/mtcars_r.csvh")
dc <- drill_connection("localhost")
drill_query(dc, "CREATE TABLE dfs.tmp.`/mtcars_r.parquet` AS SELECT * FROM dfs.root.`/Users/bob/Data/mtcars_r.csvh`")

Without additional configuration parameters, the reticulated-Python version (above) generates larger parquet files and also includes an index column, since Python DataFrames need one (ugh). On the other hand, small-ish data frames end up in a single file, whereas the Drill-created ones land in a directory with an additional CRC file (and are much smaller by default). NOTE: You can pass preserve_index=False to the call to Table.from_pandas to get rid of that icky index.
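
For the curious, dropping that index from reticulate is just a named argument on the call we already made above (re-using the pa / pq / mtcars_df objects from the earlier chunk):

mtcars_tab <- pa$Table$from_pandas(mtcars_df, preserve_index = FALSE)
pq$write_table(mtcars_tab, path.expand("~/Data/mtcars_python.parquet"))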

It’s fairly efficient even for something like nycflights13::flights which has ~330K rows and 19 columns:

system.time(
  r_to_py(nycflights13::flights) %>% 
  pd$DataFrame$from_dict() %>% 
  pa$Table$from_pandas() %>% 
  pq$write_table(where = "/tmp/flights.parquet")
)
##    user  system elapsed 
##   1.285   0.108   1.398 

If you need to generate parquet files in a pinch, reticulate seems to be a good way to go.

UPDATE (2018-01-25)

APIs change, and while the above still works, there’s a slightly simpler way now:

library(reticulate)

pd <- import("pandas", "pd")

mtcars_py <- r_to_py(mtcars)
mtcars_df <- pd$DataFrame$from_dict(mtcars_py)

mtcars_df$to_parquet(path.expand("~/Data/mtcars_python.parquet"), "pyarrow")

One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest:

If you load up that tweet and follow the thread, you’ll see a really good question by @kennethrose82 regarding what an appropriate setting should be for a delay between crawls.

The answer is a bit nuanced as there are some written and unwritten “rules” for those who would seek to scrape web site content. For the sake of brevity in this post, we’ll only focus on “best practices” (ugh) for being kind to web site resources when it comes to timing requests, after a quick mention that “Step 0” must be to validate that the site’s terms & conditions or terms of service allow you to scrape & use data from said site.

Robot Roll Call

The absolute first thing you should do before scraping a site is check out its robots.txt file. What’s that? Well, I’ll let you read about it first from the vignette of the package we’re going to use to work with it.

Now that you know what such a file is, you also likely know how to peruse it since the vignette has excellent examples. But, we’ll toss one up here for good measure, focusing on one field that we’re going to talk about next:

library(tidyverse)
library(rvest)

robotstxt::robotstxt("seobook.com")$crawl_delay %>% 
  tbl_df()
## # A tibble: 114 x 3
##          field       useragent value
##          <chr>           <chr> <chr>
##  1 Crawl-delay               *    10
##  2 Crawl-delay        asterias    10
##  3 Crawl-delay BackDoorBot/1.0    10
##  4 Crawl-delay       BlackHole    10
##  5 Crawl-delay    BlowFish/1.0    10
##  6 Crawl-delay         BotALot    10
##  7 Crawl-delay   BuiltBotTough    10
##  8 Crawl-delay    Bullseye/1.0    10
##  9 Crawl-delay   BunnySlippers    10
## 10 Crawl-delay       Cegbfeieh    10
## # ... with 104 more rows

I chose that site since it has many entries for the Crawl-delay field, which defines the number of seconds a given site would like your crawler to wait between scrapes. For the continued sake of brevity, we’ll assume you’re going to be looking at the * entry when you perform your own scraping tasks (even though you should be setting your own User-Agent string). Let’s make a helper function for retrieving this value from a site, adding in some logic to provide a default value if no Crawl-Delay entry is found and to optimize the experience a bit (note that I keep varying the case of crawl-delay when I mention it to show that the field key is case-insensitive; be thankful robotstxt normalizes it for us!):

.get_delay <- function(domain) {
  
  message(sprintf("Refreshing robots.txt data for %s...", domain))
  
  cd_tmp <- robotstxt::robotstxt(domain)$crawl_delay
  
  if (length(cd_tmp) > 0) {
    star <- dplyr::filter(cd_tmp, useragent=="*")
    if (nrow(star) == 0) star <- cd_tmp[1,]
    as.numeric(star$value[1])
  } else {
    10L
  }

}

get_delay <- memoise::memoise(.get_delay)

The .get_delay() function could be made a bit more bulletproof, but I have to leave some work for y’all to do on your own. So, why both .get_delay() and the get_delay() functions, and what is this memoise? Well, even though robotstxt::robotstxt() will ultimately cache (in-memory, so only in the active R session) the robots.txt file it retrieved (if it retrieved one) we don’t want to do the filter/check/default/return all the time since it just wastes CPU clock cycles. The memoise() operation will check which parameter was sent and return the value that was computed vs going through that logic again. We can validate that on the seobook.com domain:

get_delay("seobook.com")
## Refreshing robots.txt data for seobook.com...
## [1] 10

get_delay("seobook.com")
## [1] 10

You can now use get_delay() in a Sys.sleep() call before your httr::GET() or rvest::read_html() operations.
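
Putting that together, a minimal polite-scraping pattern looks something like this (the URLs are placeholders; swap in whatever you actually need to fetch):

library(httr)
library(rvest)
library(tidyverse)

urls <- c(
  "https://example.com/widgets/page-1",
  "https://example.com/widgets/page-2"
)

pages <- map(urls, ~{
  Sys.sleep(get_delay(parse_url(.x)$hostname)) # be kind between requests
  read_html(.x)
})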

Not So Fast…

Because you’re a savvy R coder and not some snake charmer, gem hoarder or go-getter, you likely caught the default 10L return value in .get_delay() and thought “Hrm… Where’d that come from?”.

I’m glad you asked!

I grabbed the first 400 robots.txt WARC files from the June 2017 Common Crawl, which ends up being ~1,000,000 sites. That sample ended up having ~80,000 sites with one or more CRAWL-DELAY entries. Some of those sites had entries that were not valid (in an attempt to either break, subvert or pwn a crawler) or set to a ridiculous value. I crunched through the data and made saner bins for the values to produce the following:

Sites seem to either want you to wait 10 seconds (or less) or about an hour between scraping actions. I went with the lower number purely for convenience, but would caution that this decision was based on the idea that your intention is to not do a ton of scraping (i.e. less than ~50-100 HTML pages). If you’re really going to do more than that, I strongly suggest you reach out to the site owner. Many folks are glad for the contact and could even arrange a better method for obtaining the data you seek.
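
If you want to do a rough version of that crunching on your own extract, the binning itself is straightforward. This sketch assumes a hypothetical cc_delays data frame with one value column of raw Crawl-delay strings (it is not the original analysis code):

cc_delays %>% 
  mutate(value = suppressWarnings(as.numeric(value))) %>% # non-numeric "gotcha" entries become NA
  filter(!is.na(value), value >= 0, value <= 86400) %>%   # drop absurd (> 1 day) values
  mutate(bin = case_when(
    value <= 10   ~ "10s or less",
    value <= 60   ~ "11-60s",
    value <= 3600 ~ "1-60 min",
    TRUE          ~ "over an hour"
  )) %>% 
  count(bin, sort = TRUE)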

FIN

So, remember:

  • check site ToS/T&C before scraping
  • check robots.txt before scraping (in general and for Crawl-Delay)
  • contact the site owner if you plan on doing a large amount of scraping
  • introduce some delay between page scrapes (even if the site does not have a specific Crawl-delay entry), using the insights gained from the Common Crawl analysis to inform your decision

I’ll likely go through all the Common Crawl robots.txt WARC archives to get a fuller picture of the distribution of values and either update this post at a later date or do a quick new post on it.

(You also might want to run robotstxt::get_robotstxt("rud.is") #justsayin :-)

It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. It further means that when support for a new file type is added, I get that same functionality without any extra effort.

Why am I calling this out?

Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0 and one feature they’ve added is the ability to query individual PCAP files — and entire directories full of them — from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR on them is that PCAP is an optimized binary file format for recording network activity. If you’re on macOS or a linux-ish system, go do something like this:

sudo tcpdump -ni en0 -s0 -w capture01.pcap

And, wait a bit.

NOTE: Some of you may have to change the en0 to your main network interface name (a quick google for that for your platform should get you to the right one to use).

That command will passively record all network activity on your system until you ctrl-c it. The longer it goes the larger it gets.

When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump or Wireshark, and there are many C[++] libraries and other utilities that can read them. You can even convert them to CSV or XML, but the PCAP itself requires custom tools to work with it effectively. I had started creating crafter to work with these files, but my use case/project dried up and I haven’t gone back to it.

Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:

library(sergeant)
library(iptools)
library(tidyverse)
library(cymruservices)

db <- src_drill("localhost")

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip          <chr> "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port        <int> 60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session     <dbl> -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length   <int> 129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data            <chr> "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address <chr> "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port        <int> 443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type            <chr> "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip          <chr> "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address <chr> "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp       <dttm> 2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>% 
  collect() %>% 
  summarise(max - min)
## # A tibble: 1 x 1
##     `max - min`
##          <time>
## 1 1.924583 mins

count(my_pcaps, type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type     n
##   <chr> <int>
## 1   TCP  4974
## 2   UDP   774

filter(my_pcaps, type=="TCP") %>% 
  count(dst_port, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   DrillConnection
## # Ordered by: desc(n)
##    dst_port     n
##       <int> <int>
##  1      443  2580
##  2    56202   476
##  3    56229   226
##  4    56147   169
##  5    56215   103
##  6    56143    94
##  7    56085    89
##  8    56203    56
##  9    56205    39
## 10    56209    39
## # ... with more rows

filter(my_pcaps, type=="TCP") %>% 
  count(dst_ip, sort=TRUE) %>% 
  collect() -> dst_ips

# drop internal (RFC 1918) destinations, then add AS/org info via Team Cymru lookups
filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip"="ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all=TRUE), c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
##            dst_ip     n                              as_name
##             <chr> <int>                                <chr>
##  1   104.244.42.2   862           TWITTER - Twitter Inc., US
##  2 104.244.46.103   556           TWITTER - Twitter Inc., US
##  3  104.20.60.241   183 CLOUDFLARENET - CloudFlare, Inc., US
##  4     31.13.80.8   160        FACEBOOK - Facebook, Inc., US
##  5  52.218.160.76   100     AMAZON-02 - Amazon.com, Inc., US
##  6  104.20.59.241    79 CLOUDFLARENET - CloudFlare, Inc., US
##  7  52.218.160.92    66     AMAZON-02 - Amazon.com, Inc., US
##  8  199.16.156.81    58           TWITTER - Twitter Inc., US
##  9 104.244.42.193    47           TWITTER - Twitter Inc., US
## 10  52.86.113.212    42    AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows

No custom R code. No modification to the sergeant package. Just query it like any other data source.

One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.

I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out) and, I setup a directory that defaults to the pcap type. But after that, it just works.

Well, kinda.

The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And, my sergeant package (for now) can’t do much with the data component (neither can Drill proper). But, it’s a great start and I can use it to do bulk parquet file creation of basic protocol & connection information, or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.

Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And, much work needs to be done on the UDF-side and sergeant side to help make the data element more useful.

Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool and much thanks goes out to the Drill dev team for sneaking this in to 1.11.0.

If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away into a javascript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there vs re-scrape Wikipedia (like Matt did) thanks to the V8 package by @opencpu.

The following code:

  • grabs the vis iframe
  • extracts and evaluates the target javascript to get a nice data frame
  • performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
  • plots the distributions using the beeswarm quasirandom algorithm

library(V8)
library(rvest)
library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

pg <- read_html("http://mattstiles.org/dailygraphics/graphics/attorney-general-tenure-20172517/child.html?initialWidth=840&childId=pym_0&parentTitle=Chart%3A%20If%20Ousted%2C%20Jeff%20Sessions%20Would%20Have%20a%20Historically%20Short%20Tenure%20%7C%20The%20Daily%20Viz&parentUrl=http%3A%2F%2Fthedailyviz.com%2F2017%2F07%2F25%2Fchart-if-ousted-jeff-sessions-would-have-a-historically-short-tenure%2F")

ctx <- v8()
ctx$eval(html_nodes(pg, xpath=".//script[contains(., 'DATA')]") %>% html_text())

ctx$get("DATA") %>% 
  as_tibble() %>% 
  readr::type_convert() %>% 
  mutate(party = ifelse(is.na(party), "Other", party)) %>% 
  mutate(party = fct_lump(party)) %>% 
  mutate(color1 = case_when(
    party == "Democratic" ~ "#313695",
    party == "Republican" ~ "#a50026",
    party == "Other" ~ "#4d4d4d")
  ) %>% 
  mutate(color2 = ifelse(grepl("Sessions", label), "#2b2b2b", "#00000000")) -> ags

ggplot() + 
  geom_quasirandom(data = ags, aes(party, amt, color = color1)) +
  geom_quasirandom(data = ags, aes(party, amt, color = color2), 
                   fill = "#ffffff00", size = 4, stroke = 0.25, shape = 21) +
  geom_text(data = data_frame(), aes(x = "Republican", y = 100, label = "Jeff Sessions"), 
            nudge_x = -0.15, family = font_rc, size = 3, hjust = 1) +
  scale_color_identity() +
  scale_y_comma(limits = c(0, 4200)) +
  labs(x = "Party", y = "Tenure (days)", 
       title = "U.S. Attorneys General",
       subtitle = "Distribution of tenure in office, by days & party: 1789-2017",
       caption = "Source data/idea: Matt Stiles <bit.ly/2vXAHTM>") +
  theme_ipsum_rc(grid = "XY")

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.

I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon.

sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill, and if you have Drill UDFs that don’t work “out of the box” with sergeant’s dplyr interface, file an issue and I’ll make a special one for it in the package.
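
If you’re new to the package, here’s a minimal taste of the dplyr and query interfaces (this assumes a Drill instance running locally and uses the cp.`employee.json` sample data source that ships with Drill):

library(sergeant)
library(tidyverse)

# dplyr/dbplyr interface
db <- src_drill("localhost")

tbl(db, "cp.`employee.json`") %>% 
  count(position_title, sort = TRUE)

# straight SQL via the query interface
dc <- drill_connection("localhost")
drill_query(dc, "SELECT full_name, position_title FROM cp.`employee.json` LIMIT 5")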

I’ve written about Drill on the blog before, so check out those posts for some history and stay tuned for more examples. The README should get you started using sergeant and/or Drill (if you aren’t running Drill now, take a look and you’ll likely get hooked).

I’d like to take a moment to call out special thanks to Edward Visel for bootstrapping the dbplyr update to sergeant when the dplyr/dbplyr interfaces split. It saved me loads of time and really helped the progress of this package move faster towards a CRAN release.

I couldn’t let this stand unchallenged:

Rasmussen makes their Presidential polling data available for both Trump & Obama. Why not compare their ratings from day 1 in office (skipping days that Rasmussen doesn’t poll)?


library(hrbrthemes)
library(rvest)
library(tidyverse)

list(
  Obama="http://m.rasmussenreports.com/public_content/politics/obama_administration/obama_approval_index_history",
  Trump="http://m.rasmussenreports.com/public_content/politics/trump_administration/trump_approval_index_history"
) %>%
  map_df(~{
    read_html(.x) %>%
      html_table() %>%
      .[[1]] %>%
      tbl_df() %>%
      select(date=Date, approve=`Total Approve`, disapprove=`Total Disapprove`)
  }, .id="who") -> ratings

mutate_at(ratings, c("approve", "disapprove"), function(x) as.numeric(gsub("%", "", x, fixed=TRUE))/100) %>%
  mutate(date = lubridate::dmy(date)) %>%
  filter(!is.na(approve)) %>%
  group_by(who) %>%
  arrange(date) %>%
  mutate(dnum = 1:n()) %>%
  ungroup() %>%
  ggplot(aes(dnum, approve, color=who)) +
  geom_hline(yintercept = 0.5, size=0.5) +
  geom_point(size=0.25) +
  scale_y_percent(limits=c(0,1)) +
  scale_color_manual(name=NULL, values=c("Obama"="#313695", "Trump"="#a50026")) +
  labs(x="Day in office", y="Approval Rating",
       title="Presidential approval ratings from day 1 in office",
       subtitle="For fairness, data was taken solely from Trump's favorite polling site (Rasmussen)",
       caption="Data Source: <rasmussenreports.com>\nCode: <https://gist.github.com/hrbrmstr/a7310e1b64d0797401d01d0c6195bd8b>") +
  theme_ipsum_rc(grid="XY", base_size = 16) +
  theme(legend.direction = "horizontal") +
  theme(legend.position=c(0.8, 1.05))

I’ll make a new post occasionally throughout Trump’s term.

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.

If you have enough memory and wanted to work with “flattened” versions of the files in R you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using) and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.
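
(For completeness, the ndjson route really is a one-liner; the path below is a placeholder for wherever you unpacked the Yelp archive:)

library(ndjson)
library(tidyverse)

# flatten one of the Yelp NDJSON files into a wide, rectangular structure
reviews <- stream_in("~/Data/yelp/review.json")

glimpse(reviews)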

The main reasons for replicating their Yelp example were to build a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL-to-R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then manipulate with dplyr verbs.

The full tutorial replication is at https://rud.is/rpubs/yelp.html but also iframe’d below.