International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way to use the web scraping powers of R responsibly to collect and explore data on modern-day pirate encounters.

Scouring The Seas Web For Pirate Data

Interestingly enough, there are many sources for pirate data. I’ve blogged about a few in the past, but I came across a new (to me) one from the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

(site png snapshot taken with splashr)

I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

Time to pillage some details!

But…Can We Really Do It?

I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/

Ahoy! We’ve got a license to pillage!
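We can also ask robotstxt directly whether specific target paths are fair game. Here’s an illustrative spot check with robotstxt::paths_allowed():

robotstxt::paths_allowed(
  paths = "/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345",
  domain = "www.icc-ccs.org"
)
# expect TRUE, since none of the Disallow rules above touch these paths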

But, we don’t have a license to abuse their site.

While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet, I have done enough work on it to know that a 5- or 10-second delay is the most common setting (when sites bother to include that directive in their robots.txt file at all). ICC’s site does not define one, but we’ll still pirate crawl responsibly and use a 5-second delay between requests:

s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")

There are more “safety” measures you can use with httr::GET(), but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.
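If you want a bit more belt-and-suspenders, httr config helpers such as user_agent() and timeout() can be passed along as well. This is purely illustrative (the contact string below is a placeholder):

s_GET(
  target_urls[1],
  user_agent("pirate-data-research; you@example.com"), # identify yourself to the site
  timeout(30)                                          # don't hang forever on a dead connection
)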

I also like to save off the crawl results so I can go back to the raw files (if needed) vs re-scraping the site (this crawl takes a while). I do it two ways here: first using raw httr response objects (including any “broken” ones), then filtering out the “complete” responses and saving them in WARC format, which is a more common format for sharing with others who may not use R.

Digging For Treasure

Did I mention that while the site looks like it’s easy to scrape, it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary sailors crawlers in a pit of despair. The UX is built dynamically from on-page javascript content, a portion of which is below:

Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the main code up first and then dig into the details:

# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# spin up a V8 context so we can evaluate the on-page javascript below
ctx <- v8()

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines, do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turning the ugly returned list content into something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")

The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not uniform across all years, so this helps keep the data wrangling idiom generic.
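For example, here’s what mfga() does to a handful of raw labels (note the duplicate handling):

mfga(c("Attack Number", "Date:", "Date:", "Type of Attack"))
## [1] "attack_number"  "date"           "date_1"         "type_of_attack"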

Next, I set up a cols object because we’re going to be extracting the data from text as text, and I think it’s cleaner to type_convert() at the end than to have a slew of as.numeric() (et al.) statements sprinkled through the code for small munging tasks. You’ll note that at the end of the munging pipeline I still need to do some manual conversions.
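Here’s a toy illustration of that idea (made-up values): keep everything as character while wrangling, then convert the types in one pass at the end.

tibble(id = c("1", "2"), date = c("2017-06-01", "2017-07-04")) %>%
  type_convert(col_types = cols(id = col_integer(), date = col_date(format = "")))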

Now we can iterate over the good (complete) responses.

The purrr::safely() wrapper shoves the real httr response under result, so we focus on that, then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from that evaluation.
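Here’s that V8 round-trip in miniature, using made-up data rather than the site’s actual payload:

ctx <- v8()
ctx$eval('var dat = [{ "label": "Date:", "value": "2017-01-01" }];')
str(ctx$get("dat", flatten = TRUE))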

Because ICC has used the same Joomla plugin over the years, the data format is largely uniform, but individual records can contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to avoid duplicate field names as well.

That whole process goes really quickly, but why not save off the clean data at the end for good measure?

Gotta Have A Pirate Map

Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:

mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")

And, finally, a map showing pirate encounters but colored by year:

world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")

Taking Up The Mantle of the Dread Pirate Hrbrmstr

Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on GitHub, and don’t hesitate to ask questions in the comments or file an issue. If you make something, blog it! Share your ideas and creations with the rest of the R (or other language) communities!

I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet:

Since there aren’t nearly enough sf and geom_sf examples out on the wild, wild web, here’s a short one that shows how to do basic sf operations, including how to plot sf objects in ggplot2 and animate a series of them with magick.

I’m hoping someone riffs off of this to make an interactive version with Shiny. If you do, definitely drop a link+note in the comments!

(If folks want some exposition here, let me know in the comments and I’ll rig up an R Markdown document with more step-by-step details.)

Full RStudio project file (with pre-cached data) is on GitHub.

library(rprojroot)
library(sf)
library(magick)
library(tidyverse) # NOTE: Needs github version of ggplot2

root <- find_rstudio_root_file()

# "borrow" the files from SmokyMountains.com, but be nice and cache them to
# avoid hitting their web server for every iteration

c("https://smokymountains.com/wp-content/themes/smcom-2015/to-delete/js/us.json",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage2.tsv",
  "https://smokymountains.com/wp-content/themes/smcom-2015/js/foliage-2017.csv") %>%
  walk(~{
    sav_tmp <- file.path(root, "data", basename(.x))
    if (!file.exists(sav_tmp)) download.file(.x, sav_tmp)
  })

# next, we read in the GeoJSON file twice. first, to get the states
states_sf <- read_sf(file.path(root, "data", "us.json"), "states", stringsAsFactors = FALSE)

# we only want the continental US
states_sf <- filter(states_sf, !(id %in% c("2", "15", "72", "78")))

# it doesn't have a CRS so we give it one
st_crs(states_sf) <- 4326

# I ran into hiccups using coord_sf() to do this, so we convert it to Albers here
states_sf <- st_transform(states_sf, 5070)


# next, we read in the counties
counties_sf <- read_sf(file.path(root, "data", "us.json"), "counties", stringsAsFactors = FALSE)
st_crs(counties_sf) <- 4326
counties_sf <- st_transform(counties_sf, 5070)

# now, we read in the foliage data
foliage <- read_tsv(file.path(root, "data", "foliage-2017.csv"),
                    col_types = cols(.default=col_double(), id=col_character()))

# and, since we have a lovely sf tidy data frame, bind it together
left_join(counties_sf, foliage, "id") %>%
  filter(!is.na(rate1)) -> foliage_sf

# now, we do some munging so we have better labels and so we can
# iterate over the weeks
gather(foliage_sf, week, value, -id, -geometry) %>%
  mutate(value = factor(value)) %>%
  filter(week != "rate1") %>%
  mutate(week = factor(week,
                       levels=unique(week),
                       labels=format(seq(as.Date("2017-08-26"),
                                         as.Date("2017-11-11"), "1 week"),
                                     "%b %d"))) -> foliage_sf

# now we make a ggplot object for each week and save it out to a png
pb <- progress_estimated(nlevels(foliage_sf$week))
walk(1:nlevels(foliage_sf$week), ~{

  pb$tick()$print()

  xdf <- filter(foliage_sf, week == levels(week)[.x])

  ggplot() +
    geom_sf(data=xdf, aes(fill=value), size=0.05, color="#2b2b2b") +
    geom_sf(data=states_sf, color="white", size=0.125, fill=NA) +
    viridis::scale_fill_viridis(
      name=NULL,
      discrete = TRUE,
      labels=c("No Change", "Minimal", "Patchy", "Partial", "Near Peak", "Peak", "Past Peak"),
      drop=FALSE
    ) +
    labs(title=sprintf("Foliage: %s ", unique(xdf$week))) +
    ggthemes::theme_map() +
    theme(panel.grid=element_line(color="#00000000")) +
    theme(panel.grid.major=element_line(color="#00000000")) +
    theme(legend.position="right") -> gg

  ggsave(sprintf("%02d.png", .x), gg, width=5, height=3)

})

# we read them all back in and animate the foliage
sprintf("%02d.png", 1:nlevels(foliage_sf$week)) %>%
  map(image_read) %>%
  image_join() %>%
  image_animate(1)

insert(post, "{ 'standard_disclaimer' : 'My opinion, not my employer\'s' }")

This is a post about the fictional company FredCo. If the context or details presented by the post seem familiar, it’s purely coincidental. This is, again, a fictional story.

Let’s say FredCo had a pretty big breach that (fictionally) garnered media, Twitterverse, tech-world and Government-level attention and that we have some spurious details that let us sit back in our armchairs to opine about. What might have helped create the debacle at FredCo?

Despite (fictional) endless mainstream media coverage and a good chunk of ‘on background’ infosec-media clandestine blatherings we know very little about the breach itself (though it’s been fictionally, officially blamed on failure to patch Apache Struts). We know even less (fictionally officially) about the internal reach of the breach (apart from the limited consumer impact official disclosures). We know even less than that (fictionally officially) about how FredCo operates internally (process-wise).

But, I’ve (fictionally) seen:

  • a detailed breakdown of the number of domains, subdomains, and hosts FredCo “manages”.
  • the open port/service configurations of the public components of those domains
  • public information from individuals who are more willing to (fictionally) violate the CFAA than I am to get more than just port configuration information
  • a 2012/3 SAS 1 Type II report about FredCo controls
  • testimonies from FredCo execs regarding efficacy of $SECURITY_TECHNOLOGY and 3 videos purporting to be indicative of expert opinion on how to use BIIGG DATERZ to achieve cybersecurity success
  • the board & management structure + senior management bonus structures, complete with incentive-based objectives they were graded on

so, I’m going to blather a bit about how this fictional event should finally tear down the Potemkin village that is the combination of the Regulatory+Audit Industrial Complex and the Cybersecurity Industrial Complex.

“Tear down” with respect to the goal being to help individuals understand that a significant portion of organizations you entrust with your data are not incentivized or equipped to protect your data and that these same conditions exist in more critical areas — such as transportation, health care, and critical infrastructure — and you should expect a failure on the scale of FredCo — only with real, harmful impact — if nothing ends up changing soon.

From the top

There is boilerplate mention of “security” in the senior executives’ objectives in both the 2015 & 2016 14A filings:

  • CEO: “Employing advanced analytics and technology to help drive client growth, security, efficiency and profitability.”
  • CFO: “Continuing to advance and execute global enterprise risk management processes, including directing increased investment in data security, disaster recovery and regulatory compliance capabilities.”
  • CLO: “Continuing to refine and build out the Company’s global security organization.”
  • President, Workforce Solutions: None
  • CHRO: None
  • President – US Information Services: None

You’ll be happy to know that they all received either “Distinguished” or “Exceeds” on their appraisals and earned a multiplier of their bonus & compensation targets as a result.

Furthermore, there is no one in the make-up of FredCo’s board of directors who has shown an interest or specialization in cybersecurity.

From the camera-positioned 50-yard line on instant replay, the board and shareholders of FredCo did not think protection of your identity and extremely personal information was important enough to include in three top executives’ directives and performance measures, and gave it little more than boilerplate mention for the others. Investigators who look into FredCo’s breach should dig deep into the last decade of the detailed measures for these objectives. I have first-hand experience with how these types of HR processes are managed in large orgs, which is why I’m encouraging this area for investigation.

“Security” is a terrible term, but it only works when it is an emergent property of the business processes of an organization. That means it must be contextual for every worker. Some colleagues suggest individual workers should not have to care about cybersecurity when making decisions or doing work, but even minimum-wage retail and grocery store clerks are educated about shoplifting risks and are given tools, tips and techniques to prevent loss. When your HR organization is not incentivized to help create and maintain a cybersecurity-aware culture from the top, you’re going to have problems, and when there are no cybersecurity-oriented targets for the CIO or even business process owners, don’t expect your holey screen door to keep out predators.

Awwwdit, Part I

NOTE: I’m not calling out any particular audit organization as I’ve only seen one fictional official report.

The Regulatory+Audit Industrial Complex is a lucrative business cabal. Governments and large business meta-agencies create structures where processes can be measured, verified and given a big green ✅. This validation exercise is generally done in one or more ways:

  • simple questionnaire, very high level questions, no veracity validation
  • more detailed questionnaire, mid-level questions, usually some in-person lightweight checking
  • detailed questionnaire, but with topics that can be sliced-and-diced by the legal+technical professions to mean literally anything, measured in-person by (usually) extremely junior reviewers with little-to-no domain expertise who follow review playbooks, get overwhelmed with log entries and scope-refinement+reduction and who end up being steered towards “important” but non-material findings

Sure, there are good audits and good auditors, but I will posit they are the rare diamonds in a bucket of zirconia.

We need to cover some technical ground before covering this further, though.

Shocking Struts

We’ll take the stated breach cause at face value: failure to patch a remotely accessible vulnerability in Apache Struts. This was presented as the singular issue enabling attackers to walk (with crutches) away with scads of identity-theft-enabling personal data, administrator passwords, database passwords, and the recipe for the winning entry in the macaroni salad competition at last year’s HR annual picnic. Who knew one Java library had so much power!

We don’t know the architecture of all the web apps at FredCo. However, your security posture should not be a Jenga game tower, easily destroyed by removing one peg. These are all (generally) components of externally-facing applications at the scale of FredCo:

  • routers
  • switches
  • firewalls
  • load balancers
  • operating systems
  • application servers
  • middleware servers
  • database servers
  • customized code

These are mimicked (to varying levels of efficacy) across:

  • development
  • test
  • staging
  • production

environments.

They may coexist (in various layers of the network) with:

  • HR systems
  • Finance systems
  • Intranet servers
  • Active Directory
  • General user workstations
  • Executive workstations
  • Developer workstations
  • Mobile devices
  • Remote access infrastructure (i.e. VPNs)

A properly incentivized organization ensures there are logical and physical separation between/isolation of “stuff that matters” and that varying levels of authentication & authorization are applied to ensure access is restricted.

Keeping all that “secure” requires:

  • managing thousands of devices (servers, network components, laptops, desktops, mobile devices)
  • managing thousands of identities
  • managing thousands of configurations across systems, networks and devices
  • managing hundreds to thousands of connections between internal and external networks
  • managing thousands of rules
  • managing thousands of vulnerabilities (as they become known)
  • managing a secure development life cycle across hundreds or thousands of applications

Remember, though, that FredCo ostensibly managed all of that well and the data loss was solely due to one Java library.

If your executives (all of them) and workers (all of them) are not incentivized with that list in mind, you will have problems. But let’s put those security challenges back in the context of the audit role.

Awwwdit, Part II

The post is already long, so we’ll make this quick.

If I dropped you off — yes, you, because you’re likely as capable as the auditors mentioned in the previous section — into that environment once a year, do you think you’d be able to ferret out issues based on convoluted network diagrams, poorly documented firewall rules and source code, and non-standard checklists of user access management processes?

Let’s say I dropped you in months before the known Struts vulnerability and re-answer the question.

The burden placed on internal and — especially — external auditors is great and they are pretty much set up for failure from engagement number one.

Couple IT complexity with the fact that many orgs like FredCo aren’t required to do more than ensure financial reporting processes are ✅.

But, even if there were more technical, security-oriented audits performed, you’d likely have ten different report findings by as many firms or auditors, especially if they were point-in-time audits. Furthermore, FredCo has had decades of point-in-time audits by hundreds of auditors and dozens of firms. The conditions of the breach were likely not net-new, so how did decades of systemic IT failures go unnoticed by this cabal?

IT audit functions are a multi-billion dollar business. FredCo is partially the result of the built-in cracks in the way verification is performed in orgs. In other words, I posit the Regulatory+Audit Industrial Complex bears some of the responsibility for FredCo’s breach.

Divisive Devices

From the (now removed) testimonials & videos, it was clear there may have been a “blinky light” problem in the mindset of those responsible for cybersecurity at FredCo. Relying solely on the capabilities of one or more devices (they are usually appliances with blinky lights) and thinking that storing petabytes of log data is going to stop “bad guys” is a great recipe for a breach parfait.

But, the Cybersecurity Industrial Complex continues to dole out LED-laden boxes with the fervor of a U.S. doctor handing out opioids. Sure, they are just giving orgs what they want, but that doesn’t make it responsible behaviour. Just like the opioid problem, the “device” issue is likely causing cyber-sickness in more organizations than you’d like to admit. You may even know someone who works at an org with a box addiction.

I posit the Cybersecurity Industrial Complex bears some of the responsibility for FredCo’s breach, especially when you consider the hundreds of marketing e-mails I’ve seen post-FredCo breach telling me how CyberBox XJ9-11 would have stopped FredCo’s attackers cold.

A Matter of Trust

If removing a Struts peg from FredCo’s IT Jenga board caused the fictional tower to crash:

  • What do you think the B2B infrastructure looks like?
  • How do you think endpoints are managed?
  • What isolation, segmentation and access controls really exist?
  • How effective do you think their security awareness program is?
  • How many apps are architected & managed as poorly as the breached one?
  • How many shadow IT deployments exist in the ☁️ with your data in it?
  • How can you trust FredCo with anything of importance?

Fictional FIN

In this fictional world I’ve created, one ending is:

  • all B2B connections to FredCo have been severed
  • lawyers at a thousand firms are working on language for filings to cancel all B2B contracts with FredCo
  • FredCo was de-listed from exchanges
  • FredCo executives are defending against a slew of criminal and civil charges
  • The U.S. Congress and U.K. Parliament have come together to undertake a joint review of regulatory and audit practices spanning both countries (since it impacted both countries and the Reg+Audit cabal spans both countries they decided to save time and money) resulting in sweeping changes
  • The SEC has mandated detailed cybersecurity objectives be placed on all senior management executives at all public companies and have forced results of those objectives assessments to be part of a new filing requirement.
  • The SEC has also mandated that at least one voting board member of public companies must have demonstrated experience with cybersecurity
  • The FTC creates and enforces standards on cybersecurity product advertising practices
  • You have understood that nobody has your back when it comes to managing your sensitive, personal data and that you must become an active participant in helping to ensure your elected representatives hold all organizations accountable when it comes to taking their responsibilities seriously.

but, another is:

  • FredCo’s stock bounces back
  • FredCo loses no business partners
  • FredCo’s current & former execs faced no civil or criminal charges
  • Congress makes a bit of opportunistic, temporary bluster for the sake of 2018 elections but doesn’t do anything more than berate FredCo publicly
  • You’re so tired of all these breaches and data loss that you go back to playing “Clash of Clans” on your mobile phone and do nothing.

I’ve blathered about trust before 1 2, but said blatherings were in a “what if” context. Unfortunately, the if has turned into a when, which begged for further blathering on a recent FOSS ecosystem cybersecurity incident.

The gg_spiffy @thomasp85 linked to a post by the SK-CSIRT detailing the discovery and take-down of a series of malicious Python packages. Here’s their high-level incident summary:

SK-CSIRT identified malicious software libraries in the official Python package
repository, PyPI, posing as well known libraries. A prominent example is a fake
package urllib-1.21.1.tar.gz, based upon a well known package
urllib3-1.21.1.tar.gz.
Such packages may have been downloaded by unwitting developer or administrator
by various means, including the popular “pip” utility (pip install urllib).
There is evidence that the fake packages have indeed been downloaded and
incorporated into software multiple times between June 2017 and September 2017.

Words are great but, unlike some other FOSS projects (*cough* R *cough*), the PyPI folks have authoritative log data regarding package downloads from PyPI. This means we can begin to quantify the exposure. The Google BigQuery SQL was pretty straightforward:

SELECT timestamp, file.project as package, country_code, file.version AS version
FROM (
  (TABLE_DATE_RANGE([the-psf:pypi.downloads], TIMESTAMP('2016-01-22'), TIMESTAMP('2017-09-15')))
)
WHERE file.project IN ('acqusition', 'apidev-coop', 'bzip', 'crypt', 'django-server',
                       'pwd', 'setup-tools', 'telnet', 'urlib3', 'urllib')

Let’s see what the daily downloads of the malicious packages look like:

Thanks to Curtis Doty (@dotysan on GH) I learned that the BigQuery table can be further filtered to exclude mirror-to-mirror traffic. The data for that is now in the GH repository and the chart in this callout shows that the exposure was very, very (very) limited:

But, we need counts of the mal-package doppelgangers (i.e. the good packages) to truly understand the scope of exposure:

Thankfully, the SK-CSIRT folks caught this in time and the exposure was limited. But those are some popular tools that were targeted and it’s super-easy to sneak these into requirements.txt and scripts since the names are similar and the functionality is duplicated.

I’ll further note that the crypt package was “good” at some point in time, then went away and was replaced with the nefarious one. That seems like a pretty big PyPI oversight (vis-a-vis package retirement & name re-use), but I’m not casting stones. R’s devtools::install_github() and wanton source()ing are just as bad, and the non-CRAN ecosystem is an even more varmint-prone “wild west” environment.

Furthermore, this is a potential exposure issue in many FOSS package repository ecosystems. On the one hand, these are open environments with tons of room for experimentation, creativity and collaboration. On the other hand, they’re all-too-easy targets for malicious hackers to prey upon.

I, unfortunately, have no quick-fix solutions to offer. “Review your code and dependencies” is about the best I can suggest until individual ecosystems work on better integrity & authenticity controls or there is a cross-ecosystem effort to establish “best practices” and perhaps even staffed, verified, audited, free services that work like a sheriff+notary to help ensure the safety of projects relying on open source components.

Python folks: double check that you weren’t a victim here (it’s super easy to type some of those package names wrong, and hopefully you’ve noticed builds failing if you had done so).

R folks: don’t be smug, watch your GitHub dependencies and double check your projects.
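If you want to do that Python double-check from R, here’s one (purely illustrative) way, assuming pip is on your PATH:

# names taken from the BigQuery filter above
bad <- c("acqusition", "apidev-coop", "bzip", "crypt", "django-server",
         "pwd", "setup-tools", "telnet", "urlib3", "urllib")

installed <- system2("pip", c("list", "--format=freeze"), stdout = TRUE)
intersect(tolower(sub("==.*$", "", installed)), bad) # anything returned is bad news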

You can find the data and the scripts used to generate the charts (ironically enough) on GitHub.

Finally: I just want to close with a “thank you!” to PyPI’s Donald Stufft who (quickly!) pointed me to a blog post detailing the BigQuery setup.

I’ve blogged about my in-development R package hgr before, and it’s slowly getting to a CRAN release. There are two new features that are more useful in an interactive session than in a programmatic context. Since they build on each other, we’ll take them in order.

New S3 print() Method

Objects created with hgr::just_the_facts() used to be just list objects. Technically they are still list objects but they are also classed as hgr objects. The main reason for this was to support the new default print() method.

When you print() an hgr object, the new S3 method will extract the $content part of the object and pass it through some htmltools functions to display the readability-enhanced content in a browser (whatever R is configured to use for that on your system; if you read this blog you likely use RStudio, so for most folks it will show up in the Viewer pane). This enables visual inspection of the content and (as previously stated) is pretty much only useful in an interactive context.
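Conceptually, the method boils down to something like this (a minimal sketch, not the package’s actual code):

print.hgr <- function(x, ...) {
  htmltools::html_print(htmltools::HTML(x$content)) # render $content in the viewer/browser
  invisible(x)
}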

Rather than show you that now, we’ll see it in the context of the next feature.

‘Just The Facts’ RStudio Addin

I’m going to break with tradition and use a Medium post as an example, mostly because one of the things I detest about that content platform is just how little reading I can do on it (ironic since they touted things like typography when it launched). We’ll use this site as the example (an interesting JS article @timelyportfolio RT’d):

Now, that’s full-screen on a 13″ 2016 MacBook Pro and is pretty much useless. Now, I could (and, would normally) just use the Mercury Reader extension to strip away the cruft:

But, what if I wanted the content itself in R (for further processing) or didn’t feel like using (or policy prohibits using) a Chrome extension? Enter: “Just The Facts”, RStudio Addin edition.

Just copy a text URL to the clipboard and choose the ‘Just The Facts’ addin:

and, you’ll get a few items as a result:

  • code executed in your R console that will…
  • add a new, unique object in your global environment with the results of a call to hgr::just_the_facts() on the copied URL, and…
  • the object auto-print()ed in the console, which…
  • displays the content in your RStudio Viewer
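Under the hood, the plumbing for an addin like that could look roughly like the following. This is a hypothetical sketch: the object-naming scheme and the clipr/rstudioapi calls are my assumptions, not the package’s actual implementation.

jtf_addin <- function() {
  url <- clipr::read_clip()                                     # the copied URL
  obj <- sprintf("jtf_%s", format(Sys.time(), "%Y%m%d_%H%M%S")) # unique-ish object name
  rstudioapi::sendToConsole(
    sprintf('%s <- hgr::just_the_facts("%s"); %s', obj, url, obj),
    execute = TRUE                                              # run it in the user's console
  )
}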

An image (or two) is definitely worth a 1,000 (ok, 48) words:

FIN

I’m likely going to crank out another addin that works more like a browser (with a URL bar in-Viewer) but also keeps an in-environment log of requests so you can use or archive them afterwards.

Take it for a spin and don’t be shy in the GH issues page!

Putting this here to make it easier for others who try to Google this topic to find it without having to dig up and tediously search through other UDFs (user-defined functions).

I was/am making a custom UDF for base64 decoding/encoding and ran into:

SYSTEM ERROR: IndexOutOfBoundsException: index: 0, length: #### (expected: range(0, 256))

It’s incredibly easy to “fix” (and, if my Java weren’t so rusty I’d likely have seen it sooner) but I found this idiom in the spatial UDFs for Drill that enables increasing the default buffer size:

buffer = out.buffer = buffer.reallocIfNeeded(outputSize);

Hopefully this will prevent someone else from spinning a few minutes trying to tackle this use case. I had even looked at the source for the DrillBuf class and did not manage to put 2 + 2 together for some reason.

Earlier this year, the GDELT Project released their Television Explorer, which enabled API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables” which summarise what the “top” topics and phrases are across news stations every fifteen minutes. You should read that (long-ish) intro as there are many caveats to the data source, and I’ve also found that the files aren’t always available (i.e. there are often gaps when retrieving a sequence of files).

The R newsflash package has been able to work with the GDELT Television Explorer API since the inception of the service. It now has the ability to work with this new “top topics” resource directly from R.

There are two interfaces to the top topics, but I’ll show you the easiest one to use in this post. Let’s chart the top 25 topics per day for the past ~3 days (this post was generated ~mid-day 2017-09-09).

To start, we’ll need the data!

We provide start and end POSIXct times in the current time zone (the top_trending_range() function auto-converts to GMT which is how the file timestamps are stored by GDELT). The function takes care of generating the proper 15-minute sequences.
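Just to illustrate the underlying idea (this is not the package’s internal code), the file timestamps march along in 15-minute GMT steps:

head(seq(as.POSIXct("2017-09-07 00:00:00", tz = "GMT"),
         as.POSIXct("2017-09-07 02:00:00", tz = "GMT"),
         by = "15 mins"))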

library(newsflash) # devtools::install_github("hrbrmstr/newsflash")
library(hrbrthemes)
library(tidyverse)

from <- as.POSIXct("2017-09-07 00:00:00")
to <- as.POSIXct("2017-09-09 12:00:00")

trends <- top_trending_range(from, to)

glimpse(trends)
## Observations: 233
## Variables: 5
## $ ts                       <dttm> 2017-09-07 00:00:00, 2017-09-07 00:15:00, 2017-...
## $ overall_trending_topics  <list> [<"florida", "irma", "barbuda", "puerto rico", ...
## $ station_trending_topics  <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ station_top_topics       <list> [<c("CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS...
## $ overall_trending_phrases <list> [<"debt ceiling", "legalize daca", "florida key...

The glimpse view shows a compact, nested data frame. I encourage you to explore the individual nested elements to see the gems they contain, but we’re going to focus on the station_top_topics:

glimpse(trends$station_top_topics[[1]])
## Variables: 2
## $ Station <chr> "CNN", "BLOOMBERG", "CNBC", "FBC", "FOXNEWS", "MSNBC", "BBCNEWS"
## $ Topics  <list> [<"florida", "irma", "daca", "north korea", "harvey", "united st...

Each individual data frame has the top topics of each tracked station.

To get the top 25 topics per day, we’re going to bust out this structure, count up the topic “mentions” (not 100% accurate term, but good enough for now) per day and slice out the top 25. It’s a pretty straightforward process with tidyverse ops:

select(trends, ts, station_top_topics) %>% 
  unnest() %>% 
  unnest() %>% 
  mutate(day = as.Date(ts)) %>% 
  rename(station=Station, topic=Topics) %>% 
  count(day, topic) %>% 
  group_by(day) %>% 
  top_n(25) %>% 
  slice(1:25) %>% 
  arrange(day, desc(n)) %>% 
  mutate(rnk = 25:1) -> top_25_trends

glimpse(top_25_trends)
## Observations: 75
## Variables: 4
## $ day   <date> 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-09-07, 2017-0...
## $ topic <chr> "florida", "irma", "harvey", "north korea", "america", "daca", "chi...
## $ n     <int> 546, 546, 468, 464, 386, 362, 356, 274, 217, 210, 200, 156, 141, 13...
## $ rnk   <int> 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, ...

Now, it’s just a matter of some ggplotting:

ggplot(top_25_trends, aes(day, rnk, label=topic, size=n)) +
  geom_text(vjust=0.5, hjust=0.5) +
  scale_x_date(expand=c(0,0.5)) +
  scale_size(name=NULL, range=c(3,8)) +
  labs(
    x=NULL, y=NULL, 
    title="Top 25 Trending Topics Per Day",
    subtitle="Topic placed by rank and sized by frequency",
    caption="GDELT Television Explorer & #rstats newsflash package github.com/hrbrmstr/newsflash"
  ) +
  theme_ipsum_rc(grid="") +
  theme(axis.text.y=element_blank()) +
  theme(legend.position=c(0.75, 1.05)) +
  theme(legend.direction="horizontal")

Hopefully you’ll have some fun with the new “API”. Make sure to blog your own creations!

UPDATE

As a result of a tweet by @arnicas, you can find a per-day, per-station view (top 10 only) here.

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package dubbed htm2txt has hit CRAN; it’s 100% R and uses regular expressions to strip tags from text.

I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results.

On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and they felt kind of sluggish (base R regular expressions don’t have a “compile” feature and can be slow compared to other regular expression engines).

I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho, which builds upon the Jericho HTML parser created by Martin Jericho and used in at-scale initiatives like the Internet Archive. Yes, I’m trading Python for Java, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don’t need to be re-invented (if you do know of a header-only, cross-platform, C++ HTML-to-text library, definitely leave a comment).

Is it worth it to get rJava up and running to use jericho vs htm2txt? Let’s take a look:

library(jericho) # devtools::install_github("hrbrmstr/jericho")
library(microbenchmark)
library(htm2txt)
library(tidyverse)

c(
  "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526",
  "https://en.wikipedia.org/wiki/Timeline_of_antisemitism",
  "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/"
) -> urls

map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[1])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[1])
  },
  htm2txt = {
    a <- htm2txt(sites_html[1])
  },
  times = 10
) -> mb1

# microbenchmark(
#   jericho_txt = {
#     a <- html_to_text(sites_html[2])
#   },
#   jericho_render = {
#     a <- render_html_to_text(sites_html[2])
#   },
#   htm2txt = {
#     a <- htm2txt(sites_html[2])
#   },
#   times = 10
# ) -> mb2

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[3])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[3])
  },
  htm2txt = {
    a <- htm2txt(sites_html[3])
  },
  times = 10
) -> mb3

The second benchmark is commented out because I really didn’t have time to wait for it to complete (FWIW jericho goes fast in that test). Here’s what the other two look like:

mb1
## Unit: milliseconds
##            expr         min          lq        mean      median          uq         max neval
##     jericho_txt    4.121872    4.294953    4.567241    4.405356    4.734923    5.621142    10
##  jericho_render    5.446296    5.564006    5.927956    5.719971    6.357465    6.785791    10
##         htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132    10

mb3
## Unit: milliseconds
##            expr        min         lq       mean     median         uq        max neval
##     jericho_txt   2.641352   2.814318   3.297543   3.034445   3.488639   5.437411    10
##  jericho_render   3.034765   3.143431   4.708136   3.746157   5.953550   8.931072    10
##         htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958    10

You should run the conversion functions on your own systems to compare the results (they’re somewhat large to incorporate here). I’m fairly certain they do a comparable — if not better — job of extracting clean, pertinent text.
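For a quick, eyeball-level comparison (illustrative only), something like this works well interactively:

# peek at the first few hundred characters from each converter for the same page
cat(substr(html_to_text(sites_html[3]), 1, 300), "\n\n")
cat(substr(htm2txt(sites_html[3]), 1, 300), "\n")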

I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R.