International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.
There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.
We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way to use the web scraping powers of R responsibly to collect data on and explore modern-day pirate encounters.
Scouring The Seas Web For Pirate Data
Interestingly enough, there are many sources of pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:
(site PNG snapshot taken with splashr)
I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:
library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)
report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")
by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
pull(url_list) %>%
flatten_chr() -> target_urls
head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"
Time to pillage some details!
But…Can We Really Do It?
I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:
robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/
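We can also confirm programmatically that the report detail pages we built above aren’t covered by any of those Disallow rules. Here’s a minimal check with robotstxt::paths_allowed() (the path is just the first of our target URLs):

robotstxt::paths_allowed(
  paths = "/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345",
  domain = "www.icc-ccs.org"
)
# should come back TRUE, given the Disallow rules above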
Ahoy! We’ve got a license to pillage!
But, we don’t have a license to abuse their site.
While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet, I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still pirate crawl responsibly and use a 5 second delay between requests:
# wrap GET so a hard error on one request doesn't kill the whole crawl
s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)   # crawl responsibly: 5 seconds between requests
  s_GET(.x)
}) -> httr_raw_responses

# keep the raw responses around so we never have to re-crawl
write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

# drop any requests that failed outright
good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

# and save the complete responses in WARC format for sharing
jwatr::response_list_to_warc_file(good_responses, "data/icc-good")
There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.
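If safely() is new to you, here’s what that wrapper buys us in miniature (the URLs are just placeholders for one request that works and one that fails):

ok  <- s_GET("https://httpbin.org/get")
bad <- s_GET("https://no-such-host.invalid/")
str(ok, max.level = 1)   # $result is an httr response object, $error is NULL
str(bad, max.level = 1)  # $result is NULL, $error holds the condition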
I also like to save off the crawl results so I can go back to the raw file (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here: first using raw httr response objects (including any “broken” ones), then filtering out the “complete” responses and saving them in WARC format so it’s in a more common format for sharing with others who may not use R.
Digging For Treasure
Did I mention that while the site looks like it’s easy to scrape, it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary sailors crawlers in a pit of despair. The UX is built dynamically from on-page javascript: the row data lives inside a big requirejs-laden <script> block rather than in the rendered HTML.
Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”
Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:
# make field names great again
mfga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
x <- make.unique(x, sep = "_")
x
}
# I know the columns I want and this makes getting them into the types I want easier
cols(
attack_number = col_character(),
attack_posn_map = col_character(),
date = col_datetime(format = ""),
date_time = col_datetime(format = ""),
id = col_integer(),
location_detail = col_character(),
narrations = col_character(),
type_of_attack = col_character(),
type_of_vessel = col_character()
) -> pirate_cols
# we need a V8 context to evaluate the javascript we carve out of each page
ctx <- v8()

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines,
  # do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turning the ugly returned list content into
  # something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df
write_rds(pirate_df, "data/pirate_df.rds")
The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years, so this will help make the data wrangling idiom more generic.
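For example, here’s what it does to a few representative labels (note how the duplicate Date: label we’ll run into later gets deduplicated):

mfga(c("Attack Number", "Date:", "Date:", "Type of Vessel"))
## [1] "attack_number" "date" "date_1" "type_of_vessel"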
Next, I set up a cols object because we’re going to be extracting data from text as text, and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code for small munging. You’ll note at the end of the munging pipeline I still need to do some manual conversions.
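If that pattern is unfamiliar, here’s the idea in miniature with a toy tibble (nothing to do with the pirate data):

tibble(id = c("1", "2"), date = c("2017-01-02", "2017-03-04")) %>%
  type_convert(col_types = cols(id = col_integer(), date = col_date(format = "")))
# everything is wrangled as character, then converted to integer/Date in one pass at the end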
Now we can iterate over the good (complete) responses.
The purrr::safely function shoves the real httr response in result, so we focus on that and then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from said evaluation.
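If you haven’t used V8 from R before, the round-trip looks like this in miniature (a throwaway context and toy javascript, not the site’s actual script):

demo_ctx <- v8()
demo_ctx$eval('var dat = [ { "label": "Attack Number", "value": "042-17" } ];')
str(demo_ctx$get("dat", flatten = TRUE))
# `dat` comes back as a data frame with `label` and `value` columns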
Because ICC used the same Joomla plugin over the years, the data is uniform, but it can also contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.
That whole process goes really quickly, but why not save off the clean data at the end for good measure?
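In a later session you can skip the crawl and the munging entirely and just pick the tidy data back up:

pirate_df <- read_rds("data/pirate_df.rds")
glimpse(pirate_df)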
Gotta Have A Pirate Map
Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:
mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")
And, finally, a map showing pirate encounters, colored by year:
world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")
Taking Up The Mantle of the Dread Pirate Hrbrmstr
Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.
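If you want a quick starting point for exploring further, even simple tallies of the categorical fields we scraped are interesting (a sketch using the columns defined in pirate_cols above):

count(pirate_df, type_of_attack, sort=TRUE)
count(pirate_df, type_of_vessel, sort=TRUE)
count(pirate_df, year = lubridate::year(date), type_of_attack, sort=TRUE)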
There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on GitHub, and don’t hesitate to ask questions in the comments or file an issue. If you make something, blog it! Share your ideas and creations with the rest of the R (or other language) communities!