Category Archives: web scraping

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what it is using, most of these stack components leave a fingerprint — bits and pieces that we can put together to identify them.

Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now, enter the URL of your own blog or company site, and come back to see why I’ve mentioned it.

Back? Good.

If you explored that site a bit, you likely saw a link to their GitHub repo. There you’ll find JavaScript source code and even some browser plugins along with the core “fingerprint database”, apps.json. If you poke around the repo, you’ll find its wiki and eventually see that folks have made various unofficial ports to other languages.

By now, you’ve likely guessed that this blog post is introducing an R port of “wappalyzer” and if you have made such a guess, you’re correct! The rappalyzer package is a “work-in-progress” but it’s usable enough to get some feedback on the package API, solicit contributors and show some things you can do with it.

Just rappalyze something already!

For the moment, one real function is exposed: rappalyze(). Feed it a URL string or an httr response object and — if the analysis was fruitful — you’ll get back a data frame with all the identified tech stack components. Let’s see what jquery.com is made of:

devtools::install_github("hrbrmstr/rappalyzer")

library(rappalyzer)

rappalyze("https://jquery.com/")
## # A tibble: 8 x 5
##         tech              category match_type version confidence
##        <chr>                 <chr>      <chr>   <chr>      <dbl>
## 1 CloudFlare                   CDN    headers    <NA>        100
## 2     Debian     Operating Systems    headers    <NA>        100
## 3  Modernizr JavaScript Frameworks     script    <NA>        100
## 4      Nginx           Web Servers    headers    <NA>        100
## 5        PHP Programming Languages    headers  5.4.45        100
## 6  WordPress                   CMS       meta   4.5.2        100
## 7  WordPress                 Blogs       meta   4.5.2        100
## 8     jQuery JavaScript Frameworks     script  1.11.3        100

If you played around on the Wappalyzer site you saw that it “spiders” many URLs to try to get a very complete picture of components that a given site may use. I just used the main jQuery site in the above example and we managed to get quite a bit of information from just that interaction.

Wappalyzer (and, hence, rappalyzer) works by comparing sets of regular expressions against different parts of a site’s content. For now, rappalyzer checks the following (a conceptual sketch of the matching idea follows the list):

  • the site URL (the one after a site’s been contacted & content retrieved)
  • HTTP headers (an invisible set of metadata browsers/crawlers & servers share)
  • HTML content
  • scripts
  • meta tags
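
To make that concrete, here is a minimal sketch of the general fingerprinting idea: match a handful of made-up header patterns against an httr response. This is purely illustrative and is not rappalyzer’s actual rule set or internals.

library(httr)
library(purrr)
library(tibble)

# NOT rappalyzer's internals -- a toy fingerprint list, invented for illustration
fingerprints <- list(
  Nginx      = list(header = "server",       pattern = "nginx"),
  CloudFlare = list(header = "server",       pattern = "cloudflare"),
  PHP        = list(header = "x-powered-by", pattern = "php/?([0-9.]+)?")
)

res  <- GET("https://jquery.com/")
hdrs <- headers(res)

imap_dfr(fingerprints, function(fp, tech) {
  val <- tolower(hdrs[[fp$header]] %||% "")              # header value (or empty string)
  m   <- regmatches(val, regexec(fp$pattern, val))[[1]]  # try the pattern, keep captures
  if (length(m) == 0) return(NULL)
  tibble(
    tech = tech, match_type = "headers",
    version = if (length(m) > 1 && nzchar(m[2])) m[2] else NA_character_
  )
})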

Wappalyzer-proper runs in a browser-context, so it can pick up DOM elements and JavaScript environment variables which can provide some extra information not present in the static content.

As you can see from the returned data frame, each tech component has one or more associated categories as well as a version number (if it could be guessed) and a value indicating how confident the guess was. It will also show where the guess came from (headers, scripts, etc).

A peek under the hood

I started making a pure R port of Wappalyzer but there are some frustrating “gotchas” with the JavaScript regular expressions that would have meant spending time identifying all the edge cases. Since the main Wappalyzer source is in JavaScript it was trivial to down-port it to a V8-compatible version and expose it via an R function. A longer-term plan is to deal with the regular expression conversion, but I kinda needed the functionality sooner than later (well, wanted it in R, anyway).
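
If you have not used V8 from R before, the down-port idiom boils down to evaluating the JavaScript inside a V8 context and calling into it from R. Here is a tiny, hypothetical illustration; the JavaScript function below is a stand-in, not rappalyzer’s real code.

library(V8)

ctx <- v8()

# stand-in for the ported analysis code (hypothetical, for illustration only)
ctx$eval('
  function analyze(headers) {
    return /nginx/i.test(headers["server"] || "") ? ["Nginx"] : [];
  }
')

# arguments are converted to/from JSON for us; should return "Nginx"
ctx$call("analyze", list(server = "nginx/1.13.3"))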

On the other hand, keeping it in JavaScript has some advantages, especially with the advent of chimera (a phantomjs alternative that can put a full headless browser into a V8 engine context). Using that would mean getting the V8 package ported to a more modern version of the V8 source code, which isn’t trivial, but doable. Using chimera would make it possible to identify even more tech stack components.

Note also that these tech stack analyses can be dreadfully slow. That’s not due to using V8. The ones that are slow are slow in all the language ports. This is due to the way the regular expressions are checked against the site content, and to the use of regular expressions in the first place. I’ve got some ideas to speed things up and may also introduce some “guide rails” to prevent lengthy operations and avoid some checks if page characteristics meet certain criteria. Drop an issue if you have ideas as well.

Why is this useful?

Well, you can interactively see what a site uses like we did above. But, we can also look for commonalities across a number of sites. We’ll do this in a small example now. You can skip the exposition and just work with the RStudio project if that’s how you’d rather roll.

Crawling 1,000 orgs

I have a recent-ish copy of a list of Fortune 1000 companies and their industry sectors along with their website URLs. Let’s see how many of those sites we can still access and what we identify.

I’m going to pause for just a minute to revisit some “rules” in the context of this operation.

Specifically, I’m not:

  • reading each site’s terms of service/terms & conditions
  • checking each site’s robots.txt file
  • using a uniquely identifying, non-browser crawler user-agent string (since we need the site to present content like it would to a browser)

However, I’m also not:

  • using the site content in any way except to identify tech stack components (something dozens of organizations do)
  • scraping past the index page HTML content, so there’s no measurable resource usage

I believe I’ve threaded the ethical needle properly (and, I do this for a living), but if you’re uncomfortable with this you should use a different target URL list.

Back to the code. We’ll need some packages:

library(hrbrthemes)
library(tidyverse)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(urltools)
library(rappalyzer) # devtools::install_github("hrbrmstr/rappalyzer")
library(rprojroot)
# I'm also using a user agent shortcut from the splashr package which is on CRAN

Now, I’ve included data in the RStudio project GH repo since it can be used later on to test out changes to rappalyzer. That’s the reason for the if/file.exists tests. As I’ve said before, be kind to sites you scrape and cache content whenever possible to avoid consuming resources that you don’t own.

Let’s take a look at our crawler code:

rt <- find_rstudio_root_file()

if (!file.exists(file.path(rt, "data", "f1k_gets.rds"))) {

  f1k <- read_csv(file.path(rt, "data", "f1k.csv"))

  targets <- pull(f1k, website)

  results <- list()
  errors <- list()

  OK <- function(res) {
    cat(".", sep="")
    results <<- c(results, list(res))
  }

  BAD <- function(err_msg) {
    cat("X", sep="")
    errors <<- c(errors, list(err_msg))
  }

  pool <- multi_set(total_con = 20)

  walk(targets, ~{
    multi_add(
      new_handle(url = .x, useragent = splashr::ua_macos_chrome, followlocation = TRUE, timeout = 60),
      OK, BAD
    )
  })

  multi_run(pool = pool)

  write_rds(results, file.path(rt, "data", "f1k_gets.rds"), compress = "xz")

} else {
  results <- read_rds(file.path(rt, "data", "f1k_gets.rds"))
}

If you were expecting read_html() and/or GET() calls you’re likely surprised at what you see. We’re trying to pull content from a thousand web sites, some of which may not be there anymore (for good or just temporarily). Sequential calls would take many minutes, even with error handling. Instead, we’re using the asynchronous/parallel capabilities in the curl package to set up all 1,000 requests with the ability to capture the responses. There’s a hack-ish “progress” bar that uses . for “good” requests and X for ones that didn’t work. It still takes a little time (and you may need to tweak the total_con value on older systems) but way less than sequential GETs would have, and any errors are implicitly handled (ignored).

The next block does the tech stack identification:

if (!file.exists(file.path(rt, "data", "rapp_results.rds"))) {

  results <- keep(results, ~.x$status_code < 300)
  map(results, ~{
    list(
      url = .x$url,
      status_code = .x$status_code,
      content = .x$content,
      headers = httr:::parse_headers(.x$headers)
    ) -> res
    class(res) <- "response"
    res
  }) -> results

  # this takes a *while*
  pb <- progress_estimated(length(results))
  map_df(results, ~{
    pb$tick()$print()
    rap_df <- rappalyze(.x)
    if (nrow(rap_df) > 0) rap_df <- mutate(rap_df, url = .x$url)
  }) -> rapp_results

  write_rds(rapp_results, file.path(rt, "data", "rapp_results.rds"))

} else {
  rapp_results <- read_rds(file.path(rt, "data", "rapp_results.rds"))
}

We filter out non-“OK” (HTTP 200-ish response) curl results and turn them into just enough of an httr response object to be able to work with them.

Then we let rappalyzer work. I had time to kill so I let it run sequentially. In a production or time-critical context, I’d use the future package.
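
For the curious, here is a hedged sketch of what that parallel pass could look like with the furrr front-end to future. It assumes the furrr package is installed and that rappalyze() is safe to call from background R sessions:

library(future)
library(furrr)

plan(multisession, workers = 4)

# same per-response logic as above, just spread across worker sessions
future_map_dfr(results, ~{
  rap_df <- rappalyze(.x)
  if (nrow(rap_df) > 0) mutate(rap_df, url = .x$url) else rap_df
}) -> rapp_results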

We’re almost down to data we can play with! Let’s join it back up with the original metadata:

left_join(
  mutate(rapp_results, host = domain(rapp_results$url)) %>%
    bind_cols(suffix_extract(.$host))
  ,
  mutate(f1k, host = domain(website)) %>%
    bind_cols(suffix_extract(.$host)),
  by = c("domain", "suffix")
) %>%
  filter(!is.na(name)) -> rapp_results

length(unique(rapp_results$name))
## [1] 754

Note that some orgs seem to have changed hands/domains so we’re left with ~750 sites to play with (we could have tried to match more, but this is just an exercise). If you do the same curl-fetching exercise on your own, you may get fewer or more total sites. Internet crawling is fraught with peril and it’s not really possible to conveniently convey 100% production code in a blog post.

Comparing “Fortune 1000” org web stacks

Let’s see how many categories we picked up:

xdf <- distinct(rapp_results, name, sector, category, tech)

sort(unique(xdf$category))
##  [1] "Advertising Networks"  "Analytics"             "Blogs"
##  [4] "Cache Tools"           "Captchas"              "CMS"
##  [7] "Ecommerce"             "Editors"               "Font Scripts"
## [10] "JavaScript Frameworks" "JavaScript Graphics"   "Maps"
## [13] "Marketing Automation"  "Miscellaneous"         "Mobile Frameworks"
## [16] "Programming Languages" "Tag Managers"          "Video Players"
## [19] "Web Frameworks"        "Widgets"

Plenty to play with!

Now, what do you think the most common tech component is across all these diverse sites?

count(xdf, tech, sort=TRUE)
## # A tibble: 115 x 2
##                        tech     n
##                       <chr> <int>
##  1                   jQuery   572
##  2       Google Tag Manager   220
##  3                Modernizr   197
##  4        Microsoft ASP.NET   193
##  5          Google Font API   175
##  6        Twitter Bootstrap   162
##  7                WordPress   150
##  8                jQuery UI   143
##  9             Font Awesome   118
## 10 Adobe Experience Manager    69
## # ... with 105 more rows

I had suspected it was jQuery (and hinted at that when I used it as the initial rappalyzer example). In a future blog post we’ll look at just how vulnerable orgs are based on which CDNs they use (many use similar jQuery and other resource CDNs, and use them insecurely).

Let’s see how broadly each tech stack category is used across the sectors:

group_by(xdf, sector) %>%
  count(category) %>%
  ungroup() %>%
  arrange(category) %>%
  mutate(category = factor(category, levels=rev(unique(category)))) %>%
  ggplot(aes(category, n)) +
  geom_boxplot() +
  scale_y_comma() +
  coord_flip() +
  labs(x=NULL, y="Tech/Services Detected across Sectors",
       title="Usage of Tech Stack Categories by Sector") +
  theme_ipsum_rc(grid="X")

That’s a lot of JavaScript. But, there’s also large-ish use of CMS, Web Frameworks and other components.

I wonder what the CMS usage looks like:

filter(xdf, category=="CMS") %>%
  count(tech, sort=TRUE)
## # A tibble: 15 x 2
##                        tech     n
##                       <chr> <int>
##  1                WordPress    75
##  2 Adobe Experience Manager    69
##  3                   Drupal    68
##  4               Sitefinity    26
##  5     Microsoft SharePoint    19
##  6                 Sitecore    14
##  7                      DNN    11
##  8                Concrete5     6
##  9     IBM WebSphere Portal     4
## 10        Business Catalyst     2
## 11                    Hippo     2
## 12                TYPO3 CMS     2
## 13               Contentful     1
## 14              Orchard CMS     1
## 15             SilverStripe     1

Drupal and WordPress weren’t shocking to me (again, I do this for a living) but they may be to others when you consider this is the Fortune 1000 we’re talking about.

What about Web Frameworks?

filter(xdf, category=="Web Frameworks") %>%
  count(tech, sort=TRUE)
## # A tibble: 7 x 2
##                          tech     n
##                         <chr> <int>
## 1           Microsoft ASP.NET   193
## 2           Twitter Bootstrap   162
## 3             ZURB Foundation    61
## 4               Ruby on Rails     2
## 5            Adobe ColdFusion     1
## 6                    Kendo UI     1
## 7 Woltlab Community Framework     1

Yikes! Quite a bit of ASP.NET. But, many enterprises haven’t migrated to more modern, useful platforms (yet).

FIN

As noted earlier, the data & code are on GitHub. There are many questions you can ask and answer with this data set and definitely make sure to share any findings you discover!

We’ll continue exploring different aspects of what you can glean from looking at web sites in a different way in future posts.

I’ll leave you with one interactive visualization that lets you explore the tech stacks by sector.

devtools::install_github("jeromefroe/circlepackeR")
library(circlepackeR)
library(data.tree)
library(treemap)

cpdf <- count(xdf, sector, tech)

cpdf$pathString <- paste("rapp", cpdf$sector, cpdf$tech, sep = "/")
stacks <- as.Node(cpdf)

circlepackeR(stacks, size = "n", color_min = "hsl(56,80%,80%)",
             color_max = "hsl(341,30%,40%)", width = "100%", height="800px")

spiderbar, spiderbar 
Reads robots rules from afar.
Crawls the web, any size; 
Fetches with respect, never lies.
Look Out! 
Here comes the spiderbar.

Is it fast? 
Listen bud, 
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there 
There goes the spiderbar. 

(Check the end of the post if you don’t recognize the lyrical riff.)

Face front, true believer!

I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.

But (there’s always a “but”, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my forthcoming “crawl delay” post).

I needed something faster for my bulk Crawl-Delay analysis which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package spiderbar which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name rep).

How much faster?

I’m glad you asked!

Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?

library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)

rob <- get_robotstxt("imdb.com")

microbenchmark(

  robotstxt = {
    x <- parse_robotstxt(rob)
    x$crawl_delay
  },

  spiderbar = {
    y <- robxp(rob)
    crawl_delays(y)
  }

) -> mb1

update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))

autoplot(mb1) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

As you can see, it’s just a tad bit faster.

Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.

But, I don’t care about Crawl-Delay!

OK, fine. Do you care about fetchability? We can speed that up, too!

rob_txt <- parse_robotstxt(rob)
rob_spi <- robxp(rob)

microbenchmark(

  robotstxt = {
    robotstxt:::path_allowed(rob_txt$permissions, "/Vote")
  },

  spiderbar = {
    can_fetch(rob_spi, "/Vote")
  }

) -> mb2

autoplot(mb2) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Vectorized or it didn’t happen.

(Gosh, even Spider-Man got more respect!)

OK, this is a tough crowd, but we’ve got vectorization covered as well:

microbenchmark(

  robotstxt = {
    paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
  },

  spiderbar = {
    can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
  }

) -> mb3

autoplot(mb3) +
  scale_y_comma(name="nanoseconds", trans="log10") +
  labs(title="Microbenchmark results for testing multiple resource 'fetchability'",
       subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
  theme_ipsum_rc(grid="Xx")

Excelsior!

Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

You can check out the code and leave package questions or comments on GitHub.

(Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)

International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back.

There will be no ‘rrrrrr’ abuse in this post, I’m afraid, but there will be plenty of R code.

We’re going to combine pirate day with “pirating” data, in the sense that I’m going to show one way on how to use the web scraping powers of R responsibly to collect data on and explore modern-day pirate encounters.

Scouring The Seas Web For Pirate Data

Interestingly enough, there are many sources of pirate data. I’ve blogged a few in the past, but I came across a new (to me) one by the International Chamber of Commerce. Their Commercial Crime Services division has something called the Live Piracy & Armed Robbery Report:

(site png snapshot taken with splashr)

I fiddled a bit with the URL and — sure enough — if you work a bit you can get data going back to late 2013, all in the same general format, so I jotted down base URLs and start+end record values and filed them away for future use:

library(V8)
library(stringi)
library(httr)
library(rvest)
library(robotstxt)
library(jwatr) # github/hrbrmstr/jwatr
library(hrbrthemes)
library(purrrlyr)
library(rprojroot)
library(tidyverse)

report_urls <- read.csv(stringsAsFactors=FALSE, header=TRUE, text="url,start,end
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/, 1345, 1459
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/151/, 1137, 1339
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-map/details/146/, 885, 1138
https://www.icc-ccs.org/piracy-reporting-centre/live-piracy-report/details/144/, 625, 884
https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/133/, 337, 623")

by_row(report_urls, ~sprintf(.x$url %s+% "%s", .x$start:.x$end), .to="url_list") %>%
  pull(url_list) %>%
  flatten_chr() -> target_urls

head(target_urls)
## [1] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1345"
## [2] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1346"
## [3] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1347"
## [4] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1348"
## [5] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1349"
## [6] "https://www.icc-ccs.org/index.php/piracy-reporting-centre/live-piracy-report/details/169/1350"

Time to pillage some details!

But…Can We Really Do It?

I poked around the site’s terms of service/terms and conditions and automated retrieval was not discouraged. Yet, those aren’t the only sea mines we have to look out for. Perhaps they use their robots.txt to stop pirates. Let’s take a look:

robotstxt::get_robotstxt("https://www.icc-ccs.org/")
## # If the Joomla site is installed within a folder such as at
## # e.g. www.example.com/joomla/ the robots.txt file MUST be
## # moved to the site root at e.g. www.example.com/robots.txt
## # AND the joomla folder name MUST be prefixed to the disallowed
## # path, e.g. the Disallow rule for the /administrator/ folder
## # MUST be changed to read Disallow: /joomla/administrator/
## #
## # For more information about the robots.txt standard, see:
## # http://www.robotstxt.org/orig.html
## #
## # For syntax checking, see:
## # http://www.sxw.org.uk/computing/robots/check.html
##
## User-agent: *
## Disallow: /administrator/
## Disallow: /cache/
## Disallow: /cli/
## Disallow: /components/
## Disallow: /images/
## Disallow: /includes/
## Disallow: /installation/
## Disallow: /language/
## Disallow: /libraries/
## Disallow: /logs/
## Disallow: /media/
## Disallow: /modules/
## Disallow: /plugins/
## Disallow: /templates/
## Disallow: /tmp/

Ahoy! We’ve got a license to pillage!

But, we don’t have a license to abuse their site.

While I still haven’t had time to follow up on an earlier post about ‘crawl-delay’ settings across the internet, I have done enough work on it to know that a 5 or 10 second delay is the most common setting (when sites bother to have this directive in their robots.txt file). ICC’s site does not have this setting defined, but we’ll still pirate (err, crawl) responsibly and use a 5-second delay between requests:

s_GET <- safely(GET)

pb <- progress_estimated(length(target_urls))
map(target_urls, ~{
  pb$tick()$print()
  Sys.sleep(5)
  s_GET(.x)
}) -> httr_raw_responses

write_rds(httr_raw_responses, "data/2017-icc-ccs-raw-httr-responses.rds")

good_responses <- keep(httr_raw_responses, ~!is.null(.x$result))

jwatr::response_list_to_warc_file(good_responses, "data/icc-good")

There are more “safety” measures you can use with httr::GET() but this one is usually sufficient. It just prevents the iteration from dying when there are hard retrieval errors.

I also like to save off the crawl results so I can go back to the raw file (if needed) vs re-scrape the site (this crawl takes a while). I do it two ways here, first using raw httr response objects (including any “broken” ones) and then filtering out the “complete” responses and saving them in WARC format so it’s in a more common format for sharing with others who may not use R.

Digging For Treasure

Did I mention that while the site looks like it’s easy to scrape it’s really not easy to scrape? That nice looking table is a sea mirage ready to trap unwary sailors (err, crawlers) in a pit of despair. The UX is built dynamically from on-page JavaScript content.

Now, you’re likely thinking: “Don’t we need to re-scrape the site with seleniumPipes or splashr?”

Fear not, stout yeoman! We can do this with the content we have if we don’t mind swabbing the decks first. Let’s put the map code up first and then dig into the details:

# make field names great again
mfga <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", "_", x)
  x <- gsub("_+", "_", x)
  x <- gsub("(^_|_$)", "", x)
  x <- make.unique(x, sep = "_")
  x
}

# I know the columns I want and this makes getting them into the types I want easier
cols(
  attack_number = col_character(),
  attack_posn_map = col_character(),
  date = col_datetime(format = ""),
  date_time = col_datetime(format = ""),
  id = col_integer(),
  location_detail = col_character(),
  narrations = col_character(),
  type_of_attack = col_character(),
  type_of_vessel = col_character()
) -> pirate_cols

# V8 context we'll use to evaluate the javascript we carve out of each page
ctx <- v8()

# iterate over the good responses with a progress bar
pb <- progress_estimated(length(good_responses))
map_df(good_responses, ~{

  pb$tick()$print()

  # `safely` hides the data under `result` so expose it
  doc <- content(.x$result)

  # target the `<script>` tag that has our data, carve out the target lines, do some data massaging and evaluate the javascript with V8
  html_nodes(doc, xpath=".//script[contains(., 'requirejs')]") %>%
    html_text() %>%
    stri_split_lines() %>%
    .[[1]] %>%
    grep("narrations_ro", ., value=TRUE) %>%
    sprintf("var dat = %s;", .) %>%
    ctx$eval()

  p <- ctx$get("dat", flatten=TRUE)

  # now, process that data, turning the ugly returned list content into something we can put in a data frame
  keep(p[[1]], is.list) %>%
    map_df(~{
      list(
        field = mfga(.x[[3]]$label),
        value = .x[[3]]$value
      )
    }) %>%
    filter(value != "") %>%
    distinct(field, .keep_all = TRUE) %>%
    spread(field, value)

}) %>%
  type_convert(col_types = pirate_cols) %>%
  filter(stri_detect_regex(attack_number, "^[[:digit:]]")) %>%
  filter(lubridate::year(date) > 2012) %>%
  mutate(
    attack_posn_map = stri_replace_last_regex(attack_posn_map, ":.*$", ""),
    attack_posn_map = stri_replace_all_regex(attack_posn_map, "[\\(\\) ]", "")
  ) %>%
  separate(attack_posn_map, sep=",", into=c("lat", "lng")) %>%
  mutate(lng = as.numeric(lng), lat = as.numeric(lat)) -> pirate_df

write_rds(pirate_df, "data/pirate_df.rds")

The first bit there is a function to “make field names great again”. We’re processing some ugly list data and it’s not all uniform across all years so this will help make the data wrangling idiom more generic.

Next, I set up a cols object because we’re going to be extracting data from text as text and I think it’s cleaner to type_convert at the end vs have a slew of as.numeric() (et al) statements in-code (for small munging). You’ll note at the end of the munging pipeline I still need to do some manual conversions.

Now we can iterate over the good (complete) responses.

The purrr::safely function shoves the real httr response in result so we focus on that then “surgically” extract the target data from the <script> tag. Once we have it, we get it into a form we can feed into the V8 javascript engine and then retrieve the data from said evaluation.

Because ICC used the same Joomla plugin over the years, the data layout is uniform, but it can also contain additional fields, so we extract the fields in a generic manner. During the course of data wrangling, I noticed there were often multiple Date: fields, so we throw in some logic to help avoid duplicate field names as well.

That whole process goes really quickly, but why not save off the clean data at the end for good measure?

Gotta Have A Pirate Map

Now we can begin to explore the data. I’ll leave most of that to you (since I’m providing the scraped data on GitHub), but here are a few views. First, just some simple counts per month:

mutate(pirate_df, year = lubridate::year(date), year_mon = as.Date(format(date, "%Y-%m-01"))) %>%
  count(year_mon) %>%
  ggplot(aes(year_mon, n)) +
  geom_segment(aes(xend=year_mon, yend=0)) +
  scale_y_comma() +
  labs(x=NULL, y=NULL,
       title="(Confirmed) Piracy Incidents per Month",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="Y")

And, finally, a map showing pirate encounters but colored by year:

world <- map_data("world")

mutate(pirate_df, year = lubridate::year(date)) %>%
  arrange(year) %>%
  mutate(year = factor(year)) -> plot_df

ggplot() +
  geom_map(data = world, map = world, aes(x=long, y=lat, map_id=region), fill="#b2b2b2") +
  geom_point(data = plot_df, aes(lng, lat, color=year), size=2, alpha=1/3) +
  ggalt::coord_proj("+proj=wintri") +
  viridis::scale_color_viridis(name=NULL, discrete=TRUE) +
  labs(x=NULL, y=NULL,
       title="Piracy Incidents per Month (Confirmed)",
       caption="Source: International Chamber of Commerce Commercial Crime Services <https://www.icc-ccs.org/>") +
  theme_ipsum_rc(grid="XY") +
  theme(legend.position = "bottom")

Taking Up The Mantle of the Dread Pirate Hrbrmstr

Hopefully this post shed some light on scraping responsibly and using different techniques to get to hidden data in web pages.

There’s some free-form text and more than a few other ways to look at the data. You can find the code and data on Github and don’t hesitate to ask questions in the comments or file an issue. If you make something blog it! Share your ideas and creations with the rest of the R (or other language) communities!

I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document and it usually does a good job but there are some pages that it fails miserably on since it’s more of a brute-force method than one that uses any real “intelligence” when performing the text node targeting.

Most modern browsers have inherent or plugin-able “readability” capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that use a similar methodology, but I’m not aware of any R ports.

What do I mean by “clean text”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:

library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
## [1] "\n \n<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <meta charset=\"utf-8\">\n <meta name=\"apple-mobile-web-app-capable\" content=\"yes\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" />\n <meta name=\"apple-mobile-web-app-status-bar-style\" content=\"black\" />\n <link rel=\"shortcut icon\" href=\"/assets/flat-ui/images/favicon.ico\">\n\n <link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://ropensci.org/feed.xml\" />\n\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/bootstrap/css/bootstrap.css\">\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/css/flat-ui.css\">\n\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/icon-font.css\">\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/animations.css\">\n <link rel=\"stylesheet\" href=\"/static/css/style.css\">\n <link href=\"/assets/css/ss-social/webfonts/ss-social.css\" rel=\"stylesheet\" />\n <link href=\"/assets/css/ss-standard/webfonts/ss-standard.css\" rel=\"stylesheet\"/>\n <link rel=\"stylesheet\" href=\"/static/css/github.css\">\n <script type=\"text/javascript\" src=\"//use.typekit.net/djn7rbd.js\"></script>\n <script type=\"text/javascript\">try{Typekit.load();}catch(e){}</script>\n <script src=\"/static/highlight.pack.js\"></script>\n <script>hljs.initHighlightingOnLoad();</script>\n\n <title>Onboarding visdat, a tool for preliminary visualisation of whole dataframes</title>\n <meta name=\"keywords\" content=\"R, software, package, review, community, visdat, data-visualisation\" />\n <meta name=\"description\" content=\"\" />\n <meta name=\"resource_type\" content=\"website\"/>\n <!– RDFa Metadata (in DublinCore) –>\n <meta property=\"dc:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"dc:creator\" content=\"\" />\n <meta property=\"dc:date\" content=\"\" />\n <meta property=\"dc:format\" content=\"text/html\" />\n <meta property=\"dc:language\" content=\"en\" />\n <meta property=\"dc:identifier\" content=\"/blog/blog/2017/08/22/visdat\" />\n <meta property=\"dc:rights\" content=\"CC0\" />\n <meta property=\"dc:source\" content=\"\" />\n <meta property=\"dc:subject\" content=\"Ecology\" />\n <meta property=\"dc:type\" content=\"website\" />\n <!– RDFa Metadata (in OpenGraph) –>\n <meta property=\"og:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"og:author\" content=\"/index.html#me\" /> <!– Should be Liquid? URI? –>\n <meta property=\"http://ogp.me/ns/profile#first_name\" content=\"\"/>\n <meta property=\"http://ogp.me/ns/profile#last_name\" content=\"\"/>\n

(it goes on for a bit, best to run the code locally)

We can use the reticulate package to load the Python readability module to just get the clean, article text:

readability <- import("readability") # pip install readability-lxml

doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))

doc$summary() %>%
  read_xml() %>%
  xml_text()
# [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.

(again, it goes on for a bit, best to run the code locally)

That text is now in good enough shape to tidy.

Here’s the same version with clean_text():

# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n  \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.

(lastly, it goes on for a bit, best to run the code locally)

As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.

The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++good project for an aspiring package author to take on).
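
To give a flavor of what such a port might involve, here is a deliberately crude, Arc90-flavored toy heuristic in xml2/rvest: score the parents of <p> nodes by how much paragraph text they hold and keep the densest one. It is a sketch of the idea only, not a faithful readability port.

library(xml2)
library(rvest)
library(purrr)

crude_main_text <- function(html_txt) {
  doc <- read_html(html_txt)
  # drop obvious non-content nodes
  xml_remove(xml_find_all(doc, ".//script | .//style | .//nav | .//footer"))
  # unique parents of <p> tags are our "main content" candidates
  candidates <- xml_find_all(doc, ".//p/..")
  if (length(candidates) == 0) return(html_text(doc))
  # score each candidate by the total amount of paragraph text it holds
  scores <- map_dbl(candidates, ~sum(nchar(html_text(xml_find_all(.x, "./p")))))
  html_text(candidates[[which.max(scores)]], trim = TRUE)
}

crude_main_text(content(res, as="text", encoding="UTF-8")) %>% substr(1, 300)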

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).

One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are jwatjars (the JAR files) and jwatr (the package that does the work).

I’ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.

Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.

The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).

To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.

First, we’ll need some helpers:

library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr) # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that but I wanted to show some of the work R does behind the scenes for you.

Now, we’ll grab some neighbourhood and crime info:

wf <- warc_file("~/Data/wrap-test")

res <- warc_GET(wf, "https://data.police.uk/api/leicestershire/neighbourhoods")

str(jsonlite::fromJSON(content(res, as="text")), 2)
## 'data.frame':	67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...

res <- warc_GET(wf, "https://data.police.uk/api/crimes-street/all-crime",
                query = list(lat=52.629729, lng=-1.131592, date="2017-01"))

res <- warc_GET(wf, "https://data.police.uk/api/crimes-at-location",
                query = list(location_id="884227", date="2017-02"))

close_warc_file(wf)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.

file.info("~/Data/wrap-test.warc.gz")$size
## [1] 76020

We can use these results later and, pretty easily, since the WARC file will be read in as a tidy R tibble (fancy data frame):

xdf <- read_warc("~/Data/wrap-test.warc.gz", include_payload = TRUE)

glimpse(xdf)
## Observations: 3
## Variables: 14
## $ target_uri                 <chr> "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                 <chr> "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type          <chr> "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                  <chr> "response", "response", "response"
## $ content_length             <dbl> 2984, 511564, 688
## $ payload_type               <chr> "application/json", "application/json", "application/json"
## $ profile                    <chr> NA, NA, NA
## $ date                       <dttm> 2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code           <dbl> 200, 200, 200
## $ http_protocol_content_type <chr> "application/json", "application/json", "application/json"
## $ http_version               <chr> "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers           <list> [<48, 54, 54, 50, 2f, 31, 2e, 31, 20, 32, 30, 30, 20, 4f, 4b, 0d, 0a, 61, 63, 63, 65, 73, 73, 2d, 63...
## $ warc_record_id             <chr> "<urn:uuid:2ae3e851-a1cf-466a-8f73-9681aab25d0c>", "<urn:uuid:77b30905-37f7-4c78-a120-2a008e194f94>",...
## $ payload                    <list> [<5b, 7b, 22, 69, 64, 22, 3a, 22, 4e, 43, 30, 34, 22, 2c, 22, 6e, 61, 6d, 65, 22, 3a, 22, 43, 69, 74...

xdf$target_uri
## [1] "https://data.police.uk/api/leicestershire/neighbourhoods"                                   
## [2] "https://data.police.uk/api/crimes-street/all-crime?lat=52.629729&lng=-1.131592&date=2017-01"
## [3] "https://data.police.uk/api/crimes-at-location?location_id=884227&date=2017-02" 

The URLs are all there, so it will be easier to map the original calls to them.

Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:

for (i in 1:nrow(xdf)) {
  res <- jsonlite::fromJSON(readBin(xdf$payload[[i]], "character"))
  print(str(res, 2))
}
## 'data.frame': 67 obs. of  2 variables:
##  $ id  : chr  "NC04" "NC66" "NC67" "NC68" ...
##  $ name: chr  "City Centre" "Cultural Quarter" "Riverside" "Clarendon Park" ...
## NULL
## 'data.frame': 1318 obs. of  9 variables:
##  $ category        : chr  "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
##  $ location_type   : chr  "Force" "Force" "Force" "Force" ...
##  $ location        :'data.frame': 1318 obs. of  3 variables:
##   ..$ latitude : chr  "52.616961" "52.629963" "52.641646" "52.635184" ...
##   ..$ street   :'data.frame': 1318 obs. of  2 variables:
##   ..$ longitude: chr  "-1.120719" "-1.122291" "-1.131486" "-1.135455" ...
##  $ context         : chr  "" "" "" "" ...
##  $ outcome_status  :'data.frame': 1318 obs. of  2 variables:
##   ..$ category: chr  NA NA NA NA ...
##   ..$ date    : chr  NA NA NA NA ...
##  $ persistent_id   : chr  "" "" "" "" ...
##  $ id              : int  54163555 54167687 54167689 54168393 54168392 54168391 54168386 54168381 54168158 54168159 ...
##  $ location_subtype: chr  "" "" "" "" ...
##  $ month           : chr  "2017-01" "2017-01" "2017-01" "2017-01" ...
## NULL
## 'data.frame': 1 obs. of  9 variables:
##  $ category        : chr "violent-crime"
##  $ location_type   : chr "Force"
##  $ location        :'data.frame': 1 obs. of  3 variables:
##   ..$ latitude : chr "52.643950"
##   ..$ street   :'data.frame': 1 obs. of  2 variables:
##   ..$ longitude: chr "-1.143042"
##  $ context         : chr ""
##  $ outcome_status  :'data.frame': 1 obs. of  2 variables:
##   ..$ category: chr "Unable to prosecute suspect"
##   ..$ date    : chr "2017-02"
##  $ persistent_id   : chr "4d83433f3117b3a4d2c80510c69ea188a145bd7e94f3e98924109e70333ff735"
##  $ id              : int 54726925
##  $ location_subtype: chr ""
##  $ month           : chr "2017-02"
## NULL

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:

for (i in 1:nrow(xdf)) {
  
  payload_content(
    xdf$target_uri[i], 
    xdf$http_protocol_content_type[i], 
    xdf$http_raw_headers[[i]], 
    xdf$payload[[i]], as = "text"
  ) %>% 
    jsonlite::fromJSON() -> res
  
  print(str(res, 2))
  
}

The same output is printed, so I’m saving some blog content space by not including it.

Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC request record as well as a response record. But, I wanted to toss this out there to get feedback on the idiom and on what functionality folks would like to see added.

So, please kick the tyres and file as many issues as you have time or interest to. I’m still designing the full package API and making refinements to existing functions, so there’s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.

One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest:

If you load up that tweet and follow the thread, you’ll see a really good question by @kennethrose82 regarding what an appropriate setting should be for a delay between crawls.

The answer is a bit nuanced as there are some written and unwritten “rules” for those who would seek to scrape web site content. For the sake of brevity in this post, we’ll only focus on “best practices” (ugh) for being kind to web site resources when it comes to timing requests, after a quick mention that “Step 0” must be to validate that the site’s terms & conditions or terms of service allow you to scrape & use data from said site.

Robot Roll Call

The absolute first thing you should do before scraping a site should be to check out their robots.txt file. What’s that? Well, I’ll let you read about it first from the vignette of the package we’re going to use to work with it.

Now that you know what such a file is, you also likely know how to peruse it since the vignette has excellent examples. But, we’ll toss one up here for good measure, focusing on one field that we’re going to talk about next:

library(tidyverse)
library(rvest)

robotstxt::robotstxt("seobook.com")$crawl_delay %>% 
  tbl_df()
## # A tibble: 114 x 3
##          field       useragent value
##          <chr>           <chr> <chr>
##  1 Crawl-delay               *    10
##  2 Crawl-delay        asterias    10
##  3 Crawl-delay BackDoorBot/1.0    10
##  4 Crawl-delay       BlackHole    10
##  5 Crawl-delay    BlowFish/1.0    10
##  6 Crawl-delay         BotALot    10
##  7 Crawl-delay   BuiltBotTough    10
##  8 Crawl-delay    Bullseye/1.0    10
##  9 Crawl-delay   BunnySlippers    10
## 10 Crawl-delay       Cegbfeieh    10
## # ... with 104 more rows

I chose that site since it has many entries for the Crawl-delay field, which defines the number of seconds a given site would like your crawler to wait between scrapes. For the continued sake of brevity, we’ll assume you’re going to be looking at the * entry when you perform your own scraping tasks (even though you should be setting your own User-Agent string). Let’s make a helper function for retrieving this value from a site, adding in some logic to provide a default value if no Crawl-Delay entry is found and to optimize the experience a bit (note that I keep varying the case of crawl-delay when I mention it to show that the field key is case-insensitive; be thankful robotstxt normalizes it for us!):

.get_delay <- function(domain) {
  
  message(sprintf("Refreshing robots.txt data for %s...", domain))
  
  cd_tmp <- robotstxt::robotstxt(domain)$crawl_delay
  
  if (length(cd_tmp) > 0) {
    star <- dplyr::filter(cd_tmp, useragent=="*")
    if (nrow(star) == 0) star <- cd_tmp[1,]
    as.numeric(star$value[1])
  } else {
    10L
  }

}

get_delay <- memoise::memoise(.get_delay)

The .get_delay() function could be made a bit more bulletproof, but I have to leave some work for y’all to do on your own. So, why both .get_delay() and the get_delay() functions, and what is this memoise? Well, even though robotstxt::robotstxt() will ultimately cache (in-memory, so only in the active R session) the robots.txt file it retrieved (if it retrieved one) we don’t want to do the filter/check/default/return all the time since it just wastes CPU clock cycles. The memoise() operation will check which parameter was sent and return the value that was computed vs going through that logic again. We can validate that on the seobook.com domain:

get_delay("seobook.com")
## Refreshing robots.txt data for seobook.com...
## [1] 10

get_delay("seobook.com")
## [1] 10

You can now use get_delay() in a Sys.sleep() call before your httr::GET() or rvest::read_html() operations.
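
Putting those pieces together, here is a small sketch of a “polite” fetch helper. The function name is mine (it is not from any package) and it assumes the urltools package for pulling the domain out of a URL:

polite_read <- function(url) {
  domain <- urltools::domain(url)
  Sys.sleep(get_delay(domain))   # honor Crawl-delay (or the 10 second default)
  rvest::read_html(url)
}

pg <- polite_read("https://rud.is/b/")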

Not So Fast…

Because you’re a savvy R coder and not some snake charmer, gem hoarder or go-getter, you likely caught the default 10L return value in .get_delay() and thought “Hrm… Where’d that come from?”.

I’m glad you asked!

I grabbed the first 400 robots.txt WARC files from the June 2017 Common Crawl, which ends up being ~1,000,000 sites. That sample ended up having ~80,000 sites with one or more CRAWL-DELAY entries. Some of those sites had entries that were not valid (in an attempt to either break, subvert or pwn a crawler) or set to a ridiculous value. I crunched through the data and made saner bins for the values to produce the following:
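
A hedged sketch of that binning step, assuming a hypothetical cd_df data frame with one numeric crawl_delay value per site (and made-up bin edges), might look like:

library(tidyverse)
library(hrbrthemes)

mutate(cd_df, delay_bin = case_when(   # cd_df is hypothetical: one crawl_delay per site
  crawl_delay <= 10   ~ "10s or less",
  crawl_delay <= 60   ~ "11-60s",
  crawl_delay <= 3600 ~ "1-60 min",
  TRUE                ~ "over an hour"
)) %>%
  count(delay_bin) %>%
  ggplot(aes(delay_bin, n)) +
  geom_col() +
  scale_y_comma() +
  labs(x="Crawl-delay (binned)", y="# sites",
       title="Crawl-delay values in the robots.txt sample") +
  theme_ipsum_rc(grid="Y")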

Sites seem to either want you to wait 10 seconds (or less) or about an hour between scraping actions. I went with the lower number purely for convenience, but would caution that this decision was based on the idea that your intention is to not do a ton of scraping (i.e. less than ~50-100 HTML pages). If you’re really going to do more than that, I strongly suggest you reach out to the site owner. Many folks are glad for the contact and could even arrange a better method for obtaining the data you seek.

FIN

So, remember:

  • check site ToS/T&C before scraping
  • check robots.txt before scraping (in general and for Crawl-Delay)
  • contact the site owner if you plan on doing a large amount of scraping
  • introduce some delay between page scrapes, even if the site does not have a specific Crawl-delay entry, using the insights gained from the Common Crawl analysis to inform your decision

I’ll likely go through all the Common Crawl robots.txt WARC archives to get a fuller picture of the distribution of values and either update this post at a later date or do a quick new post on it.

(You also might want to run robotstxt::get_robotstxt("rud.is") #justsayin :-)

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that contain all the Radio Shack impending store closings. It’s the most comprehensive list I’ve found, but the format is terrible and there’s no easy, in-browser way to download them all.

CNN has ToS that prevent automated data gathering from CNN-proper. But, they used Adobe Document Cloud for these images, which has no similar restrictions from a quick glance at their ToS. That means you get an R⁶ post on how to grab the 38 individual images and combine them into one PDF. I did this all with the hopes of OCRing the text, which has not panned out too well since the image quality and font were likely deliberately chosen to make it hard to do precisely what I’m trying to do.

If you work through the example, you’ll get a feel for:

  • using sprintf() to take a template and build a vector of URLs
  • use dplyr progress bars
  • customize httr verb options to ensure you can get to content
  • use purrr to iterate through a process of turning raw image bytes into image content (via magick) and turn a list of images into a PDF

library(httr)
library(magick)
library(tidyverse)

url_template <- "https://assets.documentcloud.org/documents/1657793/pages/radioshack-convert-p%s-large.gif"

pb <- progress_estimated(38)

sprintf(url_template, 1:38) %>% 
  map(~{
    pb$tick()$print()
    GET(url = .x, 
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8", 
          referer = "http://money.cnn.com/interactive/technology/radio-shack-closure-list/index.html", 
          authority = "assets.documentcloud.org"))    
  }) -> store_list_pages

map(store_list_pages, content) %>% 
  map(image_read) %>% 
  reduce(image_join) %>% 
  image_write("combined_pages.pdf", format = "pdf")
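
If you want to try the OCR step yourself, a minimal sketch (assuming the tesseract package is installed, which magick::image_ocr() wraps) might look like the following. Expect fairly noisy output given the image quality:

map(store_list_pages, content) %>%
  map(image_read) %>%
  map_chr(image_ocr) %>%       # OCR each page image (needs tesseract)
  paste(collapse = "\n") -> ocr_text

cat(substr(ocr_text, 1, 500))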

I figured out the Document Cloud links and necessary httr::GET() options by using Chrome Developer Tools and my curlconverter package.

If any academic-y folks have a test subject (er, summer intern) with a free hour and would be willing to have them transcribe this list and stick it on GitHub, you’d have my eternal thanks.

I caught a glimpse of a tweet by @dataandme on Friday:

Mara is — without a doubt — the best data science promoter in the Twitterverse. She seems to have her finger on the pulse of everything that’s happening in the data science world and is one of the most ardent amplifiers there is.

The post she linked to was a bit older (2015) and had a very “stream of consciousness” feel to it. I actually wish more R folks took to their blogs like this to post their explorations into various topics. The code in this post likely worked at the time it was posted and accomplished the desired goal (which means it was ultimately decent code). Said practice will ultimately help both you and others.

Makeover Time

As I’ve noted before, web scraping has some rules, even though they can be tough to find. This post made a very common mistake of not putting in a time delay between requests (a cardinal scraping rule) which we’ll fix in a moment.

There are a few other optimizations we can make. The first is moving from a for loop to something a bit more vectorized. Another is to figure out how many pages we need to scrape from information in the first set of results.

However, an even bigger one is to take advantage of the underlying XHR POST request that the new version of the site ultimately calls (it appears this site has undergone some changes since the blog post and it’s unlikely the code in the post actually works now).

Let’s start by setting up a function to grab individual pages:

library(httr)
library(rvest)
library(stringi)
library(tidyverse)

get_page <- function(i=1, pb=NULL) {
  
  if (!is.null(pb)) pb$tick()$print()
  
  POST(url = "http://www.propwall.my/wp-admin/admin-ajax.php", 
       body = list(action = "star_property_classified_list_change_ajax", 
                   tab = "Most Relevance", 
                   page = as.integer(i), location = "Mont Kiara", 
                   category = "", listing = "For Sale", 
                   price = "", keywords = "Mont Kiara, Kuala Lumpur", 
                   filter_id = "17", filter_type = "Location", 
                   furnishing = "", builtup = "", 
                   tenure = "", view = "list", 
                   map = "on", blurb = "0"), 
       encode = "form") -> res
  
  stop_for_status(res)
  
  res <- content(res, as="parsed") 
  
  Sys.sleep(sample(seq(0,2,0.5), 1))
  
  res
  
}

The i parameter gets passed into the body of the POST request. You can find that XHR POST request via the Network tab of your browser Developer Tools view. You can either transcribe it by hand or use the curlconverter package (which is temporarily off CRAN so you’ll need to get it from github) to auto-convert it to an httr::VERB request.

We also add a parameter (default to NULL) to support the use of a progress bar (so we can see what’s going on). If we pass in a populated dplyr progress bar, this will tick it down for us.

Now, we can use that to get the total number of listings.

get_page(1) %>% 
  html_node(xpath=".//a[contains(., 'Classifieds:')]") %>% 
  html_text() %>% 
  stri_match_last_regex("([[:digit:],]+)$") %>% 
  .[,2] %>% 
  stri_replace_all_fixed(",", "") %>% 
  as.numeric() -> classified_ct

total_pages <- 1 + (classified_ct %/% 20)

We’ll setup another function to extract the listing URLs and titles:

get_listings <- function(pg) {
  data_frame(
    link = html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)" ) %>%  html_attr("href"),
    description = html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)" ) %>% html_text(trim = TRUE)  
  )
}

Rather than chain calls to html_nodes(), we take advantage of a well-formed CSS selector (which ultimately gets auto-translated to an XPath string). This has the advantage of speed (though that’s not necessarily an issue when web scraping) as well as brevity; a rough comparison with the chained form is below.
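
For comparison, a chained version of the same targeting (given a parsed page from get_page()) looks roughly like the block below. It is illustrative only; the descendant vs. child combinators make it approximate rather than strictly identical to the selector used in get_listings().

pg <- get_page(1)

html_nodes(pg, "div#list-content > div.media") %>%
  html_nodes("h4.media-heading") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  head()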

Now, we’ll scrape all the listings:

pb <- progress_estimated(total_pages)
listings_df <- map_df(1:total_pages, ~get_listings(get_page(.x, pb)))

Yep. That’s it. Everything’s been neatly abstracted into functions and we’ve taken advantage of some modern R idioms to accomplish our first task.

FIN

With the above code you should be able to do your own makeover of the remaining code in the original post. Remember to:

  • add a delay when you sequentially scrape pages from a site
  • abstract out common operations into functions
  • take advantage of purrr functions (or built-in *apply functions) to avoid for loops

I’ll close with a note about adhering to site terms of service / terms and conditions. Nothing I found when searching for ToS/ToC on the site suggested that scraping, automated grabbing or use of the underlying data in bulk was prohibited. Many sites have such restrictions — like IMDB (I mention that as it’s been used a lot lately by R folks and it really shouldn’t be). LinkedIn recently sued scrapers for such ToS violations.

I fundamentally believe violating ToS is unethical behavior and should be avoided just on those grounds. When I come across sites I need information from that have restrictive ToS I contact the site owner (when I can find them) and ask them for permission, and have only been refused a small handful of times. Given those recent legal actions, it’s also better to be safe than sorry.