Analyzing “Crawl-Delay” Settings in Common Crawl robots.txt Data with R

One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest.

If you load up that tweet and follow the thread, you’ll see a really good question from @kennethrose82 about what an appropriate delay between crawls should be.

The answer is a bit nuanced, as there are both written and unwritten “rules” for those who seek to scrape web site content. For the sake of brevity, this post focuses only on “best practices” (ugh) for being kind to a site’s resources when timing your requests, after a quick reminder that “Step 0” must be to verify that the site’s terms & conditions or terms of service allow you to scrape & use data from it.

Robot Roll Call

The absolute first thing you should do before scraping a site is to check out its robots.txt file. What’s that? Well, I’ll let you read about it first in the vignette of the robotstxt package, which is what we’re going to use to work with it.

Now that you know what such a file is, you also likely know how to peruse it since the vignette has excellent examples. But, we’ll toss one up here for good measure, focusing on one field that we’re going to talk about next:

library(tidyverse)
library(rvest)

robotstxt::robotstxt("seobook.com")$crawl_delay %>% 
  tbl_df()
## # A tibble: 114 x 3
##          field       useragent value
##          <chr>           <chr> <chr>
##  1 Crawl-delay               *    10
##  2 Crawl-delay        asterias    10
##  3 Crawl-delay BackDoorBot/1.0    10
##  4 Crawl-delay       BlackHole    10
##  5 Crawl-delay    BlowFish/1.0    10
##  6 Crawl-delay         BotALot    10
##  7 Crawl-delay   BuiltBotTough    10
##  8 Crawl-delay    Bullseye/1.0    10
##  9 Crawl-delay   BunnySlippers    10
## 10 Crawl-delay       Cegbfeieh    10
## # ... with 104 more rows

I chose that site since it has many entries for the Crawl-delay field, which specifies the number of seconds a site would like your crawler to wait between requests. For the continued sake of brevity, we’ll assume you’re going to be looking at the * entry when you perform your own scraping tasks (even though you should be setting your own User-Agent string). Let’s make a helper function for retrieving this value from a site, adding in some logic to provide a default value if no Crawl-Delay entry is found and to avoid repeating work we’ve already done (note that I keep varying the case of crawl-delay when I mention it to show that the field key is case-insensitive; be thankful robotstxt normalizes it for us!):

.get_delay <- function(domain) {
  
  message(sprintf("Refreshing robots.txt data for %s...", domain))
  
  # crawl_delay comes back as a data frame of field/useragent/value rows
  cd_tmp <- robotstxt::robotstxt(domain)$crawl_delay
  
  if (is.data.frame(cd_tmp) && nrow(cd_tmp) > 0) {
    # prefer the catch-all "*" entry; fall back to the first entry found
    star <- dplyr::filter(cd_tmp, useragent == "*")
    if (nrow(star) == 0) star <- cd_tmp[1,]
    as.numeric(star$value[1])
  } else {
    10L # default delay (seconds) when no Crawl-delay entry exists; see below
  }

}

get_delay <- memoise::memoise(.get_delay)

The .get_delay() function could be made a bit more bulletproof, but I have to leave some work for y’all to do on your own. So, why both .get_delay() and get_delay(), and what is this memoise? Well, even though robotstxt::robotstxt() will ultimately cache the robots.txt file it retrieved (in-memory, so only for the active R session, and only if it retrieved one), we don’t want to repeat the filter/check/default/return logic on every call since that just wastes CPU cycles. The memoise() wrapper checks which parameter was passed and returns the previously computed value instead of going through that logic again. We can validate that on the seobook.com domain:

get_delay("seobook.com")
## Refreshing robots.txt data for seobook.com...
## [1] 10

get_delay("seobook.com")
## [1] 10
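
If you suspect a site changed its robots.txt during your session, memoise also lets you drop the cached result so the next call recomputes it:

memoise::forget(get_delay)  # clear the memoised cache for get_delay()
get_delay("seobook.com")    # will refresh the robots.txt data again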

You can now use get_delay() in a Sys.sleep() call before your httr::GET() or rvest::read_html() operations.
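
For example, a minimal sketch of a polite page-fetching loop might look like this (the domain and paths here are purely illustrative):

library(rvest)

# hypothetical set of pages to fetch from a single site
urls <- c(
  "https://example.com/page-1",
  "https://example.com/page-2",
  "https://example.com/page-3"
)

delay <- get_delay("example.com")

pages <- lapply(urls, function(u) {
  Sys.sleep(delay)   # honor the site's Crawl-delay (or our 10s default)
  read_html(u)       # then fetch & parse the page
})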

Not So Fast…

Because you’re a savvy R coder and not some snake charmer, gem hoarder or go-getter, you likely caught the default 10L return value in .get_delay() and thought “Hrm… Where’d that come from?”.

I’m glad you asked!

I grabbed the first 400 robots.txt WARC files from the June 2017 Common Crawl, which works out to ~1,000,000 sites. That sample ended up having ~80,000 sites with one or more CRAWL-DELAY entries. Some of those sites had entries that were not valid (in an attempt to either break, subvert or pwn a crawler) or were set to a ridiculous value. I crunched through the data and made saner bins for the values, and the distribution boils down to this:

Sites seem to either want you to wait 10 seconds (or less) or about an hour between scraping actions. I went with the lower number purely for convenience, but would caution that this choice assumes you’re not going to do a ton of scraping (i.e. fewer than ~50-100 HTML pages). If you’re really going to do more than that, I strongly suggest you reach out to the site owner. Many folks are glad for the contact and may even be able to arrange a better method for obtaining the data you seek.
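
If you want to re-create a similar binning on your own Common Crawl extract, here’s a rough sketch. It assumes you’ve already parsed the Crawl-delay values into a data frame named delays with a numeric value column; both that name and the bin boundaries are my own choices, not something taken from the original crunching pipeline:

delays %>%
  filter(!is.na(value), value >= 0) %>%        # drop garbage/invalid entries
  mutate(bin = case_when(
    value <= 10   ~ "10s or less",
    value <= 60   ~ "11-60s",
    value <= 3600 ~ "1-60min",
    TRUE          ~ "more than an hour"
  )) %>%
  count(bin, sort = TRUE)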

FIN

So, remember:

  • check site ToS/T&C before scraping
  • check robots.txt before scraping (in general and for Crawl-Delay)
  • contact the site owner if you plan on doing a large amount of scraping
  • introduce some delay between page scrapes, even if the site does not have a specific Crawl-delay entry, using the insights gained from the Common Crawl analysis to inform your decision

I’ll likely go through all the Common Crawl robots.txt WARC archives to get a fuller picture of the distribution of values and either update this post at a later date or do a quick new post on it.

(You also might want to run robotstxt::get_robotstxt("rud.is") #justsayin :-)
