One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest:
Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats // Ethics in Web Scraping https://t.co/y5YxvzB8Fd
— boB Rudis (@hrbrmstr) July 26, 2017
If you load up that tweet and follow the thread, you’ll see a really good question from @kennethrose82 about what an appropriate delay between crawls should be.
The answer is a bit nuanced, as there are both written and unwritten “rules” for those who seek to scrape web site content. For the sake of brevity, this post focuses only on “best practices” (ugh) for being kind to web site resources when it comes to timing requests. Before any of that, though, “Step 0” must be to validate that the site’s terms & conditions or terms of service allow you to scrape & use data from said site.
Robot Roll Call
The absolute first thing you should do before scraping a site is check out its robots.txt file. What’s that? Well, I’ll let you read about it first from the vignette of the package we’re going to use to work with it.
Now that you know what such a file is, you also likely know how to peruse it since the vignette has excellent examples. But, we’ll toss one up here for good measure, focusing on one field that we’re going to talk about next:
library(tidyverse)
library(rvest)
robotstxt::robotstxt("seobook.com")$crawl_delay %>%
  tbl_df()
## # A tibble: 114 x 3
## field useragent value
## <chr> <chr> <chr>
## 1 Crawl-delay * 10
## 2 Crawl-delay asterias 10
## 3 Crawl-delay BackDoorBot/1.0 10
## 4 Crawl-delay BlackHole 10
## 5 Crawl-delay BlowFish/1.0 10
## 6 Crawl-delay BotALot 10
## 7 Crawl-delay BuiltBotTough 10
## 8 Crawl-delay Bullseye/1.0 10
## 9 Crawl-delay BunnySlippers 10
## 10 Crawl-delay Cegbfeieh 10
## # ... with 104 more rows
I chose that site since it has many entries for the Crawl-delay field, which defines the number of seconds a given site would like your crawler to wait between scrapes. For the continued sake of brevity, we’ll assume you’re going to be looking at the * entry when you perform your own scraping tasks (even though you should be setting your own User-Agent string). Let’s make a helper function for retrieving this value from a site, adding in some logic to provide a default value if no Crawl-Delay entry is found and to optimize the experience a bit (note that I keep varying the case of crawl-delay when I mention it to show that the field key is case-insensitive; be thankful robotstxt normalizes it for us!):
.get_delay <- function(domain) {
  message(sprintf("Refreshing robots.txt data for %s...", domain))
  cd_tmp <- robotstxt::robotstxt(domain)$crawl_delay
  if (length(cd_tmp) > 0) {
    star <- dplyr::filter(cd_tmp, useragent == "*") # prefer the catch-all agent entry
    if (nrow(star) == 0) star <- cd_tmp[1,]         # otherwise, fall back to the first entry listed
    as.numeric(star$value[1])
  } else {
    10L # sane default when no Crawl-delay entries exist (more on that below)
  }
}

get_delay <- memoise::memoise(.get_delay)
The .get_delay() function could be made a bit more bulletproof, but I have to leave some work for y’all to do on your own. So, why both the .get_delay() and get_delay() functions, and what is this memoise? Well, even though robotstxt::robotstxt() will ultimately cache (in-memory, so only in the active R session) the robots.txt file it retrieved (if it retrieved one), we don’t want to run the filter/check/default/return logic every time since that just wastes CPU cycles. The memoise() wrapper checks which parameter was sent and returns the previously computed value rather than going through that logic again. We can validate that on the seobook.com domain:
get_delay("seobook.com")
## Refreshing robots.txt data for seobook.com...
## [1] 10
get_delay("seobook.com")
## [1] 10
You can now use get_delay() in a Sys.sleep() call before your httr::GET() or rvest::read_html() operations.
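Here’s a minimal, hedged sketch of what that might look like in practice (the example.com URLs below are placeholders purely for illustration, not a real scraping target):

library(rvest)

# placeholder URLs for illustration only
urls <- c(
  "https://example.com/page-1",
  "https://example.com/page-2",
  "https://example.com/page-3"
)

pages <- lapply(urls, function(u) {
  Sys.sleep(get_delay("example.com")) # memoised, so repeat lookups are nearly free
  rvest::read_html(u)                 # then fetch & parse the page
})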
Not So Fast…
Because you’re a savvy R coder and not some snake charmer, gem hoarder, or go-getter, you likely caught the default 10L return value in .get_delay() and thought “Hrm… Where’d that come from?”.
I’m glad you asked!
I grabbed the first 400 robots.txt WARC files from the June 2017 Common Crawl, which ends up being ~1,000,000 sites. That sample ended up having ~80,000 sites with one or more CRAWL-DELAY entries. Some of those sites had entries that were not valid (in an attempt to either break, subvert, or pwn a crawler) or set to a ridiculous value. I crunched through the data and made saner bins for the values to produce the following:
Sites seem to either want you to wait 10 seconds (or less) or about an hour between scraping actions. I went with the lower number purely for convenience, but would caution that this decision was based on the idea that your intention is to not do a ton of scraping (i.e. less than ~50-100 HTML pages). If you’re really going to do more than that, I strongly suggest you reach out to the site owner. Many folks are glad for the contact and could even arrange a better method for obtaining the data you seek.
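If you’re curious what that binning step might look like, here’s a rough sketch using made-up sample values (the bin edges and the toy crawl_delays data are my assumptions for illustration, not the code or data behind the actual chart):

library(dplyr)
library(ggplot2)

# toy stand-in for the Common Crawl extract: one Crawl-delay value per site
crawl_delays <- tibble::tibble(delay = c(1, 5, 10, 10, 30, 120, 3600, 3600, 86400))

crawl_delays %>%
  filter(!is.na(delay), delay >= 0, delay <= 86400) %>% # drop invalid/absurd values
  mutate(bin = case_when(
    delay <= 10   ~ "10s or less",
    delay <= 60   ~ "11-60s",
    delay <= 3600 ~ "1-60 min",
    TRUE          ~ "over an hour"
  )) %>%
  count(bin) %>%
  ggplot(aes(x = bin, y = n)) +
  geom_col()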
FIN
So, remember:
- check the site’s ToS/T&C before scraping
- check robots.txt before scraping (in general, and for Crawl-Delay specifically)
- contact the site owner if you plan on doing a large amount of scraping
- introduce some delay between page scrapes (even if the site does not have a specific crawl-delay entry), using the insights gained from the Common Crawl analysis to inform your decision
I’ll likely go through all the Common Crawl robots.txt WARC archives to get a fuller picture of the distribution of values and either update this post at a later date or do a quick new post on it.
(You also might want to run robotstxt::get_robotstxt("rud.is") #justsayin :-)