

{"id":6134,"date":"2017-07-28T10:37:22","date_gmt":"2017-07-28T15:37:22","guid":{"rendered":"https:\/\/rud.is\/b\/?p=6134"},"modified":"2018-03-10T07:53:51","modified_gmt":"2018-03-10T12:53:51","slug":"analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/","title":{"rendered":"Analyzing &#8220;Crawl-Delay&#8221; Settings in Common Crawl robots.txt Data with R"},"content":{"rendered":"<p>One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest:<\/p>\n<blockquote class=\"twitter-tweet\" data-lang=\"en\">\n<p lang=\"en\" dir=\"ltr\">Apologies for a Medium link but if you do ANY web scraping, you need to read this <a href=\"https:\/\/mobile.twitter.com\/hashtag\/rstats?src=hash\">#rstats<\/a> \/\/ Ethics in Web Scraping <a href=\"https:\/\/t.co\/y5YxvzB8Fd\">https:\/\/t.co\/y5YxvzB8Fd<\/a><\/p>\n<p>&mdash; boB Rudis (@hrbrmstr) <a href=\"https:\/\/mobile.twitter.com\/hrbrmstr\/status\/890226176383475712\">July 26, 2017<\/a><\/p><\/blockquote>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<p>If you load up that tweet and follow the thread, you&#8217;ll see a <em>really good<\/em> question by @kennethrose82 regarding what an appropriate setting should be for a delay between crawls.<\/p>\n<p>The answer is a bit nuanced as there are some written and unwritten &#8220;rules&#8221; for those who would seek to scrape web site content. 
For the sake of brevity in this post, we&#8217;ll only focus on &#8220;best practices&#8221; (ugh) for being kind to web site resources when it comes to timing requests, after a quick mention that &#8220;Step 0&#8221; <em>must<\/em> be to validate that the site&#8217;s terms &amp; conditions or terms of service allow you to scrape &amp; use data from said site.<\/p>\n<h3>Robot Roll Call<\/h3>\n<p>The absolute first thing you should do before scraping a site is check out its <code>robots.txt<\/code> file. What&#8217;s that? Well, I&#8217;ll let you read about it first from the <a href=\"https:\/\/cran.rstudio.com\/web\/packages\/robotstxt\/vignettes\/using_robotstxt.html\">vignette of the package<\/a> we&#8217;re going to use to work with it.<\/p>\n<p>Now that you know what such a file is, you also likely know how to peruse it since the vignette has excellent examples. But, we&#8217;ll toss one up here for good measure, focusing on one field that we&#8217;re going to talk about next:<\/p>\n<pre id=\"robtxt01\"><code class=\"language-r\">library(tidyverse)\r\nlibrary(rvest)\r\n\r\nrobotstxt::robotstxt(&quot;seobook.com&quot;)$crawl_delay %&gt;% \r\n  tbl_df()\r\n## # A tibble: 114 x 3\r\n##          field       useragent value\r\n##          &lt;chr&gt;           &lt;chr&gt; &lt;chr&gt;\r\n##  1 Crawl-delay               *    10\r\n##  2 Crawl-delay        asterias    10\r\n##  3 Crawl-delay BackDoorBot\/1.0    10\r\n##  4 Crawl-delay       BlackHole    10\r\n##  5 Crawl-delay    BlowFish\/1.0    10\r\n##  6 Crawl-delay         BotALot    10\r\n##  7 Crawl-delay   BuiltBotTough    10\r\n##  8 Crawl-delay    Bullseye\/1.0    10\r\n##  9 Crawl-delay   BunnySlippers    10\r\n## 10 Crawl-delay       Cegbfeieh    10\r\n## # ... with 104 more rows<\/code><\/pre>\n<p>I chose that site since it has <em>many<\/em> entries for the <code>Crawl-delay<\/code> field, which defines the number of seconds a given site would like your crawler to wait between scrapes. 
For the continued sake of brevity, we&#8217;ll assume you&#8217;re going to be looking at the <code>*<\/code> entry when you perform your own scraping tasks (even though you <em>should<\/em> be setting your own <code>User-Agent<\/code> string). Let&#8217;s make a helper function for retrieving this value from a site, adding in some logic to provide a default value if no <code>Crawl-Delay<\/code> entry is found and to optimize the experience a bit (note that I keep varying the case of <code>crawl-delay<\/code> when I mention it to show that the field key is case-insensitive; be thankful <code>robotstxt<\/code> normalizes it for us!):<\/p>\n<pre id=\"robtxt02\"><code class=\"language-r\">.get_delay &lt;- function(domain) {\r\n  \r\n  message(sprintf(&quot;Refreshing robots.txt data for %s...&quot;, domain))\r\n  \r\n  cd_tmp &lt;- robotstxt::robotstxt(domain)$crawl_delay\r\n  \r\n  # NROW() counts rows (and is 0 for NULL); length() on a data frame counts columns\r\n  if (NROW(cd_tmp) &gt; 0) {\r\n    star &lt;- dplyr::filter(cd_tmp, useragent==&quot;*&quot;)\r\n    if (nrow(star) == 0) star &lt;- cd_tmp[1,]\r\n    as.numeric(star$value[1])\r\n  } else {\r\n    10L\r\n  }\r\n\r\n}\r\n\r\nget_delay &lt;- memoise::memoise(.get_delay)<\/code><\/pre>\n<p>The <code>.get_delay()<\/code> function could be made a bit more bulletproof, but I have to leave <em>some<\/em> work for y&#8217;all to do on your own. So, why both the <code>.get_delay()<\/code> and <code>get_delay()<\/code> functions, and what is this <code>memoise<\/code>? Well, even though <code>robotstxt::robotstxt()<\/code> will ultimately cache (in-memory, so only in the active R session) the <code>robots.txt<\/code> file it retrieved (if it retrieved one), we don&#8217;t want to do the filter\/check\/default\/return all the time since it just wastes CPU clock cycles. The <code>memoise()<\/code> operation will check which parameter was sent and return the value that was computed vs going through that logic again. 
We can validate that on the <code>seobook.com<\/code> domain:<\/p>\n<pre id=\"robtxt03\"><code class=\"language-r\">get_delay(&quot;seobook.com&quot;)\r\n## Refreshing robots.txt data for seobook.com...\r\n## [1] 10\r\n\r\nget_delay(&quot;seobook.com&quot;)\r\n## [1] 10<\/code><\/pre>\n<p>You can now use <code>get_delay()<\/code> in a <code>Sys.sleep()<\/code> call before your <code>httr::GET()<\/code> or <code>rvest::read_html()<\/code> operations.<\/p>\n<h3>Not So Fast&hellip;<\/h3>\n<p>Because you&#8217;re a savvy R coder and not some snake charmer, gem hoarder or go-getter, you likely caught the default <code>10L<\/code> return value in <code>.get_delay()<\/code> and thought <em>&#8220;Hrm&hellip; Where&#8217;d that come from?&#8221;<\/em>.<\/p>\n<p>I&#8217;m glad you asked!<\/p>\n<p>I grabbed the first 400 <code>robots.txt<\/code> WARC files from the June 2017 <a href=\"http:\/\/commoncrawl.org\/2017\/07\/june-2017-crawl-archive-now-available\/\">Common Crawl<\/a>, which ends up being ~1,000,000 sites. That sample ended up having ~80,000 sites with one or more <code>CRAWL-DELAY<\/code> entries. Some of those sites had entries that were not valid (in an attempt to either break, subvert or pwn a crawler) or set to a ridiculous value. 
I crunched through the data and made saner bins for the values to produce the following:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"6137\" data-permalink=\"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/cursor_and_rstudio-5\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&amp;ssl=1\" data-orig-size=\"1792,926\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Cursor_and_RStudio\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=510%2C264&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?resize=510%2C264&#038;ssl=1\" alt=\"\" width=\"510\" height=\"264\" class=\"aligncenter size-full wp-image-6137\" \/><\/a><\/p>\n<p>Sites seem to either want you to wait 10 seconds (or less) or about an hour between scraping actions. I went with the lower number purely for convenience, but would caution that this decision was based on the idea that your intention is to not do a ton of scraping (i.e. less than ~50-100 HTML pages). If you&#8217;re really going to do more than that, I strongly suggest you reach out to the site owner. 
Many folks are glad for the contact and could even arrange a better method for obtaining the data you seek.<\/p>\n<h3>FIN<\/h3>\n<p>So, remember:<\/p>\n<ul>\n<li>check site ToS\/T&amp;C before scraping<\/li>\n<li>check <code>robots.txt<\/code> before scraping (in general and for <code>Crawl-Delay<\/code>)<\/li>\n<li>contact the site owner if you plan on doing a large amount of scraping<\/li>\n<li>introduce <em>some<\/em> delay between page scrapes, even if the site does not have a specific <code>crawl-delay<\/code> entry, using the insights gained from the Common Crawl analysis to inform your decision<\/li>\n<\/ul>\n<p>I&#8217;ll likely go through all the Common Crawl <code>robots.txt<\/code> WARC archives to get a fuller picture of the distribution of values and either update this post at a later date or do a quick new post on it.<\/p>\n<p>(You also might want to run <code>robotstxt::get_robotstxt(\"rud.is\")<\/code> #justsayin :-)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats \/\/ Ethics in Web Scraping https:\/\/t.co\/y5YxvzB8Fd &mdash; boB Rudis (@hrbrmstr) July 26, 2017 If you load that up that 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6137,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[91,725],"tags":[810],"class_list":["post-6134","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-r","category-web-scraping","tag-post"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Analyzing &quot;Crawl-Delay&quot; Settings in Common Crawl robots.txt Data with R - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Analyzing &quot;Crawl-Delay&quot; Settings in Common Crawl robots.txt Data with R - rud.is\" \/>\n<meta property=\"og:description\" content=\"One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats \/\/ Ethics in Web Scraping https:\/\/t.co\/y5YxvzB8Fd &mdash; boB Rudis (@hrbrmstr) July 26, 2017 If you load that up that [&hellip;]\" \/>\n<meta property=\"og:url\" 
content=\"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2017-07-28T15:37:22+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-03-10T12:53:51+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1\" \/>\n\t<meta property=\"og:image:width\" content=\"1792\" \/>\n\t<meta property=\"og:image:height\" content=\"926\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Analyzing &#8220;Crawl-Delay&#8221; Settings in Common Crawl robots.txt Data with 
R\",\"datePublished\":\"2017-07-28T15:37:22+00:00\",\"dateModified\":\"2018-03-10T12:53:51+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/\"},\"wordCount\":840,\"commentCount\":10,\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"image\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/07\\\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1\",\"keywords\":[\"post\"],\"articleSection\":[\"R\",\"web scraping\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/\",\"name\":\"Analyzing \\\"Crawl-Delay\\\" Settings in Common Crawl robots.txt Data with R - 
rud.is\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/07\\\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1\",\"datePublished\":\"2017-07-28T15:37:22+00:00\",\"dateModified\":\"2018-03-10T12:53:51+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#primaryimage\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/07\\\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2017\\\/07\\\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1\",\"width\":1792,\"height\":926},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/07\\\/28\\\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/rud.is\\\/b\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Analyzing &#8220;Crawl-Delay&#8221; Settings in Common Crawl robots.txt Data with 
R\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/rud.is\\\/b\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\\\/\\\/rud.is\"],\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/author\\\/hrbrmstr\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Analyzing \"Crawl-Delay\" Settings in Common Crawl robots.txt Data with R - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/","og_locale":"en_US","og_type":"article","og_title":"Analyzing \"Crawl-Delay\" Settings in Common Crawl robots.txt Data with R - rud.is","og_description":"One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats \/\/ Ethics in Web Scraping https:\/\/t.co\/y5YxvzB8Fd &mdash; boB Rudis (@hrbrmstr) July 26, 2017 If you load that up that [&hellip;]","og_url":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/","og_site_name":"rud.is","article_published_time":"2017-07-28T15:37:22+00:00","article_modified_time":"2018-03-10T12:53:51+00:00","og_image":[{"width":1792,"height":926,"url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1","type":"image\/png"}],"author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Analyzing &#8220;Crawl-Delay&#8221; Settings in Common Crawl robots.txt Data with R","datePublished":"2017-07-28T15:37:22+00:00","dateModified":"2018-03-10T12:53:51+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/"},"wordCount":840,"commentCount":10,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"image":{"@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1","keywords":["post"],"articleSection":["R","web scraping"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/","url":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/","name":"Analyzing \"Crawl-Delay\" Settings in Common Crawl robots.txt Data with R - 
rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"primaryImageOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#primaryimage"},"image":{"@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1","datePublished":"2017-07-28T15:37:22+00:00","dateModified":"2018-03-10T12:53:51+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#primaryimage","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1","width":1792,"height":926},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Analyzing &#8220;Crawl-Delay&#8221; Settings in Common Crawl robots.txt Data with R"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. 
#rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1792%2C926&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p23idr-1AW","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":6385,"url":"https:\/\/rud.is\/b\/2017\/09\/19\/pirating-web-content-responsibly-with-r\/","url_meta":{"origin":6134,"position":0},"title":"Pirating Web Content Responsibly With R","author":"hrbrmstr","date":"2017-09-19","format":false,"excerpt":"International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back. There will be no\u2026","rel":"","context":"In &quot;data wrangling&quot;","block_context":{"text":"data wrangling","link":"https:\/\/rud.is\/b\/category\/data-wrangling\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":6465,"url":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","url_meta":{"origin":6134,"position":1},"title":"Speeding Up Digital 
Arachnids","author":"hrbrmstr","date":"2017-09-25","format":false,"excerpt":"spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It's got C++ under the hood. Can you scrape, from a site? Test with can_fetch(), TRUE == alright Hey, there There goes\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":5907,"url":"https:\/\/rud.is\/b\/2017\/05\/05\/scrapeover-friday-a-k-a-another-r-scraping-makeover\/","url_meta":{"origin":6134,"position":2},"title":"Scrapeover Friday \u2014 a.k.a. 
Another R Scraping Makeover","author":"hrbrmstr","date":"2017-05-05","format":false,"excerpt":"I caught a glimpse of a tweet by @dataandme on Friday: Using R & rvest to explore Malaysian property mkt: \"Web Scraping: The Sequel, Propwall.my\" https:\/\/t.co\/daZOOJJfPN #rstats #rvest pic.twitter.com\/u6QMhm4M3e\u2014 Mara Averick (@dataandme) May 5, 2017 Mara is \u2014 without a doubt \u2014 the best data science promoter in the Twitterverse.\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11670,"url":"https:\/\/rud.is\/b\/2018\/12\/02\/more-scraping-ethics-gone-awry-and-why-do-this-when-theres-a-free-api\/","url_meta":{"origin":6134,"position":3},"title":"More &#8220;Scraping Ethics Gone Awry&#8221; and &#8220;Why Do This When There&#8217;s a Free API?&#8221;","author":"hrbrmstr","date":"2018-12-02","format":false,"excerpt":"I can't seem to free my infrequently-viewed email inbox from \"you might like!\" notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I'd've been blissfully unaware of it and would have saved you the trouble of reading this). Today, they sent me this\u2026","rel":"","context":"In &quot;web scraping&quot;","block_context":{"text":"web scraping","link":"https:\/\/rud.is\/b\/category\/web-scraping\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11383,"url":"https:\/\/rud.is\/b\/2018\/08\/13\/in-brief-splashr-update-high-performance-scraping-with-splashr-furrr-teamhg-memexs-aquarium\/","url_meta":{"origin":6134,"position":4},"title":"In-brief: splashr update + High Performance Scraping with splashr, furrr &#038; TeamHG-Memex&#8217;s Aquarium","author":"hrbrmstr","date":"2018-08-13","format":false,"excerpt":"The development version of splashr now support authenticated connections to Splash API instances. 
Just specify user and pass on the initial splashr::splash() call to use your scraping setup a bit more safely. For those not familiar with splashr and\/or Splash: the latter is a lightweight alternative to tools like Selenium\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6164,"url":"https:\/\/rud.is\/b\/2017\/08\/22\/caching-httr-requests-this-means-warc\/","url_meta":{"origin":6134,"position":5},"title":"Caching httr Requests? This means WAR[C]!","author":"hrbrmstr","date":"2017-08-22","format":false,"excerpt":"I've blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6134","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=6134"}],"version-history":[{"count":0,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6134\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media\/6137"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=6134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=61
34"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=6134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}