

{"id":6945,"date":"2017-11-03T08:18:12","date_gmt":"2017-11-03T13:18:12","guid":{"rendered":"https:\/\/rud.is\/b\/?p=6945"},"modified":"2018-03-07T16:57:13","modified_gmt":"2018-03-07T21:57:13","slug":"i-for-one-welcome-our-forthcoming-new-robots-txt-overlords","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/","title":{"rendered":"I, For One, Welcome Our Forthcoming New robots.txt Overlords"},"content":{"rendered":"<p><a href=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"6954\" data-permalink=\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/andy-kelly-402111\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&amp;ssl=1\" data-orig-size=\"1080,720\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;1&quot;}\" data-image-title=\"andy-kelly-402111\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=300%2C200&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=510%2C340&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?resize=510%2C340&#038;ssl=1\" alt=\"\" width=\"510\" height=\"340\" class=\"aligncenter size-full wp-image-6954\" 
\/><\/a><\/p>\n<p>Despite my week-long Twitter consumption sabbatical (helped &#8212; in part &#8212; by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to <a href=\"https:\/\/scholar.google.com\/scholar_case?case=5916920735312631998\">this little gem<\/a>. Said gem is the text of a very recent ruling from a District Court in Texas and deals with a favourite subject of mine: <code>robots.txt<\/code>.<\/p>\n<p>The background of the case is that there were two parties who both ran websites for oil and gas professionals that include job postings. One party filed a lawsuit against the other asserting that the other party hacked into its system and accessed and used various information in violation of the Computer Fraud and Abuse Act (CFAA), the Stored Wire and Electronic Communications and Transactional Records Access Act (SWECTRA), the Racketeer Influenced and Corrupt Organizations Act (RICO), the Texas Harmful Access by Computer Act (THACA), the Texas Theft Liability Act (TTLA), and the Texas Uniform Trade Secrets Act (TUTS). They also asserted common law claims of misappropriation of confidential information, conversion, trespass to chattels, fraud, breach of fiduciary duty, unfair competition, tortious interference with present and prospective business relationships, civil conspiracy, and aiding and abetting.<\/p>\n<p>The other party filed a motion for dismissal on a number of grounds involving legalese on Terms &amp; Conditions, a rebuttal of CFAA claims and really gnarly legalese around copyrights. There are more than a few paragraphs that make me glad none of my offspring have gone into or have a desire to go into the legal profession. 
One TLDR here is that T&amp;Cs do, in fact, matter (though that is definitely dependent upon the legal climate where you live or have a case filed against you). We&#8217;re going to focus on the DMCA claim which leads us to the <code>robots.txt<\/code> part.<\/p>\n<p>I shall also preface the rest with &#8220;IANAL&#8221;, but I don&#8217;t think a review of this case requires a law degree.<\/p>\n<h3>Command-Shift-R<\/h3>\n<p>To refresh memories (or create lasting new ones), <code>robots.txt<\/code> is a file that is placed at the top of a web site domain (e.g. <a href=\"https:\/\/rud.is\/robots.txt\">https:\/\/rud.is\/robots.txt<\/a>) that contains <a href=\"https:\/\/en.wikipedia.org\/wiki\/Robots_exclusion_standard\">robots exclusion standard<\/a> rules. These rules tell bots (NOTE: if you write a scraper, you&#8217;ve written a scraping bot) what they can or cannot scrape and what &#8212; if any &#8212; delay should be placed between scraping efforts by said bot.<\/p>\n<p>R has two CRAN packages for dealing with these files\/rules: <a href=\"https:\/\/cran.r-project.org\/web\/packages\/robotstxt\/\"><code>robotstxt<\/code><\/a> by Peter Meissner and <a href=\"https:\/\/cran.r-project.org\/web\/packages\/spiderbar\/index.html\"><code>spiderbar<\/code><\/a> by me. They are not competitors, but are designed much like Reese&#8217;s Peanut Butter cups &#8212; to go together (though Peter did some <em>wicked good testing<\/em> and noted a possible error in the underlying C++ library I use that could generate Type I or Type II errors in certain circumstances) and each has some specialization. I note them now because you don&#8217;t have an excuse not to check <code>robots.txt<\/code> given that two CRAN packages are available. Python folks have (at a minimum) <code>robotparser<\/code> and <code>reppy<\/code>. Node, Go and other modern languages all have at least one module\/library\/package available as well. No. 
Excuses.<\/p>\n<h3>Your Point?<\/h3>\n<p><em>(Y&#8217;all are always in a rush, eh?)<\/em><\/p>\n<p>This October 2017 Texas ruling references a <a href=\"https:\/\/scholar.google.com\/scholar_case?case=17446744354402591987&amp;hl=en&amp;as_sdt=40000006\">2007 ruling by a District Court in Pennsylvania<\/a>. I dug in a bit through <a href=\"https:\/\/scholar.google.com\/scholar?hl=en&amp;as_sdt=40000003&amp;q=%22robots.txt%22&amp;btnG=\">searchable Federal case law<\/a> for mentions of <code>robots.txt<\/code> and there aren&#8217;t many U.S. cases that mention this control, though I am amused a small cadre of paralegals had to type <code>robots.txt<\/code> over and over again.<\/p>\n<p>The dismissal request on the grounds that the CFAA did not apply was <em>summarily rejected<\/em>. Why? The defendant provided proof that they monitor for scraping activity that violates the <code>robots.txt<\/code> rules and that they use the Windows Firewall (ugh, they use Windows for web serving) to block offending IP addresses when they discover them.<\/p>\n<p>Nuances came out further along in the dismissal text noting that user-interactive viewing of the member profiles on the site was well within the T&amp;Cs but that the defendant <em>&#8220;never authorized [the use of] automated bots to download over 500,000 profiles&#8221;<\/em> nor to have that data used for commercial purposes.<\/p>\n<p>The kicker (for me, anyway) is the last paragraph of the document in the Conclusion where the defendant asserts that:<\/p>\n<ul>\n<li><code>robots.txt<\/code> is in fact a bona-fide technological measure to effectively control access to copyrighted materials<\/li>\n<li>the &#8220;Internet industry&#8221; (I seriously dislike lawyers for wording just like that) has recognized <code>robots.txt<\/code> as a standard for controlling automated access to resources<\/li>\n<li><code>robots.txt<\/code> has been a valid enforcement mechanism since 1994<\/li>\n<\/ul>\n<p>The good bit is: 
<em>&#8220;Whether it actually qualifies in this case will be determined definitively at summary judgment or by a jury.&#8221;<\/em> To me, this sounds like a ruling by a jury\/judge in favor of <code>robots.txt<\/code> could make it much stronger case law for future misuse claims.<\/p>\n<p>With that in mind:<\/p>\n<ul>\n<li><strong>Site owners<\/strong>: USE <code>robots.txt<\/code>, if for no other reason than to aid legitimate researchers who want to make use of your data for valid scientific purposes, education or to create non-infringing content or analyses that will be a benefit to the public good. You can also use it to legally protect your content (but there are definitely nuances around how you do that).<\/li>\n<li><strong>Scrapers<\/strong>: Check and obey <code>robots.txt<\/code> rules. You have no technological excuse not to, and ignoring them really looks like it could come back to haunt you in the very near future.<\/li>\n<\/ul>\n<h3>FIN<\/h3>\n<p>I&#8217;ve set up an alert for when future rulings come out for this case and will toss up another post here or on the work-blog (so I can run it by our very-non-skeezy legal team) when it pops up again.<\/p>\n<div style=\"width:100%; text-align:right\"><span style=\"font-size:8pt\"><a href=\"https:\/\/unsplash.com\/photos\/0E_vhMVqL9g\">&#8220;Best Friends&#8221;<\/a> image by Andy Kelly. Used with permission.<\/span><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Despite my week-long Twitter consumption sabbatical (helped &#8212; in part &#8212; by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to this little gem. 
Said gem [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6954,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[91],"tags":[810,803,801,802],"class_list":["post-6945","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-r","tag-post","tag-robots-exclusion-standard","tag-robots-txt","tag-robotstxt"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>I, For One, Welcome Our Forthcoming New robots.txt Overlords - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"I, For One, Welcome Our Forthcoming New robots.txt Overlords - rud.is\" \/>\n<meta property=\"og:description\" content=\"Despite my week-long Twitter consumption sabbatical (helped &#8212; in part &#8212; by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to this little gem. 
Said gem [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2017-11-03T13:18:12+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-03-07T21:57:13+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1\" \/>\n\t<meta property=\"og:image:width\" content=\"1080\" \/>\n\t<meta property=\"og:image:height\" content=\"720\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"I, For One, Welcome Our Forthcoming New robots.txt 
Overlords\",\"datePublished\":\"2017-11-03T13:18:12+00:00\",\"dateModified\":\"2018-03-07T21:57:13+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/\"},\"wordCount\":964,\"commentCount\":6,\"publisher\":{\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"image\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1\",\"keywords\":[\"post\",\"robots exclusion standard\",\"robots.txt\",\"robotstxt\"],\"articleSection\":[\"R\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/\",\"url\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/\",\"name\":\"I, For One, Welcome Our Forthcoming New robots.txt Overlords - 
rud.is\",\"isPartOf\":{\"@id\":\"https:\/\/rud.is\/b\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1\",\"datePublished\":\"2017-11-03T13:18:12+00:00\",\"dateModified\":\"2018-03-07T21:57:13+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage\",\"url\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1\",\"width\":1080,\"height\":720},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/rud.is\/b\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"I, For One, Welcome Our Forthcoming New robots.txt Overlords\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/rud.is\/b\/#website\",\"url\":\"https:\/\/rud.is\/b\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. 
All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/rud.is\/b\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\/\/rud.is\"],\"url\":\"https:\/\/rud.is\/b\/author\/hrbrmstr\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"I, For One, Welcome Our Forthcoming New robots.txt Overlords - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/","og_locale":"en_US","og_type":"article","og_title":"I, For One, Welcome Our Forthcoming New robots.txt Overlords - rud.is","og_description":"Despite my week-long Twitter consumption sabbatical (helped &#8212; in part &#8212; by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to this little gem. Said gem [&hellip;]","og_url":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/","og_site_name":"rud.is","article_published_time":"2017-11-03T13:18:12+00:00","article_modified_time":"2018-03-07T21:57:13+00:00","og_image":[{"width":1080,"height":720,"url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1","type":"image\/jpeg"}],"author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"I, For One, Welcome Our Forthcoming New robots.txt Overlords","datePublished":"2017-11-03T13:18:12+00:00","dateModified":"2018-03-07T21:57:13+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/"},"wordCount":964,"commentCount":6,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"image":{"@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1","keywords":["post","robots exclusion standard","robots.txt","robotstxt"],"articleSection":["R"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/","url":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/","name":"I, For One, Welcome Our Forthcoming New robots.txt Overlords - 
rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"primaryImageOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage"},"image":{"@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1","datePublished":"2017-11-03T13:18:12+00:00","dateModified":"2018-03-07T21:57:13+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#primaryimage","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1","width":1080,"height":720},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2017\/11\/03\/i-for-one-welcome-our-forthcoming-new-robots-txt-overlords\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"I, For One, Welcome Our Forthcoming New robots.txt Overlords"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. 
#rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/11\/andy-kelly-402111.jpg?fit=1080%2C720&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p23idr-1O1","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":6134,"url":"https:\/\/rud.is\/b\/2017\/07\/28\/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r\/","url_meta":{"origin":6945,"position":0},"title":"Analyzing &#8220;Crawl-Delay&#8221; Settings in Common Crawl robots.txt Data with R","author":"hrbrmstr","date":"2017-07-28","format":false,"excerpt":"One of my tweets that referenced an excellent post about the ethics of web scraping garnered some interest: Apologies for a Medium link but if you do ANY web scraping, you need to read this #rstats \/\/ Ethics in Web Scraping https:\/\/t.co\/y5YxvzB8Fd\u2014 boB Rudis (@hrbrmstr) July 26, 2017 If you\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/Cursor_and_RStudio.png?fit=1200%2C620&ssl=1&resize=1050%2C600 
3x"},"classes":[]},{"id":6465,"url":"https:\/\/rud.is\/b\/2017\/09\/25\/speeding-up-digital-arachinds\/","url_meta":{"origin":6945,"position":1},"title":"Speeding Up Digital Arachnids","author":"hrbrmstr","date":"2017-09-25","format":false,"excerpt":"spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It's got C++ under the hood. Can you scrape, from a site? Test with can_fetch(), TRUE == alright Hey, there There goes\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/spiderbar_1-1.png?fit=1200%2C720&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":12142,"url":"https:\/\/rud.is\/b\/2019\/04\/12\/a-note-to-our-community-on-how-to-hide-your-content-from-search-engines\/","url_meta":{"origin":6945,"position":2},"title":"A Note to Our Community On How To Hide Your Content From Search Engines","author":"hrbrmstr","date":"2019-04-12","format":false,"excerpt":"UPDATE 2019-04-17 \u2014 The example at the bottom which shows that the, er, randomly chosen site has the offending <meta> tag present is an old result. As of this update timestamp, that robots noindex tag is not on the site. 
Since the presence status of that tag is in flux,\u2026","rel":"","context":"In &quot;Leadership&quot;","block_context":{"text":"Leadership","link":"https:\/\/rud.is\/b\/category\/leadership\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6685,"url":"https:\/\/rud.is\/b\/2017\/10\/09\/enabling-concerned-visitors-ethical-security-researchers-with-security-txt-web-security-policies-plus-analyze-them-at-scale-with-r\/","url_meta":{"origin":6945,"position":3},"title":"Enabling Concerned Visitors &#038; Ethical Security Researchers with security.txt Web Security Policies (plus analyze them at-scale with R)","author":"hrbrmstr","date":"2017-10-09","format":false,"excerpt":"I've blogged a bit about robots.txt --- the rules file that documents a sites \"robots exclusion\" standard that instructs web crawlers what they can and cannot do (and how frequently they should do things when they are allowed to). This is a well-known and well-defined standard, but it's not mandatory\u2026","rel":"","context":"In &quot;Information Security&quot;","block_context":{"text":"Information Security","link":"https:\/\/rud.is\/b\/category\/information-security\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11670,"url":"https:\/\/rud.is\/b\/2018\/12\/02\/more-scraping-ethics-gone-awry-and-why-do-this-when-theres-a-free-api\/","url_meta":{"origin":6945,"position":4},"title":"More &#8220;Scraping Ethics Gone Awry&#8221; and &#8220;Why Do This When There&#8217;s a Free API?&#8221;","author":"hrbrmstr","date":"2018-12-02","format":false,"excerpt":"I can't seem to free my infrequently-viewed email inbox from \"you might like!\" notices by the content-lock-in site Medium. This one made it to the iOS notification screen (otherwise I'd've been blissfully unaware of it and would have saved you the trouble of reading this). 
Today, they sent me this\u2026","rel":"","context":"In &quot;web scraping&quot;","block_context":{"text":"web scraping","link":"https:\/\/rud.is\/b\/category\/web-scraping\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6385,"url":"https:\/\/rud.is\/b\/2017\/09\/19\/pirating-web-content-responsibly-with-r\/","url_meta":{"origin":6945,"position":5},"title":"Pirating Web Content Responsibly With R","author":"hrbrmstr","date":"2017-09-19","format":false,"excerpt":"International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back. There will be no\u2026","rel":"","context":"In &quot;data wrangling&quot;","block_context":{"text":"data wrangling","link":"https:\/\/rud.is\/b\/category\/data-wrangling\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/Plot_Zoom-2.png?fit=1200%2C917&ssl=1&resize=1050%2C600 
3x"},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6945","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=6945"}],"version-history":[{"count":0,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6945\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media\/6954"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=6945"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=6945"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=6945"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}