I, For One, Welcome Our Forthcoming New robots.txt Overlords

Despite my week-long Twitter consumption sabbatical (helped — in part — by the nigh week-long internet and power outage here in Maine), I still catch useful snippets from folks. My cow-orker @dabdine shunted a tweet by @terrencehart into a Slack channel this morning, and said tweet contained a link to this little gem. Said gem is the text of a very recent ruling from a District Court in Texas and deals with a favourite subject of mine: robots.txt.

The background of the case is that there were two parties who both ran websites for oil and gas professionals that include job postings. One party filed a lawsuit against the other asserting that the they hacked into their system and accessed and used various information in violation of the Computer Fraud and Abuse Act (CFAA), the Stored Wire and Electronic Communications and Transactional Records Access Act (SWECTRA), the Racketeer Influenced and Corrupt Organizations Act (RICO), the Texas Harmful Access by Computer Act (THACA), the Texas Theft Liability Act (TTLA), and the Texas Uniform Trade Secrets Act (TUTS). They also asserted common law claims of misappropriation of confidential information, conversion, trespass to chattels, fraud, breach of fiduciary duty, unfair competition, tortious interference with present and prospective business relationships, civil conspiracy, and aiding and abetting.

The other party filed a motion for dismissal on a number of grounds involving legalese on Terms & Conditions, a rebuttal of CFAA claims and really gnarly legalese around copyrights. There are more than a few paragraphs that make me glad none of my offspring have gone into or have a desire to go into the legal profession. One TLDR here is that T&Cs do, in fact, matter (though that is definitely dependent upon the legal climate where you live or have a case filed against you). We’re going to focus on the DMCA claim which leads us to the robots.txt part.

I shall also preface the rest with “IANAL”, but I don’t think a review of this case requires a law degree.

Command-Shift-R

To refresh memories (or create lasting new ones), robots.txt is a file that is placed at the top of a web site domain (i.e. https://rud.is/robots.txt) that contains robots exclusion standard rules. These rules tell bots (NOTE: if you write a scraper, you’ve written a scraping bot) what they can or cannot scrape and what — if any — delay should be placed between scraping efforts by said bot.

R has two CRAN packages for dealing with these files/rules: robotstxt by Peter Meissner and spiderbar by me. They are not competitors, but are designed much like Reese’s Peanut Butter cups — to go together (though Peter did some wicked good testing and noted a possible error in the underlying C++ library I use that could generate Type I or Type II in certain circumstances) and each has some specialization. I note them now because you don’t have an excuse not to check robots.txt given two CRAN packages being available. Python folks have (at a minimum) robotparser and reppy. Node, Go and other, modern languages all have at least one module/library/package available as well. No. Excuses.

Your Point?

(Y’all are always in a rush, eh?)

This October, 2017 Texas ruling references a 2007 ruling by a District Court in Pennsylvania. I dug in a bit through searchable Federal case law for mentions of robots.txt and there aren’t many U.S. cases that mention this control, though I am amused a small cadre of paralegals had to type robots.txt over-and-over again.

The dismissal request on the grounds that the CFAA did not apply was summarily rejected. Why? The defendant provided proof that they monitor for scraping activity that violates the robots.txt rules and that they use the Windows Firewall (ugh, they use Windows for web serving) to block offending IP addresses when they discover them.

Nuances came out further along in the dismissal text noting that user-interactive viewing of the member profiles on the site was well-within the T&Cs but that the defendant “never authorized [the use of] automated bots to download over 500,000 profiles” nor to have that data used for commercial purposes.

The kicker (for me, anyway) is the last paragraph of the document in the Conclusion where the defendant asserts that:

robots.txt is in fact a bona-fide technological measure to effectively control access to copyright materials
the “Internet industry” (I seriously dislike lawyers for wording just like that) has recognized robots.txt as a standard for controlling automated access to resources
robots.txt has been a valid enforcement mechanism since 1994

The good bit is: -“Whether it actually qualifies in this case will be determined definitively at summary judgment or by a jury.”_ To me, this sounds like a ruling by a jury/judge in favor of robots.txt could mean that it becomes much stronger case law for future misuse claims.

With that in mind:

Site owners: USE robots.txt, if — for no other reason — to aid legitimate researchers who want to make use of your data for valid scientific purposes, education or to create non-infringing content or analyses that will be a benefit to the public good. You can also use it to legally protect your content (but there are definitely nuances around how you do that).
Scrapers: Check and obey robots.txt rules. You have no technological excuse not to and not doing so really appears that it could come back to haunt you in the very near future.

FIN

I’ve setup an alert for when future rulings come out for this case and will toss up another post here or on the work-blog (so I can run it by our very-non-skeezy legal team) when it pops up again.

“Best Friends” image by Andy Kelly. Used with permission.

4 Comments

- MikeJackTzen
- Posted 2017-11-08 at 16:35
- Permalink
- Reply
if someone uses spiderbar and robotstxt

and specifically uses the function to find crawl delays

crawl_delays()

The doc says

-1 will be returned for any listed agent without a crawl delay setting

If we get a -1, does that give the greenlight to scrape without delay?
- - hrbrmstr
  - Posted 2017-11-09 at 10:41
  - Permalink
  - Reply
  My view on that (there’s not “hard+fast rule” from the standard) is to use the research I’ve done (https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/) and choose a courtesy setting of 5s. However, that’s just a personal guideline. -1 does mean that the site failed to or deliberately did not set a crawl-delay and you get to pick.
  - - MikeJackTzen
    - Posted 2017-11-09 at 12:43
    - Permalink
    - Reply
    Gotcha, thanks for your advice. I’ve personally set a delay of more than 5 seconds as a default
- tembok api
- Posted 2026-05-23 at 01:40
- Permalink
- Reply
Thanks for sharing this case. Regarding the default 5-second courtesy delay you mentioned in the comments—do you think the developer community should collectively push for a standardized ‘ethical scraping’ framework, or will it always come down to individual site firewalls and legal battles to enforce compliance?

3 Trackbacks/Pingbacks

By I, For One, Welcome Our Forthcoming New robots.txt Overlords – Cloud Data Architect on 03 Nov 2017 at 3:00 pm

[…] leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data […]
By I, For One, Welcome Our Forthcoming New robots.txt Overlords – Mubashir Qasim on 03 Nov 2017 at 6:18 pm

[…] article was first published on R – rud.is, and kindly contributed to […]
By I, For One, Welcome Our Forthcoming New robots.txt Overlords | A bunch of data on 04 Nov 2017 at 3:41 pm

[…] article was first published on R – rud.is, and kindly contributed to […]

rud.is