I’ve blogged a bit about robots.txt, the rules file that documents a site’s “robots exclusion” standard and instructs web crawlers what they can and cannot do (and how frequently they should do things when they are allowed to). This is a well-known and well-defined standard, but it’s not mandatory and is often ignored by crawlers and content owners alike.
There’s an emerging IETF draft for a different type of site metadata that content owners should absolutely consider adopting. This one defines “web security policies” for a given site and has much in common with the robots exclusion standard, including the name (security.txt) and format (policy directives are defined with simple syntax; see Chapter 5 of the Debian Policy Manual).
One core difference is that this file is intended for humans. If you are a general user who visits a site and notices something “off” (security-wise), or an honest, honorable security researcher who has found a vulnerability or weakness on a site, this security.txt file should make it easier to contact the appropriate folks at the site to help them identify and resolve security issues. The IETF abstract summarizes the intent well:
A big change from robots.txt is where the security.txt file goes. The IETF standard is still in draft state, so the location may change, but the current thinking is to put it at /.well-known/security.txt rather than in the top-level root (i.e., it’s not supposed to be at /security.txt). If you aren’t familiar with the .well-known directory, give RFC 5785 a read.
You can visit the general information site to find out more and install a development version of a browser extension that will make it easier to pull up this info in your browser if you find an issue.
Here’s the security.txt for my site:
    Contact: email@example.com
    Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
    Disclosure: Full
With that info, you know where to contact me, have the ability to encrypt your message and know that I’ll give you credit and will disclose the bugs openly.
So, Why the [R] tag?
Ah, yes. This post is in the R RSS category feed for a reason. I do at-scale analysis of the web for a living and will be tracking the adoption of security.txt across the internet over time (initially with the Umbrella Top 1m and a choice list of sites with more categorical data associated with them). My esteemed colleague @jhartftw is handling the crawling part, but I needed a way to speedily read in these files for a broader analysis. So, I made an R package:
It’s pretty easy to use. Here’s how to install it and use one of the functions to generate a security.txt target URL for a site:
    devtools::install_github("hrbrmstr/securitytxt")

    library(securitytxt)

    (xurl <- sectxt_url("https://rud.is/b"))
    ## [1] "https://rud.is/.well-known/security.txt"
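If you’re curious what that URL derivation amounts to, here’s a minimal base R sketch. It is not the package’s code: well_known_sectxt_url() is a hypothetical helper I’m making up for illustration, and it assumes the input always carries a scheme and a host.

```r
# Hypothetical helper (not part of securitytxt): build the draft's
# /.well-known/security.txt location from any URL on a site.
well_known_sectxt_url <- function(site_url) {
  # Capture the scheme and host, e.g. "https://rud.is" from "https://rud.is/b"
  m <- regmatches(
    site_url,
    regexpr("^[A-Za-z][A-Za-z0-9+.-]*://[^/?#]+", site_url)
  )
  if (length(m) == 0) stop("expected a URL with a scheme and host: ", site_url)
  # The draft currently places the file under /.well-known/
  paste0(m, "/.well-known/security.txt")
}

well_known_sectxt_url("https://rud.is/b")
## [1] "https://rud.is/.well-known/security.txt"
```

Note that the path component ("/b") is dropped on purpose, since the file lives at the site level and not under any particular page.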
This is how you read in and parse a security.txt file:
    (x <- sectxt(url(xurl)))
    ## <Web Security Policies Object>
    ## Contact: firstname.lastname@example.org
    ## Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
    ## Disclosure: Full
And, this is how you turn that into a usable data frame:
    sectxt_info(x)
    ##          key value
    ## 1    contact email@example.com
    ## 2 encryption https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399
    ## 3 disclosure Full
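To make the "Key: value" format a bit more concrete, here’s a rough base R sketch that turns directive lines into a similar key/value data frame. This is an illustration only and not how the package does it: parse_sectxt_lines() is a hypothetical name, and treating lines that start with "#" as comments is an assumption on my part.

```r
# Hypothetical helper (not the package's implementation): parse
# "Key: value" directive lines into a key/value data frame.
parse_sectxt_lines <- function(lines) {
  lines <- trimws(lines)
  # Drop blank lines and "#" comments (comment syntax is an assumption here)
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  # Keep only lines that look like directives
  lines <- lines[grepl("^[^:]+:", lines)]
  # Split on the first colon only, since values such as URLs contain colons
  data.frame(
    key   = tolower(trimws(sub(":.*$", "", lines))),
    value = trimws(sub("^[^:]*:", "", lines)),
    stringsAsFactors = FALSE
  )
}

txt <- c(
  "Contact: email@example.com",
  "Encryption: https://keybase.io/hrbrmstr/pgp_keys.asc?fingerprint=e5388172b81c210906f5e5605879179645de9399",
  "Disclosure: Full"
)

parse_sectxt_lines(txt)
```

Splitting on the first colon is the one design choice that matters here; a naive strsplit() on ":" would chop the Encryption URL in two.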
There’s also a function to validate that the keys are within the current IETF standard. That will become more useful once the standard moves out of draft status.
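The idea behind key validation can be sketched in a few lines of base R. To be clear, this is not the package’s validation function: check_sectxt_keys() is a hypothetical helper, and the allowed set below is an assumption that only contains the three keys used in my own file, so swap in the directive list from whichever draft revision you are targeting.

```r
# Assumption: only the three keys shown in this post's example are listed;
# replace with the directive list from the draft you are targeting.
allowed_keys <- c("contact", "encryption", "disclosure")

# Hypothetical helper: flag each key as known or unknown (case-insensitive)
check_sectxt_keys <- function(keys, allowed = allowed_keys) {
  keys <- tolower(trimws(keys))
  data.frame(key = keys, valid = keys %in% allowed, stringsAsFactors = FALSE)
}

check_sectxt_keys(c("Contact", "Encryption", "Disclosure", "Favourite-Colour"))
```

Returning a data frame (one row per key) rather than a single TRUE/FALSE makes it easier to tally which non-standard keys show up when looking across many sites.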
So, definitely adopt the standard and feel invited to kick the tyres on the package. Don’t hesitate to jump on board if you have ideas for how you’d like to extend the package, and drop a note in the comments if you have questions on it or on adopting the standard for your site.