HTTP Headers Hashing (HHHash) is a technique developed by Alexandre Dulaunoy to generate a fingerprint of an HTTP server based on the headers it returns. It applies a one-way hash to the list of header keys returned by the server. The HHHash value is calculated by concatenating the header keys in the order the server returned them, with each key separated by a colon, and then taking the SHA256 of that concatenated string. HHHash also incorporates a version identifier (the "1" in the "hhh:1:" prefix) so the hashing function can be changed later.
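To make the calculation concrete, here is a minimal sketch of the steps described above. It assumes the {digest} package is installed, and `sketch_hhhash` is a hypothetical name used for illustration; it is not the {hhhash} API and may differ from the real implementation in details.

```r
library(digest)  # assumption: the {digest} package is available

# Hypothetical helper (not part of {hhhash}): join the header keys with ":"
# in the order received, take the SHA256 of that string, and prefix the
# result with a version tag.
sketch_hhhash <- function(keys, version = 1L) {
  joined <- paste(keys, collapse = ":")
  # serialize = FALSE hashes the raw string rather than an R object dump
  hash <- digest(joined, algo = "sha256", serialize = FALSE)
  sprintf("hhh:%d:%s", version, hash)
}

sketch_hhhash(c("Date", "Server", "Content-Type"))
sketch_hhhash(c("Server", "Date", "Content-Type"))  # same keys, different order
```

The two calls return different values, which is the point: the fingerprint depends on the order (and the case) of the keys, not just on which keys are present.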
While effective, HHHash’s performance relies heavily on the characteristics of the HTTP requests, so correlations are typically only established using the same crawler parameters. Locality-sensitive hashing (LSH) could be used to calculate distances between sets of headers for more efficient comparisons. There are some limitations with some LSH algorithms (such as the need to pad content to a minimum byte length) that make the initial use of SHA256 hashes a bit more straightforward.
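As a rough illustration of what a "distance between sets of headers" could look like, here is a small sketch that computes exact Jaccard similarity between two sets of header keys. To be clear, this is not LSH: it is the plain set measure that MinHash-style LSH schemes approximate, and it ignores header order. `jaccard_headers` is a hypothetical helper, not something from {hhhash}.

```r
# Hypothetical helper: exact Jaccard similarity between two sets of header
# keys (size of the intersection divided by size of the union).
jaccard_headers <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

x <- c("Date", "Server", "ETag", "Content-Type")
y <- c("Date", "Server", "Content-Type", "X-Frame-Options")

jaccard_headers(x, y)  # 3 shared keys out of 5 distinct keys = 0.6
```

Unlike the SHA256 approach, where a single changed header produces a completely different value, a measure like this degrades gradually as header sets diverge.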
Alexandre made a Python library for it, and I cranked out an R package for it as well.
There are three functions exposed by {hhhash}:
build_hash_from_response: Build a hash from headers in a curl response object
build_hash_from_url: Build a hash from headers retrieved from a URL
hash_headers: Build a hash from a vector of HTTP header keys
The build_hash_from_url function relies on {curl} rather than {httr}, since {httr} uses curl::parse_headers() which (rightfully so) lowercases the header keys. We need to preserve both order and case for the hash to be useful.
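For a sense of what "preserving order and case" involves, here is a base R sketch that pulls the header keys out of raw response header text without touching their case. `extract_header_keys` is a hypothetical helper, not the package's internals. With {curl}, the raw text could come from something like `rawToChar(res$headers)` on a `curl_fetch_memory()` result; the sketch assumes a single header block and does not handle redirects, which produce several.

```r
# Hypothetical helper: extract header keys from raw HTTP response header
# text, keeping the original order and case.
extract_header_keys <- function(raw_headers) {
  lines <- strsplit(raw_headers, "\r?\n")[[1]]
  # Keep only "Name: value" lines; this drops the status line and blanks
  lines <- lines[grepl("^[^:[:space:]]+:", lines)]
  # Everything before the first colon is the header key
  sub(":.*$", "", lines)
}

raw <- paste0(
  "HTTP/1.1 200 OK\r\n",
  "Date: Mon, 10 Jul 2023 12:00:00 GMT\r\n",
  "Server: Apache\r\n",
  "Content-Type: text/html\r\n\r\n"
)

extract_header_keys(raw)
## [1] "Date"         "Server"       "Content-Type"
```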
Here is some sample usage:
remotes::install_github("hrbrmstr/hhhash")
library(hhhash)
build_hash_from_url("https://www.circl.lu/")
## [1] "hhh:1:78f7ef0651bac1a5ea42ed9d22242ed8725f07815091032a34ab4e30d3c3cefc"
res <- curl::curl_fetch_memory("https://www.circl.lu/", curl::new_handle())
build_hash_from_response(res)
## [1] "hhh:1:78f7ef0651bac1a5ea42ed9d22242ed8725f07815091032a34ab4e30d3c3cefc"
c(
  "Date", "Server", "Strict-Transport-Security",
  "Last-Modified", "ETag", "Accept-Ranges",
  "Content-Length", "Content-Security-Policy",
  "X-Content-Type-Options", "X-Frame-Options",
  "X-XSS-Protection", "Content-Type"
) -> keys
hash_headers(keys)
## [1] "hhh:1:78f7ef0651bac1a5ea42ed9d22242ed8725f07815091032a34ab4e30d3c3cefc"