Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks”: the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what it is using, most of these stack components leave a fingerprint: bits and pieces that we can put together to identify them.
Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now, enter the URL of your own blog or company, and come back to see why I’ve mentioned it.
Back? Good.
If you explored that site a bit, you likely saw a link to their GitHub repo. There you’ll find JavaScript source code and even some browser plugins along with the core “fingerprint database”, apps.json. If you poke around on their repo, you’ll find the repo’s wiki and eventually see that folks have made various unofficial ports to other languages.
By now, you’ve likely guessed that this blog post is introducing an R port of “wappalyzer” and if you have made such a guess, you’re correct! The rappalyzer package is a “work-in-progress” but it’s usable enough to get some feedback on the package API, solicit contributors and show some things you can do with it.
Just rappalyze something already!
For the moment, one real function is exposed: rappalyze(). Feed it a URL string or an httr response object and, if the analysis was fruitful, you’ll get back a data frame with all the identified tech stack components. Let’s see what jquery.com is made of:
devtools::install_github("hrbrmstr/rappalyzer")
library(rappalyzer)
rappalyze("https://jquery.com/")
## # A tibble: 8 x 5
##   tech       category              match_type version confidence
##   <chr>      <chr>                 <chr>      <chr>        <dbl>
## 1 CloudFlare CDN                   headers    <NA>           100
## 2 Debian     Operating Systems     headers    <NA>           100
## 3 Modernizr  JavaScript Frameworks script     <NA>           100
## 4 Nginx      Web Servers           headers    <NA>           100
## 5 PHP        Programming Languages headers    5.4.45         100
## 6 WordPress  CMS                   meta       4.5.2          100
## 7 WordPress  Blogs                 meta       4.5.2          100
## 8 jQuery     JavaScript Frameworks script     1.11.3         100
If you played around on the Wappalyzer site you saw that it “spiders” many URLs to try to get a very complete picture of components that a given site may use. I just used the main jQuery site in the above example and we managed to get quite a bit of information from just that interaction.
Wappalyzer (and, hence, rappalyzer) works by comparing sets of regular expressions against different parts of a site’s content. For now, rappalyzer checks:
- the site URL (the one after a site’s been contacted & content retrieved)
- HTTP headers (an invisible set of metadata browsers/crawlers & servers share)
- HTML content
- scripts
- meta tags
Wappalyzer-proper runs in a browser-context, so it can pick up DOM elements and JavaScript environment variables which can provide some extra information not present in the static content.
As you can see from the returned data frame, each tech component has one or more associated categories as well as a version number (if it could be guessed) and a value indicating how confident the guess was. It will also show where the guess came from (headers, scripts, etc).
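To make the matching idea a bit more concrete, here is a minimal base R sketch of regex-based fingerprinting. Everything in it is made up for illustration: the two-entry fingerprints list, the match_fingerprints() helper and the sample page parts are hypothetical. It is not the apps.json format, and it is not how rappalyzer works internally (that runs the Wappalyzer JavaScript through V8).

```r
# Hypothetical, heavily simplified fingerprints: each tech has one regex that
# is checked against one part of the page. The first capture group, when the
# pattern has one, is treated as the version.
fingerprints <- list(
  Nginx  = list(source = "headers", pattern = "^server:\\s*nginx"),
  jQuery = list(source = "script",  pattern = "jquery-([0-9]+(\\.[0-9]+)*)")
)

# Made-up page parts standing in for a real response
page <- list(
  headers = c("server: nginx", "content-type: text/html"),
  script  = c("https://example.com/js/jquery-1.11.3.min.js")
)

match_fingerprints <- function(page, fingerprints) {
  rows <- lapply(names(fingerprints), function(tech) {
    fp <- fingerprints[[tech]]
    candidates <- page[[fp$source]]
    # regexec() returns the full match plus capture groups for each candidate
    m <- regmatches(candidates, regexec(fp$pattern, candidates, ignore.case = TRUE))
    hits <- Filter(function(x) length(x) > 0, m)
    if (length(hits) == 0) return(NULL)
    first <- hits[[1]]
    version <- if (length(first) > 1 && nzchar(first[2])) first[2] else NA_character_
    data.frame(tech = tech, match_type = fp$source, version = version,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows) # NULL entries (no match) are dropped by rbind()
}

match_fingerprints(page, fingerprints)
```

The result has one row per detected component, with the page part that produced the match and a version only where the pattern captured one, which mirrors the shape (though not the detail) of the tech, match_type and version columns above.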
A peek under the hood
I started making a pure R port of Wappalyzer but there are some frustrating “gotchas” with the JavaScript regular expressions that would have meant spending time identifying all the edge cases. Since the main Wappalyzer source is in JavaScript it was trivial to down-port it to a V8-compatible version and expose it via an R function. A longer-term plan is to deal with the regular expression conversion, but I kinda needed the functionality sooner than later (well, wanted it in R, anyway).
On the other hand, keeping it in JavaScript has some advantages, especially with the advent of chimera (a phantomjs alternative that can put a full headless browser into a V8 engine context). Using that would mean getting the V8 package ported to a more modern version of the V8 source code, which isn’t trivial, but doable. Using chimera would make it possible to identify even more tech stack components.
Note also that these tech stack analyses can be dreadfully slow. That’s not due to using V8: the ones that are slow are slow in all the language ports. The slowness comes from the way the regular expressions are checked against the site content and from relying on regular expressions to begin with. I’ve got some ideas to speed things up and may also introduce some “guide rails” to prevent lengthy operations and to avoid some checks if page characteristics meet certain criteria. Drop an issue if you have ideas as well.
Why is this useful?
Well, you can interactively see what a site uses like we did above. But, we can also look for commonalities across a number of sites. We’ll do this in a small example now. You can skip the exposition and just work with the RStudio project if that’s how you’d rather roll.
Crawling 1,000 orgs
I have a recent-ish copy of a list of Fortune 1000 companies and their industry sectors along with their website URLs. Let’s see how many of those sites we can still access and what we identify.
I’m going to pause for just a minute to revisit some “rules” in the context of this operation.
Specifically, I’m not:
- reading each site’s terms of service/terms & conditions
- checking each site’s robots.txt file
- using a uniquely identifying non-browser string crawler user-agent (since we need the site to present content like it would to a browser)
However, I’m also not:
- using the site content in any way except to identify tech stack components (something dozens of organizations do)
- scraping past the index page HTML content, so there’s no measurable resource usage
I believe I’ve threaded the ethical needle properly (and, I do this for a living), but if you’re uncomfortable with this you should use a different target URL list.
Back to the code. We’ll need some packages:
library(hrbrthemes)
library(tidyverse)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(urltools)
library(rappalyzer) # devtools::install_github("hrbrmstr/rappalyzer")
library(rprojroot)
# I'm also using a user agent shortcut from the splashr package which is on CRAN
Now, I’ve included data in the RStudio project GH repo since it can be used later on to test out changes to rappalyzer. That’s the reason for the if/file.exists tests. As I’ve said before, be kind to sites you scrape and cache content whenever possible to avoid consuming resources that you don’t own.
Let’s take a look at our crawler code:
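rt <- find_rstudio_root_file()
# read the org list up front: the join further down needs f1k even when
# the crawl itself is skipped because cached responses exist
f1k <- read_csv(file.path(rt, "data", "f1k.csv"))
if (!file.exists(file.path(rt, "data", "f1k_gets.rds"))) {
  targets <- pull(f1k, website)
  results <- list()
  errors <- list()
  # callback for completed requests: tick the "progress bar", keep the response
  OK <- function(res) {
    cat(".", sep="")
    results <<- c(results, list(res))
  }
  # callback for failed requests: mark the failure, keep the error message
  BAD <- function(err_msg) {
    cat("X", sep="")
    errors <<- c(errors, list(err_msg))
  }
  pool <- multi_set(total_con = 20)
  # queue one asynchronous request per target URL
  walk(targets, ~{
    multi_add(
      new_handle(url = .x, useragent = splashr::ua_macos_chrome, followlocation = TRUE, timeout = 60),
      OK, BAD
    )
  })
  # perform all queued requests
  multi_run(pool = pool)
  write_rds(results, file.path(rt, "data", "f1k_gets.rds"), compress = "xz")
} else {
  results <- read_rds(file.path(rt, "data", "f1k_gets.rds"))
}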
If you were expecting read_html() and/or GET() calls you’re likely surprised at what you see. We’re trying to pull content from a thousand web sites, some of which may not be there anymore (for good or just temporarily). Sequential calls would need many minutes with error handling. We’re using the asynchronous/parallel capabilities in the curl package to set up all 1,000 requests with the ability to capture the responses. There’s a hack-ish “progress” bar that uses . for “good” requests and X for ones that didn’t work. It still takes a little time (and you may need to tweak the total_con value on older systems) but way less than sequential GETs would have, and any errors are implicitly handled (ignored).
The next block does the tech stack identification:
if (!file.exists(file.path(rt, "data", "rapp_results.rds"))) {
  # keep only successful responses
  results <- keep(results, ~.x$status_code < 300)
  # wrap each curl result in a minimal httr-style response object
  map(results, ~{
    list(
      url = .x$url,
      status_code = .x$status_code,
      content = .x$content,
      headers = httr:::parse_headers(.x$headers)
    ) -> res
    class(res) <- "response"
    res
  }) -> results
  # this takes a *while*
  pb <- progress_estimated(length(results))
  map_df(results, ~{
    pb$tick()$print()
    rap_df <- rappalyze(.x)
    # tag each non-empty result with the URL it came from
    if (nrow(rap_df) > 0) rap_df <- mutate(rap_df, url = .x$url)
  }) -> rapp_results
  write_rds(rapp_results, file.path(rt, "data", "rapp_results.rds"))
} else {
  rapp_results <- read_rds(file.path(rt, "data", "rapp_results.rds"))
}
We filter out non-“OK” (HTTP 200-ish response) curl results and turn them into just enough of an httr response object to be able to work with them.
Then we let rappalyzer work. I had time to kill so I let it run sequentially. In a production or time-critical context, I’d use the future package.
We’re almost down to data we can play with! Let’s join it back up with the original metadata:
left_join(
  mutate(rapp_results, host = domain(rapp_results$url)) %>%
    bind_cols(suffix_extract(.$host)),
  mutate(f1k, host = domain(website)) %>%
    bind_cols(suffix_extract(.$host)),
  by = c("domain", "suffix")
) %>%
  filter(!is.na(name)) -> rapp_results
length(unique(rapp_results$name))
## [1] 754
Note that some orgs seem to have changed hands/domains so we’re left with ~750 sites to play with (we could have tried to match more, but this is just an exercise). If you do the same curl-fetching exercise on your own, you may get fewer or more total sites. Internet crawling is fraught with peril and it’s not really possible to conveniently convey 100% production code in a blog post.
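If the join on domain and suffix (rather than on the full host) looks odd, this small base R sketch shows the idea: a crawled URL may land on www.example.com while the list has example.com, and those only line up once the subdomain is out of the way. The naive_split() helper and all of the hosts and names are hypothetical, and the helper is a deliberately crude stand-in for urltools::suffix_extract(), which handles multi-part suffixes like co.uk properly.

```r
# Naive stand-in for urltools::suffix_extract(): treats the last label as the
# suffix and the one before it as the domain. It does NOT handle multi-part
# suffixes such as "co.uk" and assumes every host has at least two labels.
naive_split <- function(host) {
  parts <- strsplit(host, ".", fixed = TRUE)
  data.frame(
    host   = host,
    domain = vapply(parts, function(p) p[length(p) - 1], character(1)),
    suffix = vapply(parts, function(p) p[length(p)], character(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up hosts: what the crawl ended up on vs. what the org list contains
crawled <- naive_split(c("www.example.com", "shop.example.org"))
listed  <- naive_split(c("example.com", "example.net"))
listed$name <- c("Example Co", "Other Co")

# Left join on domain + suffix, then drop crawled hosts with no matching org
joined  <- merge(crawled, listed, by = c("domain", "suffix"), all.x = TRUE)
matched <- joined[!is.na(joined$name), ]
matched
```

Only the example.com pair survives here. The .org host has no counterpart in the made-up list, which is the same kind of drop-off that takes us from 1,000 orgs down to ~750.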
Comparing “Fortune 1000” org web stacks
Let’s see how many categories we picked up:
xdf <- distinct(rapp_results, name, sector, category, tech)
sort(unique(xdf$category))
## [1] "Advertising Networks" "Analytics" "Blogs"
## [4] "Cache Tools" "Captchas" "CMS"
## [7] "Ecommerce" "Editors" "Font Scripts"
## [10] "JavaScript Frameworks" "JavaScript Graphics" "Maps"
## [13] "Marketing Automation" "Miscellaneous" "Mobile Frameworks"
## [16] "Programming Languages" "Tag Managers" "Video Players"
## [19] "Web Frameworks" "Widgets"
Plenty to play with!
Now, what do you think the most common tech component is across all these diverse sites?
count(xdf, tech, sort=TRUE)
## # A tibble: 115 x 2
##    tech                         n
##    <chr>                    <int>
##  1 jQuery                     572
##  2 Google Tag Manager         220
##  3 Modernizr                  197
##  4 Microsoft ASP.NET          193
##  5 Google Font API            175
##  6 Twitter Bootstrap          162
##  7 WordPress                  150
##  8 jQuery UI                  143
##  9 Font Awesome               118
## 10 Adobe Experience Manager    69
## # ... with 105 more rows
I had suspected it was jQuery (and hinted at that when I used it as the initial rappalyzer example). In a future blog post we’ll look at just how vulnerable orgs are based on which CDNs they use (many use similar jQuery and other resource CDNs, and use them insecurely).
Let’s see how broadly each tech stack category is used across the sectors:
group_by(xdf, sector) %>%
  count(category) %>%
  ungroup() %>%
  arrange(category) %>%
  mutate(category = factor(category, levels=rev(unique(category)))) %>%
  ggplot(aes(category, n)) +
  geom_boxplot() +
  scale_y_comma() +
  coord_flip() +
  labs(x=NULL, y="Tech/Services Detected across Sectors",
       title="Usage of Tech Stack Categories by Sector") +
  theme_ipsum_rc(grid="X")
That’s a lot of JavaScript. But, there’s also large-ish use of CMS, Web Frameworks and other components.
I wonder what the CMS usage looks like:
filter(xdf, category=="CMS") %>%
count(tech, sort=TRUE)
## # A tibble: 15 x 2
##    tech                         n
##    <chr>                    <int>
##  1 WordPress                   75
##  2 Adobe Experience Manager    69
##  3 Drupal                      68
##  4 Sitefinity                  26
##  5 Microsoft SharePoint        19
##  6 Sitecore                    14
##  7 DNN                         11
##  8 Concrete5                    6
##  9 IBM WebSphere Portal         4
## 10 Business Catalyst            2
## 11 Hippo                        2
## 12 TYPO3 CMS                    2
## 13 Contentful                   1
## 14 Orchard CMS                  1
## 15 SilverStripe                 1
Drupal and WordPress weren’t shocking to me (again, I do this for a living) but they may be to others when you consider this is the Fortune 1000 we’re talking about.
What about Web Frameworks?
filter(xdf, category=="Web Frameworks") %>%
count(tech, sort=TRUE)
## # A tibble: 7 x 2
##   tech                            n
##   <chr>                       <int>
## 1 Microsoft ASP.NET             193
## 2 Twitter Bootstrap             162
## 3 ZURB Foundation                61
## 4 Ruby on Rails                   2
## 5 Adobe ColdFusion                1
## 6 Kendo UI                        1
## 7 Woltlab Community Framework     1
Yikes! Quite a bit of ASP.NET. But, many enterprises haven’t migrated to more modern, useful platforms (yet).
FIN
As noted earlier, the data & code are on GitHub. There are many questions you can ask and answer with this data set and definitely make sure to share any findings you discover!
We’ll continue exploring different aspects of what you can glean from looking at web sites in a different way in future posts.
I’ll leave you with one interactive visualization that lets you explore the tech stacks by sector.
devtools::install_github("jeromefroe/circlepackeR")
library(circlepackeR)
library(data.tree)
library(treemap)
cpdf <- count(xdf, sector, tech)
cpdf$pathString <- paste("rapp", cpdf$sector, cpdf$tech, sep = "/")
stacks <- as.Node(cpdf)
circlepackeR(stacks, size = "n", color_min = "hsl(56,80%,80%)",
color_max = "hsl(341,30%,40%)", width = "100%", height="800px")