Many R package authors (including myself) lump a collection of small, useful functions into some type of utils.R
file and usually do not export the functions since they are (generally) designed to work on package internals rather than expose their functionality via the exported package API. Just like Batman’s utility belt, which can be customized for any mission, any set of utilities in a given R package will also likely be different from those in other packages.
I thought it would be neat to take a look at:
- just how many packages have one or more util*.R files and what the most common file names are for them;
- utility function naming preferences (specifically snake_case, camelCase, or dot.case);
- what the most common "utility" function names are across the packages;
- coding style, specifically the ratios of blank lines and full-line comments to code size;
for all the published packages on CRAN.
There are many more questions one can ask and then use this corpus to answer, so we’ll close out the post with a link to it so any intrepid readers can do just that, especially since reproducing the first bit of this post would require a local CRAN mirror (which most folks — rightly so — do not have handy).
Acquiring and Transforming the Data We Need
Since I have a local CRAN mirror, it's just a matter of iterating through all the package tar.gz files in src/contrib and grepping through a tar listing of each for a pattern like R/util.*$. That pattern isn't perfect, but it's quick and we'll be able to filter out any files it catches that don't belong. I chose to use a small bash script for this, but it's possible to do this with R as well (an exercise left to the reader; there's a sketch just below for the curious). The resultant data file looks a bit like the output from an ls -l (linux-ish) directory listing:
-rw-r--r-- 0 hornik users 1658 Jun 5 2016 AHR/R/util.R
-rw-r--r-- 0 ligges users 12609 Dec 13 2016 ALA4R/R/utilities_internal.R
-rw-r--r-- 0 hornik users 0 Feb 24 2017 AWR.Kinesis/R/utils.R
-rw-r--r-- 0 ligges users 4127 Aug 30 2017 AlphaVantageClient/R/utils.R
-rw-r--r-- 0 ligges users 121 Jan 19 2017 AmyloGram/R/utils.R
-rw-r--r-- 0 ligges users 2873 Jan 17 23:04 DT/R/utils.R
-rw-r--r-- 0 ligges users 3055 Jan 17 2017 cleanr/inst/source/R/utils.R
drwxr-xr-x 0 ligges users 0 Sep 24 2017 JGR/java/org/rosuda/JGR/util/
I made sure to show a few examples of where a better search pattern would have helped ensure lines like the three at the bottom of that listing aren't included. But we all have to deal with imperfect data from time to time, so we'll handle those stragglers during the ingestion & cleanup process.
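(For folks who would rather stay in R for the listing step, here's a rough sketch of the same pass. It assumes a local mirror at /cran/src/contrib and uses archive::archive(), which, per that package's docs, lists a tarball's contents (path, size, date) without unpacking it. Scanning all of CRAN this way takes a while, and the pattern has the same imperfections as the bash one.)

library(archive)   # devtools::install_github("jimhester/archive")
library(tidyverse)

list.files("/cran/src/contrib", pattern = "\\.tar\\.gz$", full.names = TRUE) %>%
  map_df(~{
    archive(.x) %>%                                  # one row per archive member
      filter(str_detect(path, "/R/util[^/]*$")) %>%  # same rough 'util' pattern as the bash pass
      mutate(tgz = basename(.x))                     # keep track of which tarball it came from
  }) -> util_listing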
library(stringi)
library(hrbrthemes)
library(archive) # devtools::install_github("jimhester/archive")
library(tidyverse)
# I ran readr::type_convert() once and it returns this column type spec. By using it
# for subsequent conversions, we'll gain reproducibility and data format change
# detection capabilities "for free"
cols(
permsissions = col_character(),
links = col_integer(),
owner = col_character(),
group = col_character(),
size = col_integer(),
month = col_character(),
day = col_integer(),
year_hr = col_character(),
path = col_character()
) -> tar_cols
# Now, we parse the tar verbose ('ls -l') listing
stri_read_lines("~/Data/pkutils.txt") %>% # stringi was loaded so might as well use it
stri_split_regex(" +", 9, simplify = TRUE) %>% # split input into 9 columns
as_data_frame() %>% # ^^ returns a matrix but data frames are more useful for our work
set_names(names(tar_cols$cols)) %>% # column names are useful and we can use our colspec for it
type_convert(col_types = tar_cols) %>% # see comment block before cols()
mutate(day = sprintf("%02d", day)) %>% # now we'll work on getting the date pieces to be a Date
mutate(year_hr = case_when( # the year_hr field can be either %Y or %H:%M depending on file 'recency'
stri_detect_fixed(year_hr, ":") &
(month %in% c("Jan", "Feb", "Mar", "Apr")) ~ "2018", # if %H:%M but 'starter' months it's 2018
stri_detect_fixed(year_hr, ":") &
(month %in% c("Dec", "Nov", "Oct", "Sep", "Aug", "Jul", "Jun")) ~ "2017", # %H:%M & 'end' months
TRUE ~ year_hr # already in %Y format
)) %>%
mutate(date= lubridate::mdy(sprintf("%s %s, %s", month, day, year_hr))) %>% # get a Date
mutate(pkg = stri_match_first_regex(path, "^(.*)/R/")[,2]) %>% # extract package name (stri_extract is also usable here)
mutate(fil = basename(path)) %>% # extract just the file name
filter(!is.na(pkg)) %>% # handle one type of wrongly included file
filter(!stri_detect_fixed(pkg, "/")) %>% # and another
filter(!is.na(path)) -> xdf # and another; but we're done so we close with an assignment
glimpse(xdf)
## Observations: 1,746
## Variables: 12
## $ permsissions <chr> "-rw-r--r--", "-rw-r--r--", "-rw-r--r--", "-rw-r-...
## $ links <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ owner <chr> "hornik", "ligges", "hornik", "ligges", "ligges",...
## $ group <chr> "users", "users", "users", "users", "users", "her...
## $ size <int> 1658, 12609, 0, 4127, 121, 52, 36977, 34198, 3676...
## $ month <chr> "Jun", "Dec", "Feb", "Aug", "Jan", "Aug", "Jan", ...
## $ day <chr> "05", "13", "24", "30", "19", "10", "06", "10", "...
## $ year_hr <chr> "2016", "2016", "2017", "2017", "2017", "2017", "...
## $ path <chr> "AHR/R/util.R", "ALA4R/R/utilities_internal.R", "...
## $ date <date> 2016-06-05, 2016-12-13, 2017-02-24, 2017-08-30, ...
## $ pkg <chr> "AHR", "ALA4R", "AWR.Kinesis", "AlphaVantageClien...
## $ fil <chr> "util.R", "utilities_internal.R", "utils.R", "uti...
To the analysis!
Finding the Utility of ‘util’s
A careful look at the glimpse() listing shows we have 1,746 files whose names begin with util, but how many packages have at least one such file?
nrow(distinct(xdf, pkg))
## [1] 1397
That's roughly 10% of CRAN, but that doesn't mean the remaining packages lack "utility belt" functions; other authors may just have been more creative or deliberate with their file naming conventions.
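(To put a quick number on "roughly 10%", something like the following works against a current package index; the exact figure will drift as CRAN grows:)

n_util <- nrow(distinct(xdf, pkg))  # packages with at least one 'util' file
n_cran <- nrow(available.packages(contriburl = contrib.url("https://cloud.r-project.org", type = "source")))
scales::percent(n_util / n_cran)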
Readers with keen eyes may have noticed we spent some deliberate CPU cycles to get a Date column. Part of that was to show how to do it (mostly as an example for folks new to R), but we also did it so we could ask temporal questions, such as: are package "utility belts" a new thing? The data suggests that utility belts are products/attributes of more recently published or updated packages:
distinct(xdf, pkg, date) %>%
mutate(yr = as.integer(lubridate::year(date))) %>%
count(yr) %>%
complete(yr, fill=list(n=0)) %>%
ggplot(aes(yr, n)) +
geom_col(fill="lightslategray", width=0.65) +
labs(
x = NULL, y = "Package count",
title = "Recently published or updated packages tend to have more 'util'\nthan older/less actively-maintained ones",
subtitle = "Count of packages (by year) with 'util's"
) +
theme_ipsum_rc(grid="Y")
We could answer this more completely by going through the CRAN archives for all these packages, but for now we’ll just see which packages might have helped set this trend going:
distinct(xdf, pkg, date) %>%
arrange(date) %>%
print(n=20)
## # A tibble: 1,540 x 2
## date pkg
## 1 1980-01-01 bsts
## 2 2006-06-28 evdbayes
## 3 2006-11-29 hexView
## 4 2006-12-17 StatDataML
## 5 2007-10-05 tpr
## 6 2007-11-07 seqinr
## 7 2007-11-26 registry
## 8 2008-07-25 ramps
## 9 2008-10-23 RobAStBase
## 10 2009-02-23 vcd
## 11 2009-06-26 ttutils
## 12 2009-07-03 histogram
## 13 2009-11-27 polynom
## 14 2009-11-27 tau
## 15 2010-01-05 itertools
## 16 2010-01-22 tableplot
## 17 2010-06-09 rbugs
## 18 2011-03-17 playwith
## 19 2011-05-11 marelac
## 20 2011-10-11 timeSeries
## # ... with 1,520 more rows
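(If you do want to chase that fuller history, archived source tarballs on a CRAN mirror live under src/contrib/Archive/<package>/, so a rough sketch like this gets a first-pass "earliest archived version" date per package; file mtimes on a mirror are an approximation at best:)

oldest_archived <- function(pkg, mirror = "/cran") {
  tarballs <- list.files(
    file.path(mirror, "src/contrib/Archive", pkg),  # superseded releases live here
    pattern = "\\.tar\\.gz$", full.names = TRUE
  )
  if (length(tarballs) == 0) return(as.Date(NA))
  min(as.Date(file.info(tarballs)$mtime))
}

oldest_archived("vcd")  # one of the early 'util' adopters in the listing above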
Going back to our corpus, what are the most common names for these utility belt files?
count(xdf, fil, sort=TRUE) %>%
  mutate(pct = scales::percent(n/sum(n))) %>%
  print(n=20)
## # A tibble: 409 x 3
## fil n pct
## 1 utils.R 865 49.5%
## 2 utilities.R 145 8.3%
## 3 util.R 134 7.7%
## 4 utils.r 68 3.9%
## 5 utility.R 47 2.7%
## 6 utility_functions.R 25 1.4%
## 7 util.r 16 0.9%
## 8 utilities.r 14 0.8%
## 9 utils-pipe.R 9 0.5%
## 10 utilityFunctions.R 6 0.3%
## 11 utils-format.r 3 0.2%
## 12 util_functions.R 2 0.1%
## 13 util_rescale.R 2 0.1%
## 14 util-aux.R 2 0.1%
## 15 util-checkparam.R 2 0.1%
## 16 util-startarg.R 2 0.1%
## 17 utilcmst.R 2 0.1%
## 18 utilhot.R 2 0.1%
## 19 utilities_internal.R 2 0.1%
## 20 utility-functions.R 2 0.1%
## # ... with 389 more rows
Over 50% of these files show that other CRAN package authors are as "un-creative" as I am when it comes to naming them.
Let’s see how packed these belts are:
ggplot(xdf, aes(x="", size)) +
ggbeeswarm::geom_quasirandom(
fill="lightslategray", color="white",
alpha=1/2, stroke=0.25, size=3, shape=21
) +
geom_boxplot(fill="#00000000", outlier.colour = "#00000000") +
geom_text(
data=data_frame(), aes(x=-Inf, y=median(xdf$size), label="Median:\n2,717"),
hjust = 0, family = font_rc, size = 3, color = "lightslateblue"
) +
scale_y_comma(
name = "File size", trans="log10", limits=c(NA, 200000),
breaks = c(10, 100, 1000, 10000, 100000)
) +
labs(
x = NULL,
title = "Most 'util' files are between 1K and 10K in size",
caption = "Note y-axis log10 scale"
) +
theme_ipsum_rc(grid="Y")
We’ll need to do a bit more data collection to answer the last two questions.
Focus on Functions
To examine function names and source code statistics, we'll need to read in the contents of each file and parse them. Let's do that first bit with some help from the archive package, which will let us open up these compressed tar files and pull out just the file(s) we need from them versus having to code that up more manually.
Again, this code is only reproducible if you have CRAN handy, but soon (promise!) you'll have a file you can work with for the remainder of the post:
extract_source <- function(pkg, fil, .pb = NULL) {
if (!is.null(.pb)) .pb$tick()$print()
list.files(
path = "/cran/src/contrib", # my path to local CRAN
pattern = sprintf("^%s_.*gz", pkg), # rough pattern for the package archive filename
recursive = FALSE,
full.names = TRUE
) -> tgt
con <- archive_read(tgt[1], fil)
src <- readLines(con, warn = FALSE)
close(con)
paste0(src, collapse="\n")
}
pb <- progress_estimated(nrow(xdf))
xdf <- mutate(xdf, file_src = map2_chr(pkg, path, extract_source, .pb=pb))
That (on-drive) ~10MB data frame is at https://rud.is/dl/utility-belt.rds. The rest of the post builds off of it, so you can start coding along at home now.
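(Grabbing it is a one-liner or two; the ~/Data path just matches what the rest of the code here expects:)

dir.create("~/Data", showWarnings = FALSE)  # skip if you already have a ~/Data directory
download.file("https://rud.is/dl/utility-belt.rds", "~/Data/utility-belt.rds", mode = "wb")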
Let’s extract the function names:
# we'll use these two functions to help test whether bits
# of our parsed code are, indeed, functions.
#
# Alternately: "I heard you liked functions so I made
# functions to help you find functions"
#
# we could have used `rlang` helpers here, but I had these
# handy from pre-`rlang` days.
is_assign <- function(x) {
as.character(x) %in% c('<-', '=', '<<-', 'assign')
}
is_func <- function(x) {
is.call(x) &&
is_assign(x[[1]]) &&
is.call(x[[3]]) &&
(x[[3]][[1]] == quote(`function`))
}
read_rds("~/Data/utility-belt.rds") %>% # I have this file in ~/Data; change this for your location
mutate(parsed = map(file_src, ~parse(text = .x, keep.source = TRUE))) %>% # parse each file
mutate(func_names = map(parsed, ~{ # go through parsed file
keep(.x, is_func) %>% # and only keep functions
map(~as.character(.x[[2]])) %>% # extract the function name
flatten_chr() # return a character vector
})) -> xdf
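(As an aside on the "we could have used rlang helpers" comment above, a roughly equivalent test with rlang might look like this; a sketch only, matching the same top-level f <- function(...) pattern the base version handles:)

is_func_rlang <- function(x) {
  rlang::is_call(x, c("<-", "=", "<<-", "assign")) &&  # an assignment call...
    rlang::is_call(x[[3]], "function")                 # ...whose value is a function() expression
}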
With those handy, we can see if there are any commonalities across all these packages:
select(xdf, pkg, fil, func_names) %>%
unnest() %>%
count(func_names, sort=TRUE) %>%
print(n=20)
## func_names n
## 1 %||% 84
## 2 compact 19
## 3 isFALSE 19
## 4 assertthat::on_failure 16
## 5 is_windows 16
## 6 trim 14
## 7 .onLoad 13
## 8 names2 12
## 9 dots 11
## 10 is_string 11
## 11 vlapply 11
## 12 .onAttach 10
## 13 error.bars 10
## 14 normalize 10
## 15 vcapply 10
## 16 cat0 9
## 17 collapse 9
## 18 err 9
## 19 getmin 9
## 20 is_dir 9
## # ... with 1.252e+04 more rows
We can also see if there are common case conventions:
select(xdf, pkg, fil, func_names) %>%
unnest() %>%
mutate(is_camel = (!stri_detect_fixed(func_names, "_")) &
(!stri_detect_regex(func_names, "[[:alpha:]]\\.[[:alpha:]]")) &
(stri_detect_regex(func_names, "[A-Z]"))) %>%
mutate(is_dotcase = stri_detect_regex(func_names, "[[:alpha:]]\\.[[:alpha:]]")) %>%
mutate(is_snake = stri_detect_fixed(func_names, "_") &
(!stri_detect_regex(func_names, "[[:alpha:]]\\.[[:alpha:]]"))) -> case_hunt
count(case_hunt, is_camel, is_dotcase, is_snake) %>%
mutate(pct = scales::percent(n/sum(n))) %>%
mutate(description = c(
"one-'word' names",
"snake_case",
"dot.case",
"camelCase"
)) %>%
arrange(n) %>%
mutate(description = factor(description, description)) %>%
ggplot(aes(description, n)) +
geom_col(fill="lightslategray", width=0.65) +
geom_label(aes(y = n, label=pct), label.size=0, family=font_rc, nudge_y=150) +
scale_y_comma("Number of functions") +
labs(
x=NULL,
title = "dot.case does not seem to be en-vogue for utility belt functions"
) +
theme_ipsum_rc(grid="Y")
I had a hunch that isX…()/is_x…() could be likely names for utility belt functions, so let's normalize the function names to snake_case and see if that's true:
select(xdf, pkg, fil, func_names) %>%
unnest() %>%
filter(stri_detect_regex(func_names, "^(\\.is|is)")) %>%
mutate(func_names = snakecase::to_snake_case(func_names)) %>%
count(func_names, sort=TRUE)
## # A tibble: 547 x 2
## func_names n
## 1 is_false 24
## 2 is_windows 19
## 3 is_string 18
## 4 is_empty 13
## 5 is_dir 11
## 6 is_formula 11
## 7 is_installed 11
## 8 is_linux 9
## 9 is_na 9
## 10 is_error 8
## # ... with 537 more rows
Only 5% (819) out of the 14,123 extracted function names are is_ functions; not overwhelming, but a respectable slice.
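(For the record, the 5% figure comes from a tally along these lines; exact counts will depend on the parsing step above:)

select(xdf, pkg, fil, func_names) %>%
  unnest() %>%
  summarise(
    total = n(),                                                # all extracted function names
    is_x  = sum(stri_detect_regex(func_names, "^(\\.is|is)")),  # the is/.is flavored ones
    share = scales::percent(is_x / total)
  )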
There are more questions we could ask of function names and styles, but we’ll leave some work for y’all to do on your own.
Let’s head over to the final rooftop exercise.
Code, Comment & Blank Line Density
Since we have the raw source, we can also take a look at coding style. There are many questions we could ask here and more than a few packages we could draw on to help answer them. For now, we'll just take a look at the mean ratios of comments and blank lines to code across the packages in this utility belt corpus, and give you the opportunity to tease out other interesting tidbits such as "what base R and other package functions are most often used in utility belt functions?" or "are package authors using evil = for assignment or proper <-?" (there's a starter sketch for that last one just before FIN).
xdf %>%
mutate(
num_lines = stri_count_fixed(xdf$file_src, "\n"),
num_blank_lines = stri_count_regex(xdf$file_src, "^[[:space:]]*$", opts_regex = stri_opts_regex(multiline=TRUE)),
num_whole_line_comments = stri_count_regex(xdf$file_src, "^[[:space:]]*#", opts_regex = stri_opts_regex(multiline=TRUE)), # lines that contain only a comment
comment_density = num_whole_line_comments / (num_lines - num_blank_lines - num_whole_line_comments),
blank_density = num_blank_lines / (num_lines - num_whole_line_comments)
) %>%
select(-permsissions, -links, -owner, -group, -month, -day, -year_hr) -> xdf
# now compute mean ratios
group_by(xdf, pkg) %>%
summarise(
`Comment-to-code Ratio` = mean(comment_density),
`Blank lines-to-code Ratio` = mean(blank_density)
) %>%
ungroup() %>%
filter(!is.infinite(`Comment-to-code Ratio`)) %>%
filter(!is.nan(`Comment-to-code Ratio`)) %>%
filter(!is.infinite(`Blank lines-to-code Ratio`)) %>%
filter(!is.nan(`Blank lines-to-code Ratio`)) %>%
gather(measure, value, -pkg) -> code_ratios
# we want to label the median values
group_by(code_ratios, measure) %>%
summarise(median = median(value)) -> code_ratio_meds
ggplot(code_ratios, aes(measure, value, group=measure)) +
ggbeeswarm::geom_quasirandom(
fill="lightslategray", color="#2b2b2b", alpha=1/2,
stroke=0.25, size=3, shape=21
) +
geom_boxplot(fill="#00000000", outlier.colour = "#00000000") +
geom_label(
data = code_ratio_meds,
aes(-Inf, c(0.3, 5), label=sprintf("Median:\n%s", round(median, 2)), group=measure),
family = font_rc, size=3, color="lightslateblue", hjust = 0, label.size=0
) +
scale_y_continuous() +
labs(
x = NULL, y = NULL,
caption = "Note free y scale"
) +
facet_wrap(~measure, scales="free") +
theme_ipsum_rc(grid="Y", strip_text_face = "bold") +
theme(axis.text.x=element_blank())
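Before wrapping up, here's a starter sketch for the "evil = vs proper <-" reader exercise mentioned above; it only looks at top-level expressions in each file (reusing the parsed column and the is_assign() helper from earlier):

select(xdf, pkg, parsed) %>%
  mutate(assign_op = map(parsed, ~{
    keep(as.list(.x), ~is.call(.x) && is_assign(.x[[1]])) %>%  # top-level assignment calls only
      map_chr(~as.character(.x[[1]]))                          # which operator was used
  })) %>%
  unnest(assign_op) %>%
  count(assign_op, sort = TRUE)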
FIN
You can find the (on-drive) ~10MB data frame at: https://rud.is/dl/utility-belt.rds.
All the above code is in this gist: https://gist.github.com/hrbrmstr/33d29bb39eaa7f2f1e95308038f85b59.
If you do your own ‘utility belt’ analyses, drop a note in the comments with a link to your findings!
Does Congress Really Care About Your Privacy?
I apologize up-front for using bad words in this post.
Said bad words include “Facebook”, “Mark Zuckerberg” and many referrals to entities within the U.S. Government. Given the topic, it cannot be helped.
I've also left the R tag on this despite only showing some ggplot2 plots and Markdown tables. See the end of the post for how to get access to the code & data. R was used solely and extensively for the work behind the words.
This week Congress put on a show as they summoned the current Facebook CEO, Mark Zuckerberg, down to Washington, D.C. to demonstrate how little most of them know about how the modern internet and social networks actually work, plus chest-thump to prove to their constituents they really and truly care about you.
These Congress-critters offered such proof in the guise of railing against Facebook for how they’ve handled your data. Note that I should really say our data since they do have an extensive profile database on me and most everyone else even if they’re not Facebook platform users (full disclosure: I do not have a Facebook account).
Ostensibly, this data-mishandling impacted your privacy. Most of the committee members wanted any constituent viewers to come away believing they and their fellow Congress-critters truly care about your privacy.
Fortunately, we have a few ways to measure this "caring" and the remainder of this post will explore how much members of the U.S. House and Senate care about your privacy when you visit their official .gov web sites. Future posts may explore campaign web sites and other metrics, but what better place to show they care about you than right there in their digital houses.
Privacy Primer
When you visit a web site with any browser, the main URL pulls in resources to aid in the composition and functionality of the page. These could be:
- images (png, jpg, gif, svg, etc.)
- JavaScript
- CSS
- fonts
(plus some others)
When you go to, say, www.example.com, the site does not have to load all of its resources from example.com domains. In fact, it's rare to find a modern site which does not use resources from one or more third-party sites.
When each resource is loaded, (generally) some information about you goes along for the ride. At a minimum, the request time and source (your) IP address are exposed and, unless you're really careful/paranoid, the referring site, browser configuration and even cookies are also available to the third-party sites. It does not take many of these data points to (pretty much) uniquely identify you. And this is just for "benign" content like images. We'll get to JavaScript in a bit.
As you move along the web, these third-party touch-points add up. To demonstrate this, I did my best to de-privatize my browser and OS configuration and visited 12 web sites while keeping a fresh install of Firefox Lightbeam running. Here’s the result:
Each main circle is a distinct/main site and the triangles are resources the site tried to load. The red triangles indicate a common third-party resource that was loaded by two or more sites. Each of those red triangles knows where you’ve been (again, unless you’ve been very careful/paranoid) and can use that information to enhance their knowledge about you.
It gets a bit worse with JavaScript content since a much stronger fingerprint can be created for you (you can learn more about fingerprints at this spiffy EFF site). Plus, JavaScript code can try to pilfer cookies, “hack” the browser, serve up malicious adverts, measure time-on-site, and even enlist you in a cryptomining army.
There are other issues with trusting loaded browser content, but we’ll cover that a bit further into the investigation.
Measuring “Caring”
The word “privacy” was used over 100 times each day by both Zuckerberg and our Congress-critters. Senators and House members made it pretty clear Facebook should care more about your privacy. Implicit in said posit is that they, themselves, must care about your privacy. I’m sure they’ll be glad to point out all along the midterm campaign trails just how much they’re doing to protect your privacy.
We don’t just have to take their word for it. After berating Facebook’s chief college dropout and chastising the largest social network on the planet we can see just how much of “you” these representatives give to Facebook (and other sites) and also how much they protect you when you decide to pay them[†] [‡] a digital visit.
For this metrics experiment, I built a crawler using R and my splashr package which, in turn, uses ScrapingHub's open source Splash. Splash is an automation framework that lets you programmatically visit a site just like a human would with a real browser.
Normally when one scrapes content from the internet they're just grabbing the plain, single HTML file that is at the target of a URL. Splash lets us behave like a browser and capture all the resources (images, CSS, fonts, JavaScript) the site loads, and will also execute any JavaScript, so it will capture resources each script may itself load.
By capturing the entire browser experience for the main page of each member of Congress we can get a pretty good idea of just how much each one cares about your digital privacy, and just how much they secretly love Facebook.
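(To give a flavor of what that collection looks like for a single site, here's a minimal sketch, not the full crawler: it assumes a Splash instance is running locally (e.g. via Docker), that splashr's splash_local/render_har() are available along with the urltools package, and that the captured object follows the standard HAR structure.)

library(splashr)
library(urltools)
library(tidyverse)

# capture the full browser experience for one member site (one of my state's Senators)
har <- render_har(splash_local, "https://www.king.senate.gov/")

# tally the non-.gov hosts the page pulls resources from
tibble(url = map_chr(har$log$entries, ~.x$request$url)) %>%
  mutate(host = domain(url)) %>%
  filter(!str_detect(host, "\\.gov$")) %>%
  count(host, sort = TRUE)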
Let’s take a look, first, at where you go when you digitally visit a Congress-critter.
Network/Hosting/DNS
Each House and Senate member has an official (not campaign) site that is hosted on a .gov domain and served up from a handful of IP addresses across the following (n is the number of Congress-critter web sites):
"Orange" is really Akamai and Akamai is a giant content delivery network which helps web sites efficiently provide content to your browser and can offer Denial of Service (DoS) protection. Most sites are behind Akamai, which means you "touch" Akamai every time you visit the site. They know you were there, but I know a sufficient body of folks who work at Akamai and I'm fairly certain they're not too evil. Virtually no representative solely uses House/Senate infrastructure, but this is almost a necessity given how easy it is to take down a site with a DoS attack and how polarized politics is in America.
To get to those IP addresses, DNS names like www.king.senate.gov (one of the Senators from my state) need to be translated to IP addresses. DNS queries are also data gold mines, and everyone from your ISP to the DNS server that knows the name-to-IP mapping likely sees your IP address. Here are the DNS servers that serve up the directory lookups for all of the House and Senate domains:
Akamai kinda does need to serve up DNS for the sites they host, so this list also makes sense. But you've now had two touch-points logged and we haven't even loaded a single web page yet.
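(If you want to poke at that name-to-IP step from R yourself, curl::nslookup() will do it; a trivial sketch:)

# resolve one Congress-critter site with the system resolver
curl::nslookup("www.king.senate.gov", multiple = TRUE)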
Safe? & Secure? Connections
When we finally make a connection to a Congress-critter's site, it is going to be over SSL/TLS. They all support it (which is good, though SSL/TLS confidentiality is not as bullet-proof as many "HTTPS Everywhere" proponents would like to con you into believing). However, I took a look at the SSL certificates for House and Senate sites. Here's a sampling from, again, my state (one House representative):
The *.house.gov "Common Name (CN)" is a wildcard certificate. Many SSL certificates have just one valid CN, but it's also possible to list alternate, valid "alt" names that can all use the same, single certificate. Wildcard certificates ease the burden of administration, but they also mean that if, say, I managed to get my hands on the certificate chain and private key file, I could set up vladimirputin.house.gov somewhere and your browser would think it's A-OK. Granted, there are far more Representatives than there are Senators and their tenure length is pretty erratic these days, so I can sort of forgive them for taking the easy route, but I also in no way, shape or form believe they protect those chains and private keys well.
In contrast, the Senate can and does embed the alt-names:
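(You can eyeball the CN and alt-names from R as well; a rough sketch using the openssl package, with field names per that package's certificate accessors:)

chain <- openssl::download_ssl_cert("www.house.gov", 443)  # leaf certificate is first in the chain
leaf  <- as.list(chain[[1]])
leaf$subject    # look for the CN= component (e.g. a '*.house.gov' wildcard)
leaf$alt_names  # any subject-alternative names embedded in the certificate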
Are We There Yet?
We’ve got the IP address of the site and established a “secure” connection. Now it’s time to grab the index page and all the rest of the resources that come along for the ride. As noted in the Privacy Primer (above), the loading of third-party resources is problematic from a privacy (and security) perspective. Just how many third party resources do House and Senate member sites rely on?
To figure that out, I tallied up all of the non-.gov resources loaded by each web site and plotted the distribution of House and Senate (separately) in a "beeswarm" plot with a boxplot shadowing underneath so you can make out the pertinent quantiles:
As noted, the median is around 30 for both House and Senate member sites. In other words, they value your browsing privacy so little that most Congress-critters gladly share your browser session with many other sites.
We also talked about confidentiality above. If an https site loads http resources, the contents of what you see on the page cannot be guaranteed. So, how responsible are they when it comes to at least ensuring these third-party resources are loaded over https?
You're mostly covered from a pseudo-confidentiality perspective, but what are they serving up to you? Here's a summary of the MIME types being delivered to you:
We’ll cover some of these in more detail a bit further into the post.
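(Building on the single-site HAR sketch from earlier, both the http-vs-https question and the MIME type mix fall out of the same entries; a sketch:)

tibble(
  url    = map_chr(har$log$entries, ~.x$request$url),
  mime   = map_chr(har$log$entries, ~.x$response$content$mimeType %||% NA_character_),  # %||% comes along with purrr
  scheme = scheme(url)   # urltools::scheme()
) -> resources

count(resources, scheme)             # how much is still plain http?
count(resources, mime, sort = TRUE)  # and what's actually being delivered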
Facebook & “Friends”
Facebook started all this, so just how cozy are these Congress-critters with Facebook?
Turns out that both Senators and House members are very comfortable letting you give Facebook a love-tap when you come visit their sites since over 60% of House and 40% of Senate sites use 2 or more Facebook resources. Not all Facebook resources are created equal[ly evil] and we’ll look at some of the more invasive ones soon.
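(Continuing the single-site sketch, here's a crude way to count Facebook touch-points for one page. The facebook.com/facebook.net/fbcdn.net pattern is my guess at a reasonable match list, not necessarily the one used for the charts here:)

tibble(url = map_chr(har$log$entries, ~.x$request$url)) %>%
  filter(str_detect(url, "facebook\\.com|facebook\\.net|fbcdn\\.net")) %>%
  nrow()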
Facebook is not the only devil out there. I added in the public filter list from Disconnect and the numbers go up from 60% to 70% for the House and from 40% to 60% for the Senate when it comes to a larger corpus of known tracking sites/resources.
Here’s a list of some (first 20) of the top domains (with one of Twitter’s media-serving domains taking the individual top-spot):
So, when you go to check out what your representative is ‘officially’ up to, you’re being served…up on a silver platter to a plethora of sites where you are the product.
It’s starting to look like Congress-folk aren’t as sincere about your privacy as they may have led us all to believe this week.
A [Java]Script for Success[ful Privacy Destruction]
As stated earlier, not all third-party content is created equally malicious. JavaScript resources run code in your browser on your device and while there are limits to what it can do, those limits diminish weekly as crafty coders figure out more ways to use JavaScript to collect information and perform shady or malicious deeds.
So, how many House/Senate sites load one or more third-party JavaScript resources?
Virtually all of them.
To make matters worse, no .gov or third-party resource of any kind was loaded using subresource integrity validation. Subresource integrity validation means that the site owner, at some point, ensured that the resource being loaded was not malicious and then created a fingerprint for it and told your browser what that fingerprint is so it can compare it to what got loaded. If the fingerprints don't match, the content is not loaded/executed. Using subresource integrity is not trivial since it requires a top-notch content management team, and failure to synchronize/checkpoint third-party content fingerprints will result in resources failing to load.
Congress was quick to demand that Facebook implement stronger policies and controls, but they, themselves, cannot be bothered.
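(Checking for subresource integrity from R is straightforward; a sketch that just counts integrity attributes on script and stylesheet tags, assuming xml2/rvest are handy:)

pg <- xml2::read_html("https://www.king.senate.gov/")
rvest::html_nodes(pg, "script[src], link[rel='stylesheet']") %>%
  rvest::html_attr("integrity") %>%
  {sum(!is.na(.))}   # zero means no SRI fingerprints at all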
Future Work
There are plenty more avenues to explore in this data set (such as "security headers": they all (100% of them) use strict-transport-security pretty well, but are deeply deficient in others) and more targets for future work, such as the campaign sites of House and Senate members. I may follow up with a look at a specific slice from this data set (the members of the committees who were berating Zuckerberg this week).
The bottom line is that while the beating Facebook took this week was just, those inflicting the pain have a long way to go themselves before they can truly judge what other social media and general internet sites do when it comes to ensuring the safety and privacy of their visitors.
In other words: "Legislator, regulate thyself" before thou regulatest others.
FIN
Apart from some egregiously bad (or benign) examples, I tried not to "name and shame". I also won't answer any questions about facets by party since that really doesn't matter too much; they're all pretty bad when it comes to understanding and implementing privacy and safety on their sites.
The data set can be found over at Zenodo (alternately, click/tap/select the badge below). I converted the R data frame to ndjson/streaming JSON/jsonlines (however you refer to the format) and tested it out in Apache Drill.
I’ll toss up some R code using data extracts later this week (meaning by April 20th).