
Category Archives: Cybersecurity

Phishing is [still] the primary way attackers either commit the criminal act itself (e.g. phishing a target to, say, install ransomware) or gain the initial foothold in an organization that lets them perform other criminal operations to achieve some goal. As such, security teams, vendors and active members of the cybersecurity community work diligently to neutralize phishing campaigns as quickly as possible.

One popular community tool/resource in this pursuit is PhishTank, a collaborative clearing house for data and information about phishing on the Internet. PhishTank also provides an open API that lets developers and researchers integrate anti-phishing data into their applications at no charge.

While the PhishTank API is useful for real-time anti-phishing operations, the data is also useful to security researchers as we work to understand the ebb, flow and evolution of these attacks. One avenue of research is to track the various features associated with phishing campaigns, which include (amongst many other elements) the network (internet) location of the phishing site, the industry being targeted, the domain names being used, the types of sites being cloned/copied and the feature we’ll be looking at in this post: what percentage of new phishing sites use SSL encryption and, of those, which types of SSL certificates are “en vogue”.

Phishing sites increasingly use and rely on SSL certificates because we in the information security industry spent a decade instructing the general internet-surfing population to trust sites with the green lock icon near the location bar. Initially, phishers worked to compromise existing, encryption-enabled web properties and install phishing sites/pages there, since they could leech off of the “trusted” status of the associated SSL certificates. However, the advent of services like Let’s Encrypt has made it possible for attackers to set up their own phishing domains that look legitimate to current-generation internet browsers and prey upon the decades-old “trust the lock icon” mantra that most internet users still believe. We’ll table that path of discussion (since it’s fraught with peril if you don’t support the internet-do-gooder-consequences-be-darned cabal’s personal agendas) and just focus on how to work with PhishTank data in R and take a look at the most prevalent SSL certs used in the past week (you can extend the provided example to go back as far as you like, provided the phishing sites are still online).

Accessing PhishTank From R

You can use the aquarium package [GL|GH] to gain access to the data provided by PhishTank’s API (you need to sign up for access and put your API key into the PHISHTANK_API_KEY environment variable, which is best done via your ~/.Renviron file).

Let’s set up all the packages we’ll need and cache a current copy of the PhishTank data. The package forces you to use your own caching strategy since it doesn’t make sense for it to decide that for you. I’d suggest either the time-stamped approach below or some type of database system (or, say, Apache Drill) to actually manage the data.

Here are the packages we’ll need:

library(psl) # git[la|hu]b/hrbrmstr/psl
library(curlparse) # git[la|hu]b/hrbrmstr/curlparse
library(aquarium) # git[la|hu]b/hrbrmstr/aquarium
library(gt) # github/rstudio/gt
library(furrr)
library(stringi)
library(openssl)
library(tidyverse)

NOTE: The psl and curlparse packages are optional. Windows users will find it difficult to get them working, and it may be easier to review the functions provided by the urlparse package and substitute equivalents for the domain() and apex_domain() functions used below.
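
For example, a rough (and untested) substitution built on the urltools package might look like the following; urltools::suffix_extract() consults a bundled copy of the public suffix list, so treat the apex calculation as an approximation rather than a drop-in guarantee:

library(urltools)

# possible stand-ins for curlparse::domain() and psl::apex_domain()
domain <- function(urls) urltools::domain(urls)

apex_domain <- function(doms) {
  parts <- urltools::suffix_extract(doms)
  ifelse(
    is.na(parts$domain),
    parts$host,
    paste(parts$domain, parts$suffix, sep = ".")
  )
}

Now, we get a copy of the current PhishTank dataset & cache it: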

if (!file.exists("~/Data/2018-12-23-fishtank.rds")) {
  xdf <- pt_read_db()
  saveRDS(xdf, "~/Data/2018-12-23-fishtank.rds")
} else {
  xdf <- readRDS("~/Data/2018-12-23-fishtank.rds")
}

Let’s take a look:

glimpse(xdf)
## Observations: 16,446
## Variables: 9
## $ phish_id          <chr> "5884184", "5884138", "5884136", "5884135", ...
## $ url               <chr> "http://internetbanking-bancointer.com.br/lo...
## $ phish_detail_url  <chr> "http://www.phishtank.com/phish_detail.php?p...
## $ submission_time   <dttm> 2018-12-22 20:45:09, 2018-12-22 18:40:24, 2...
## $ verified          <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ verification_time <dttm> 2018-12-22 20:45:52, 2018-12-22 21:26:49, 2...
## $ online            <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ details           <list> [<209.132.252.7, 209.132.252.0/24, 7296 468...
## $ target            <chr> "Other", "Other", "Other", "PayPal", "Other"...

The data is really straightforward. We have unique ids for each site/campaign, the URL of the site, and a URL to extra descriptive info PhishTank has on the site/campaign. We also know when the site was submitted/discovered and other details, such as the network/internet space the site is in:

glimpse(xdf$details[1])
## List of 1
##  $ :'data.frame':    1 obs. of  6 variables:
##   ..$ ip_address        : chr "209.132.252.7"
##   ..$ cidr_block        : chr "209.132.252.0/24"
##   ..$ announcing_network: chr "7296 468"
##   ..$ rir               : chr "arin"
##   ..$ country           : chr "US"
##   ..$ detail_time       : chr "2018-12-23T01:46:16+00:00"

We’re going to focus on recent phishing sites (in this case, ones that are less than a week old) and those that use SSL certificates:

filter(xdf, verified == "yes") %>%
  filter(online == "yes") %>%
  mutate(diff = as.numeric(difftime(Sys.Date(), verification_time), "days")) %>%
  filter(diff <= 7) %>%
  { all_ct <<- nrow(.) ; . } %>%
  filter(grepl("^https", url)) %>%
  { ssl_ct <<- nrow(.) ; . } %>%
  mutate(
    domain = domain(url),
    apex = apex_domain(domain)
  ) -> recent

Let’s see how many are using SSL:

(ssl_ct)
## [1] 383

(pct_ssl <- ssl_ct / all_ct)
## [1] 0.2919207

This percentage is lower than the “50% of all phishing sites use encryption” statistic that has been going around of late. There are many reasons for the difference:

  • PhishTank doesn’t have all phishing sites in it
  • We just looked at a week of examples
  • Some sites were offline at the time of access attempt
  • Diverse attacker groups with varying degrees of competence engage in phishing attacks

Despite the roughly 20-point deviation, 30% is still a decent percentage, and a green “everything’s fine” lock icon is still a valued prize, so we shall pursue our investigation.

Now we need to retrieve all those certs. This can be a slow operation, so we’ll grab them in parallel. It’s also quite possible the “online” status in the data frame glimpse above is inaccurate (sites can go offline quickly), so we’ll catch certificate request failures with safely() and cache the results:

cert_dl <- purrr::safely(openssl::download_ssl_cert)

plan(multiprocess)

if (!file.exists("~/Data/recent.rds")) {

  recent <- mutate(recent, cert = future_map(domain, cert_dl))
  saveRDS(recent, "~/Data/recent.rds")

} else {
  recent <- readRDS("~/Data/recent.rds")
}

Let’s see how many request failures we had:

(failed <- sum(map_lgl(recent$cert, ~is.null(.x$result))))
## [1] 25

(failed / nrow(recent))
## [1] 0.06527415

As noted in the introduction to this post, when attackers want to use SSL for the lock icon ruse they can either try to piggyback off of legitimate domains or rely on Let’s Encrypt to help them commit crimes. Let’s see what the top “apex” domains (https://help.github.com/articles/about-supported-custom-domains/#apex-domains) were in use in the past week:

count(recent, apex, sort = TRUE)
## # A tibble: 255 x 2
##    apex                              n
##    <chr>                         <int>
##  1 000webhostapp.com                42
##  2 google.com                       17
##  3 umbler.net                        8
##  4 sharepoint.com                    6
##  5 com-fl.cz                         5
##  6 lbcpzonasegurabeta-viabcp.com     4
##  7 windows.net                       4
##  8 ashaaudio.net                     3
##  9 brijprints.com                    3
## 10 portaleisp.com                    3
## # ... with 245 more rows

We can see that a large hosting provider (000webhostapp.com) bore a decent number of these sites, but Google Sites (which is what the full domain represented by the google.com apex domain here usually points to), Microsoft SharePoint (sharepoint.com) and Microsoft forums (windows.net) are in active use as well (which is smart given the pervasive trust associated with those properties). There are 255 distinct apex domains in this 1-week set, so what is the SSL cert diversity across these pages/campaigns?

We ultimately used openssl::download_ssl_cert to retrieve the SSL certs of each site that was online, so let’s get the issuer and intermediary certs from them and look at the prevalence of each. We’ll extract the fields from the issuer component returned by openssl::download_ssl_cert then just do some basic maths:

filter(recent, map_lgl(cert, ~!is.null(.x$result))) %>%
  mutate(issuers = map(cert, ~map_chr(.x$result, ~.x$issuer))) %>%
  mutate(
    inter = map_chr(issuers, ~.x[1]), # the order is not guaranteed here but the goal of the exercise is
    root = map_chr(issuers, ~.x[2])   # to get you working with the data vs build a 100% complete solution
  ) %>%
  mutate(
    inter = stri_replace_all_regex(inter, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>% # there are parsers for the cert info fields but this hack is quick and works
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols),
    root = stri_replace_all_regex(root, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>%
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols)
  ) -> recent

Let’s take a look at roots:

unnest(recent, root) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN | n | pct
DST Root CA X3 | 96 | 26.82%
COMODO RSA Certification Authority | 93 | 25.98%
DigiCert Global Root G2 | 45 | 12.57%
Baltimore CyberTrust Root | 30 | 8.38%
GlobalSign | 27 | 7.54%
DigiCert Global Root CA | 15 | 4.19%
Go Daddy Root Certificate Authority – G2 | 14 | 3.91%
COMODO ECC Certification Authority | 11 | 3.07%
Actalis Authentication Root CA | 9 | 2.51%
GlobalSign Root CA | 4 | 1.12%
Amazon Root CA 1 | 3 | 0.84%
Let’s Encrypt Authority X3 | 3 | 0.84%
AddTrust External CA Root | 2 | 0.56%
DigiCert High Assurance EV Root CA | 2 | 0.56%
USERTrust RSA Certification Authority | 2 | 0.56%
GeoTrust Global CA | 1 | 0.28%
SecureTrust CA | 1 | 0.28%

DST Root CA X3 is (wait for it) Let’s Encrypt! Comodo is not far behind and indeed surpasses LE if we combine the extra-special “enhanced” versions they provide (and it’s important for you to read the comments near the lines of code above that make assumptions about the order of the returned issuer information). Now, let’s take a look at intermediaries:

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN | n | pct
Let’s Encrypt Authority X3 | 99 | 27.65%
cPanel\, Inc. Certification Authority | 75 | 20.95%
RapidSSL TLS RSA CA G1 | 45 | 12.57%
Google Internet Authority G3 | 24 | 6.70%
COMODO RSA Domain Validation Secure Server CA | 20 | 5.59%
CloudFlare Inc ECC CA-2 | 18 | 5.03%
Go Daddy Secure Certificate Authority – G2 | 14 | 3.91%
COMODO ECC Domain Validation Secure Server CA 2 | 11 | 3.07%
Actalis Domain Validation Server CA G1 | 9 | 2.51%
RapidSSL RSA CA 2018 | 9 | 2.51%
Microsoft IT TLS CA 1 | 6 | 1.68%
Microsoft IT TLS CA 5 | 6 | 1.68%
DigiCert SHA2 Secure Server CA | 5 | 1.40%
Amazon | 3 | 0.84%
GlobalSign CloudSSL CA – SHA256 – G3 | 2 | 0.56%
GTS CA 1O1 | 2 | 0.56%
AlphaSSL CA – SHA256 – G2 | 1 | 0.28%
DigiCert SHA2 Extended Validation Server CA | 1 | 0.28%
DigiCert SHA2 High Assurance Server CA | 1 | 0.28%
Don Dominio / MrDomain RSA DV CA | 1 | 0.28%
GlobalSign Extended Validation CA – SHA256 – G3 | 1 | 0.28%
GlobalSign Organization Validation CA – SHA256 – G2 | 1 | 0.28%
RapidSSL SHA256 CA | 1 | 0.28%
TrustAsia TLS RSA CA | 1 | 0.28%
USERTrust RSA Domain Validation Secure Server CA | 1 | 0.28%
NA | 1 | 0.28%

LE is number one again! But it’s important to note that these issuer Common Names can roll up into a single issuing organization (given just how messed up integrity and encryption capability is when it comes to web site certs), so the raw results could do with a bit of post-processing for a more complete picture (an exercise left to intrepid readers).
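
As a starting point for that exercise: the issuer strings parsed above usually carry an O (organization) field alongside the CN, so (assuming an O column survived the bind_cols() step above) a rough roll-up by issuing organization might look like this:

unnest(recent, root) %>%
  distinct(phish_id, apex, O) %>% # O = issuing organization (assumed to be present for most issuers)
  count(O, sort = TRUE) %>%
  mutate(pct = n / sum(n))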

FIN

There are tons of avenues to explore with this data, so I hope this post whet your collective appetites sufficiently for you to dig into it, especially if you have some down-time coming.

Let me also take this opportunity to reissue guidance I and many others have offered this holiday season: be super careful about what you click on, which sites you even just visit, and just how much you really trust the site, provider and entity behind any form you’re about to enter your personal information and credit card info into.

Despite its now-inherent evil status, GitHub has some tools other repository aggregators do not. One such tool is the free vulnerability alert service, which will scan repositories for outdated and vulnerable dependencies.

Now, “R” is nowhere near a first-class citizen on the internet writ large, including in software development tooling (e.g. the Travis-CI and GitLab continuous integration recipes are community-maintained vs. first-class/supported offerings). This also means that GitHub’s service will never check for, nor alert on, security issues in a pure R package, mostly because there are only a teensy few of us who even bother to check packages for issues once in a while and there’s no easy way to report said issues into the CVE process (though I guess I could, given that my $DAYJOB is an official CVE issuer). So the integrity & safety of the R package ecosystem is still in a “trust me, everything’s fine!” state. Given that, any extra way to keep even some R packages less insecure is great.

So, right now you’re thinking “you click-baited us with a title that your lede just said isn’t possible…WTHeck?!”.

It’s true that GitHub does not consider R a first-class citizen, but it does support Java and:

    available.packages() %>% 
      dplyr::as_data_frame() %>% 
      tidyr::separate_rows(Imports, sep=",[[:space:]]*") %>% # we really just
      tidyr::separate_rows(Depends, sep=",[[:space:]]*") %>% # need these two
      tidyr::separate_rows(Suggests, sep=",[[:space:]]*") %>%
      tidyr::separate_rows(Enhances, sep=",[[:space:]]*") %>%
      dplyr::select(Package, Imports, Depends) %>% 
      filter(
        grepl("rJava", Imports) | grepl("rJava", "Depends") | 
          grepl("Suggests", Imports) | grepl("Enhances", "Depends")
      ) %>% 
      dplyr::distinct(Package) %>% 
      dplyr::summarise(total_pkgs_using_rjava = n())
    ## # A tibble: 1 x 1
    ##   total_pkgs_using_rjava
    ##                    <int>
    ## 1                     66

according to ☝ there are 66 CRAN packages that require rJava, seven of which explicitly provide only JARs (a compressed directory tree of supporting Java classes). There are more CRAN-unpublished rJava-based projects on GitLab & GitHub, but it’s likely that public-facing rJava packages that include or depend on public JAR-dependent projects still number less than ~200. Given the now >13K packages in CRAN, this is a tiny subset but with the sorry state of R security, anything is better than nothing.

Having said that, one big reason (IMO) for the lack of Java-wrapped CRAN or “devtools”-only released rJava-dependent packages is that it’s 2018 and you still have better odds of winning a Vegas jackpot than you do of getting rJava to work on your workstation in fewer than 4 tries, especially after an OS upgrade. That’s sad, since there are many wonderful, solid and useful Java libraries that would be super-handy for many workflows, yet most of us package-writers (I’m including myself) prefer to spin wheels getting C++ or Rust libraries working with R than to make it easier for regular R users to tap into that rich Java ecosystem.

But, I digress.

For the handful of us that do write and use rJava-based packages, we can better serve our userbase by deliberately putting those R+Java repos on GitHub. Now, I hear you. They’re evil and by doing this one of the most evil corporations on the planet can make money with your metadata (and, possibly just blatantly steal your code for use in-product without credit) but I’ll give that up on a case-by-case basis to make it easier to keep users safe.

Why will this enhance safety? Go take a look at one of my non-CRAN rJava-backed packages: pdfbox. It has this awesome “in-your-face” security warning banner:

The vulnerability is CVE-2018-11797, which is baseline-computed to be “high severity” with the following specific weakness: In Apache PDFBox 1.8.0 to 1.8.15 and 2.0.0RC1 to 2.0.11, a carefully crafted PDF file can trigger an extremely long running computation when parsing the page tree. So, it’s a process denial-of-service vulnerability. You’ll also note I haven’t updated the JARs yet (mostly since it’s not a code-execution vulnerability).

I knew about this 28 days ago (I’ve been incredibly busy and there’s a lot of blather required to talk about it, hence the delay in blogging) thanks to the GitHub service and will resolve it when I get some free time over the Thanksgiving break. I received an alert for this, there are hooks for security alerts (so one can auto-create an issue), and there’s a warning for users, any of whom could file an issue to let me know it’s super-important to them that I get it fixed (or they could be super-awesome and file a PR :-).

FIN

The TLDR is, first, a note to package authors who use rJava: bite the GitHub bullet and take advantage of this free service; and, second, a note to users: encourage authors of packages you use to adopt this service, and keep a watchful eye out for any security alerts for code you depend on to get things done.

A (perhaps) third and final note is for all of us to be continually mindful about the safety & integrity of the R package ecosystem and to do what we can to keep moving it forward.

I pen this mini-tome on “GDPR Enforcement Day”. The spirit of GDPR is great, but it’s just going to be another Potemkin Village in most organizations, much like PCI or SOX. For now, the only things GDPR has done are make GDPR consulting companies rich, increase the use of javascript on web sites so they can pop up useless banners we keep telling users not to click on, and increase the size of email messages to include mandatory postscripts (which should really be at the beginning of the message, but, hey, faux privacy is faux privacy).

Those are just a few of the “unintended consequences” of GDPR. Just like Let’s Encrypt & “HTTPS Everywhere” turned into “Let’s Enable Criminals and Hurt Real People With Successful Phishing Attacks”, GDPR is going to cause a great deal of downstream issues that either the designers never thought of or decided — in their infinite, superior wisdom — were completely acceptable to make themselves feel better.

Today’s installment of “GDPR Unintended Consequences” is WordPress.

WordPress “powers” a substantial part of the internet. As such, it is a perma-target of attackers.

Since the GDPR Intelligentsia provided a far-too-long lead-time on both the inaugural and mandated enforcement dates for GDPR and also created far more confusion with the regulations than clarity, WordPress owners are flocking to “single button install” solutions to make them magically GDPR compliant (#protip that’s not “a thing”). Here’s a short list of plugins and active installation counts (no links since I’m not going to encourage attack surface expansion):

  • WP GDPR Compliance : 50,000+ active installs
  • GDPR : 10,000+ active installs
  • The GDPR Framework : 6,000+ installs
  • GDPR Cookie Compliance : 10,000+ active installs
  • GDPR Cookie Consent : 200,000+ active installs
  • WP GDPR : 4,000 active installs
  • Cookiebot | GDPR Compliant Cookie Consent and Notice : 10,000+ active installations
  • GDPR Tools : 500+ active installs
  • Surbma — GDPR Proof Cookies : 400+ installs
  • Social Media Share Buttons & Social Sharing Icons (which “enhanced” GDPR compatibility) : 100,000+ active installs
  • iubenda Cookie Solution for GDPR : 10,000+ active installs
  • Cookie Consent : 100,000+ active installs

I’m somewhat confident that a fraction of those publishers follow secure coding guidelines (it may be a small fraction). But, if I was an attacker, I’d be poking pretty hard at a few of those with six-figure installs to see if I could find a usable exploit.

GDPR just gave attackers a huge footprint of homogeneous resources to attempt at-scale exploits. They will very likely succeed (over-and-over-and-over again). This means that GDPR just increased the likelihood of losing your data privacy…the complete opposite of the intent of the regulation.

There are more unintended consequences and I’ll pepper the blog with them as the year and pain progresses.

RIPE 76 is going on this week and — as usual — there are scads of great talks. The selected ones below are just my (slightly) thinner slice at what may have broader appeal outside pure networking circles.

Do not read anything more into the order than the end-number of the “Main URL” since this was auto-generated from a script that processed my Firefox tab URLs.

Artyom Gavrichenkov – Memcache Amplification DDoS: Lessons Learned

Erik Bais – Why Do We Still See Amplification DDOS Traffic

Jordi Palet Martinez – A New Internet Intro to HTTP/2, QUIC, DOH and DNS over QUIC

Sara Dickinson – DNS Privacy BCP

Jordi Palet Martinez – Email Servers on IPv6

Martin Winter – Real-Time BGP Toolkit: A New BGP Monitor Service

Job Snijders – Practical Data Sources For BGP Routing Security

Charles Eckel – Combining Open Source and Open Standards

Kostas Zorbadelos – Towards IPv6 Only: A large scale lw4o6 deployment (rfc7596) for broadband users @AS6799

Louis Poinsignon – Internet Noise (Announcing 1.1.1.0/24)

Filiz Yilmaz – Current Policy Topics – Global Policy Proposals

Geoff Huston – Measuring ATR

Moritz Muller, SIDN – DNSSEC Rollovers

Anand Buddhdev – DNS Status Report

Victoria Risk – A Survey on DNS Privacy

Baptiste Jonglez – High-Performance DNS over TCP

Sara Dickinson – Latest Measurements on DNS Privacy

Willem Toorop – Sunrise DNS-over-TLS! Sunset DNSSEC – Who Needs Reasons, When You’ve Got Heroes

Laurenz Wagner – A Modern Chatbot Approach for Accessing the RIPE Database

I apologize up-front for using bad words in this post.

Said bad words include “Facebook”, “Mark Zuckerberg” and many referrals to entities within the U.S. Government. Given the topic, it cannot be helped.

I’ve also left the R tag on this despite only showing some ggplot2 plots and Markdown tables. See the end of the post for how to get access to the code & data. R was used solely and extensively for the work behind the words.


This week Congress put on a show as they summoned the current Facebook CEO — Mark Zuckerberg — down to Washington, D.C. to demonstrate how little most of them know about how the modern internet and social networks actually work plus chest-thump to prove to their constituents they really and truly care about you.

These Congress-critters offered such proof in the guise of railing against Facebook for how they’ve handled your data. Note that I should really say our data, since they have an extensive profile database on me and most everyone else, even those of us who are not Facebook platform users (full disclosure: I do not have a Facebook account).

Ostensibly, this data-mishandling impacted your privacy. Most of the committee members wanted any constituent viewers to come away believing they and their fellow Congress-critters truly care about your privacy.

Fortunately, we have a few ways to measure this “caring” and the remainder of this post will explore how much members of the U.S. House and Senate care about your privacy when you visit their official .gov web sites. Future posts may explore campaign web sites and other metrics, but what better place to show they care about you than right there in their digital houses.

Privacy Primer

When you visit a web site with any browser, the main URL pulls in resources to aid in the composition and functionality of the page. These could be:

  • HTML (the main page is very likely HTML unless it’s just a media URL)
  • images (png, jpg, gif, “svg”, etc),
  • fonts
  • CSS (the “style sheet” that tells the browser how to decorate and position elements on the page)
  • binary objects (such as embedded PDF files or “protocol buffer” content)
  • XML or JSON
  • JavaScript

(plus some others)

When you go to, say, www.example.com the site does not have to load all the resources from example.com domains. In fact, it’s rare to find a modern site which does not use resources from one or more third party sites.

When each resource is loaded, (generally) some information about you goes along for the ride. At a minimum, the request time and source (your) IP address are exposed, and, unless you’re really careful/paranoid, the referring site, browser configuration and even cookies are available to those third-party sites. It does not take many of these data points to (pretty much) uniquely identify you. And this is just for “benign” content like images. We’ll get to JavaScript in a bit.

As you move along the web, these third-party touch-points add up. To demonstrate this, I did my best to de-privatize my browser and OS configuration and visited 12 web sites while keeping a fresh install of Firefox Lightbeam running. Here’s the result:

Each main circle is a distinct/main site and the triangles are resources the site tried to load. The red triangles indicate a common third-party resource that was loaded by two or more sites. Each of those red triangles knows where you’ve been (again, unless you’ve been very careful/paranoid) and can use that information to enhance their knowledge about you.

It gets a bit worse with JavaScript content since a much stronger fingerprint can be created for you (you can learn more about fingerprints at this spiffy EFF site). Plus, JavaScript code can try to pilfer cookies, “hack” the browser, serve up malicious adverts, measure time-on-site, and even enlist you in a cryptomining army.

There are other issues with trusting loaded browser content, but we’ll cover that a bit further into the investigation.

Measuring “Caring”

The word “privacy” was used over 100 times each day by both Zuckerberg and our Congress-critters. Senators and House members made it pretty clear Facebook should care more about your privacy. Implicit in said posit is that they, themselves, must care about your privacy. I’m sure they’ll be glad to point out all along the midterm campaign trails just how much they’re doing to protect your privacy.

We don’t just have to take their word for it. After berating Facebook’s chief college dropout and chastising the largest social network on the planet, we can see just how much of “you” these representatives give to Facebook (and other sites) and also how much they protect you when you decide to pay them a digital visit.

For this metrics experiment, I built a crawler using R and my splashr package which, in turn, uses ScrapingHub’s open source Splash. Splash is an automation framework that lets you programmatically visit a site just like a human would with a real browser.

Normally when one scrapes content from the internet they’re just grabbing the plain, single HTML file that is at the target of a URL. Splash lets us behave like a browser and capture all the resources — images, CSS, fonts, JavaScript — the site loads and will also execute any JavaScript, so it will also capture resources each script may itself load.

By capturing the entire browser experience for the main page of each member of Congress we can get a pretty good idea of just how much each one cares about your digital privacy, and just how much they secretly love Facebook.
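
The core of that capture is not much more than the following rough sketch (it assumes a Splash instance is already running locally on the default port, uses a single member site purely as an example, and walks the standard HAR structure render_har() returns):

library(splashr)
library(urltools)
library(tidyverse)

sp <- splash("localhost") # assumes a local Splash (docker) instance is up

# capture everything the browser loads for one member site
har <- render_har(sp, "https://www.king.senate.gov/")

# flatten the HAR entries into a tidy data frame of requested resources
map_df(har$log$entries, ~tibble(
  url  = .x$request$url,
  mime = .x$response$content$mimeType
)) %>%
  mutate(host = domain(url)) %>%
  count(host, sort = TRUE)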

Let’s take a look, first, at where you go when you digitally visit a Congress-critter.

Network/Hosting/DNS

Each House and Senate member has an official (not campaign) site that is hosted on a .gov domain and served up from a handful of IP addresses across the following (n is the number of Congress-critter web sites):

asn | aso (AS organization) | n (sites)
AS5511 | Orange | 425
AS7016 | Comcast Cable Communications, LLC | 95
AS20940 | Akamai International B.V. | 13
AS1999 | U.S. House of Representatives | 6
AS7843 | Time Warner Cable Internet LLC | 1
AS16625 | Akamai Technologies, Inc. | 1

“Orange” is really Akamai and Akamai is a giant content delivery network which helps web sites efficiently provide content to your browser and can offer Denial of Service (DoS) protection. Most sites are behind Akamai, which means you “touch” Akamai every time you visit the site. They know you were there, but I know a sufficient body of folks who work at Akamai and I’m fairly certain they’re not too evil. Virtually no representative solely uses House/Senate infrastructure, but this is almost a necessity given how easy it is to take down a site with a DoS attack and how polarized politics is in America.

To get to those IP addresses, DNS names like www.king.senate.gov (one of the Senators from my state) need to be translated to IP addresses. DNS queries are also data gold mines, and everyone from your ISP to the DNS server that knows the name-to-IP mapping likely sees your IP address. Here are the DNS servers that serve up the directory lookups for all of the House and Senate domains:

nameserver | gov_hosted
e4776.g.akamaiedge.net. | FALSE
wc.house.gov.edgekey.net. | FALSE
e509.b.akamaiedge.net. | FALSE
evsan2.senate.gov.edgekey.net. | FALSE
e485.b.akamaiedge.net. | FALSE
evsan1.senate.gov.edgekey.net. | FALSE
e483.g.akamaiedge.net. | FALSE
evsan3.senate.gov.edgekey.net. | FALSE
wwwhdv1.house.gov. | TRUE
firesideweb02cc.house.gov. | TRUE
firesideweb01cc.house.gov. | TRUE
firesideweb03cc.house.gov. | TRUE
dchouse01cc.house.gov. | TRUE
c3pocc.house.gov. | TRUE
ceweb.house.gov. | TRUE
wwwd2-cdn.house.gov. | TRUE
45press.house.gov. | TRUE
gopweb1a.house.gov. | TRUE
eleven11web.house.gov. | TRUE
frontierweb.house.gov. | TRUE
primitivesocialweb.house.gov. | TRUE

Akamai kinda does need to serve up DNS for the sites they host, so this list also makes sense. But, you’ve now had two touch-points logged and we haven’t even loaded a single web page yet.

Safe? & Secure? Connections

When we finally make a connection to a Congress-critter’s site, it is going to be over SSL/TLS. They all support it (which is good, but SSL/TLS confidentiality is not as bullet-proof as many “HTTPS Everywhere” proponents would like to con you into believing). However, I took a look at the SSL certificates for House and Senate sites. Here’s a sampling from, again, my state (one House representative):

The *.house.gov “Common Name (CN)” is a wildcard certificate. Many SSL certificates have just one valid CN, but it’s also possible to list alternate, valid “alt” names that can all use the same, single certificate. Wildcard certificates ease the burden of administration, but it also means that if, say, I managed to get my hands on the certificate chain and private key file, I could set up vladimirputin.house.gov somewhere and your browser would think it’s A-OK. Granted, there are far more Representatives than there are Senators and their tenure length is pretty erratic these days, so I can sort of forgive them for taking the easy route, but I also in no way, shape or form believe they protect those chains and private keys well.

In contrast, the Senate can and does embed the alt-names:
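
If you want to poke at the certificates yourself, openssl::download_ssl_cert() will hand you the chain for any of these sites. A quick peek (a sketch that assumes the chain elements expose subject and alt_names as plain list fields) looks something like this:

chain <- openssl::download_ssl_cert("www.house.gov")

chain[[1]]$subject   # leaf certificate subject (the wildcard CN for House sites); field name assumed
chain[[1]]$alt_names # subject alternative names (the approach the Senate takes); field name assumed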

Are We There Yet?

We’ve got the IP address of the site and established a “secure” connection. Now it’s time to grab the index page and all the rest of the resources that come along for the ride. As noted in the Privacy Primer (above), the loading of third-party resources is problematic from a privacy (and security) perspective. Just how many third party resources do House and Senate member sites rely on?

To figure that out, I tallied up all of the non-.gov resources loaded by each web site and plotted the distribution of House and Senate (separately) in a “beeswarm” plot with a boxplot shadowing underneath so you can make out the pertinent quantiles:
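
The plot itself is garden-variety ggplot2 plus ggbeeswarm. A rough sketch (site_summary, chamber and n_third_party are hypothetical names for a one-row-per-member-site summary of the crawl) looks like this:

library(ggbeeswarm)
library(tidyverse)

# hypothetical summary: one row per member site with
#   chamber       : "House" or "Senate"
#   n_third_party : count of non-.gov resources that site loaded
ggplot(site_summary, aes(chamber, n_third_party)) +
  geom_boxplot(width = 0.2, outlier.shape = NA, alpha = 0.4) +
  geom_quasirandom(size = 1, alpha = 0.6) +
  coord_flip() +
  labs(x = NULL, y = "Third-party (non-.gov) resources loaded")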

As noted, the median is around 30 for both House and Senate member sites. In other words, they value your browsing privacy so little that most Congress-critters gladly share your browser session with many other sites.

We also talked about confidentiality above. If an https site loads http resources, the contents of what you see on the page cannot be guaranteed. So, how responsible are they when it comes to at least ensuring these third-party resources are loaded over https?
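
One way to answer that (again assuming a hypothetical resources data frame with one row per loaded resource and its full url) is to look at the scheme split for non-.gov resources:

resources %>%                          # hypothetical: one row per loaded resource
  filter(!grepl("\\.gov/", url)) %>%   # rough third-party filter
  mutate(scheme = urltools::scheme(url)) %>%
  count(scheme) %>%
  mutate(pct = n / sum(n))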

You’re mostly covered from a pseudo-confidentiality perspective, but what are they serving up to you? Here’s a summary of the MIME types being delivered to you:

MIME Type | Number of Resources Loaded
image/jpeg | 6,445
image/png | 3,512
text/html | 2,850
text/css | 1,830
image/gif | 1,518
text/javascript | 1,512
font/ttf | 1,266
video/mp4 | 974
application/json | 673
application/javascript | 670
application/x-javascript | 353
application/octet-stream | 187
application/font-woff2 | 99
image/bmp | 44
image/svg+xml | 39
text/plain | 33
application/xml | 15
image/jpeg, video/mp2t | 12
application/x-protobuf | 9
binary/octet-stream | 5
font/woff | 4
image/jpg | 4
application/font-woff | 2
application/vnd.google.gdata.error+xml | 1

We’ll cover some of these in more detail a bit further into the post.

Facebook & “Friends”

Facebook started all this, so just how cozy are these Congress-critters with Facebook?

Turns out that both Senators and House members are very comfortable letting you give Facebook a love-tap when you come visit their sites since over 60% of House and 40% of Senate sites use 2 or more Facebook resources. Not all Facebook resources are created equal[ly evil] and we’ll look at some of the more invasive ones soon.

Facebook is not the only devil out there. I added in the public filter list from Disconnect and the numbers go up from 60% to 70% for the House and from 40% to 60% for the Senate when it comes to a larger corpus of known tracking sites/resources.
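
The tallies behind those percentages are simple aggregations. Using the same hypothetical per-resource data frame (this time with chamber and member_site identifier columns), the Facebook slice reduces to something like:

resources %>%
  mutate(host = urltools::domain(url)) %>%
  group_by(chamber, member_site) %>%   # chamber & member_site are hypothetical id columns
  summarise(fb = sum(grepl("(facebook\\.com|fbcdn\\.net)$", host))) %>%
  summarise(pct_two_plus = mean(fb >= 2)) # per chamber: fraction of sites with 2+ Facebook resources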

Here’s a list of the first 20 of the top third-party domains (with one of Twitter’s media-serving domains taking the individual top spot):

Main third-party domain | # of ‘pings’ | %
twimg.com | 764 | 13.7%
fbcdn.net | 655 | 11.8%
twitter.com | 573 | 10.3%
google-analytics.com | 489 | 8.8%
doubleclick.net | 462 | 8.3%
facebook.com | 451 | 8.1%
gstatic.com | 385 | 6.9%
fonts.googleapis.com | 270 | 4.9%
youtube.com | 246 | 4.4%
google.com | 183 | 3.3%
maps.googleapis.com | 144 | 2.6%
webtrendslive.com | 95 | 1.7%
instagram.com | 75 | 1.3%
bootstrapcdn.com | 68 | 1.2%
cdninstagram.com | 63 | 1.1%
fonts.net | 51 | 0.9%
ajax.googleapis.com | 50 | 0.9%
staticflickr.com | 34 | 0.6%
translate.googleapis.com | 34 | 0.6%
sharethis.com | 32 | 0.6%

So, when you go to check out what your representative is ‘officially’ up to, you’re being served…up on a silver platter to a plethora of sites where you are the product.

It’s starting to look like Congress-folk aren’t as sincere about your privacy as they may have led us all to believe this week.

A [Java]Script for Success[ful Privacy Destruction]

As stated earlier, not all third-party content is created equally malicious. JavaScript resources run code in your browser on your device and while there are limits to what it can do, those limits diminish weekly as crafty coders figure out more ways to use JavaScript to collect information and perform shady or malicious deeds.

So, how many House/Senate sites load one or more third-party JavaScript resources?

Virtually all of them.

To make matters worse, no .gov or third-party resource of any kind was loaded using subresource integrity validation. Subresource integrity validation means that the site owner — at some point — ensured that the resource being loaded was not malicious and then created a fingerprint for it and told your browser what that fingerprint is so it can compare it to what got loaded. If the fingerprints don’t match, the content is not loaded/executed. Using subresource integrity is not trivial since it requires a top-notch content management team and failure to synchronize/checkpoint third-party content fingerprints will result in resources failing to load.
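
Checking for the attribute is easy enough. Given any of these pages, a quick rvest pass (a sketch using a single live page read for illustration) counts how many script or stylesheet tags bother to declare an integrity hash:

library(rvest)

pg <- read_html("https://www.king.senate.gov/") # any member site will do

# scripts and stylesheets that declare a subresource-integrity hash
html_nodes(pg, "script[src], link[rel='stylesheet']") %>%
  html_attr("integrity") %>%
  { sum(!is.na(.)) } # count of resources that actually carry an integrity attribute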

Congress was quick to demand that Facebook implement stronger policies and controls, but they, themselves, cannot be bothered.

Future Work

There are plenty more avenues to explore in this data set (such as “security headers” — they all 100% use strict-transport-security pretty well, but are deeply deficient in others) and more targets for future works, such as the campaign sites of House and Senate members. I may follow up with a look at a specific slice from this data set (the members of the committees who were berating Zuckerberg this week).

The bottom line is that while the beating Facebook took this week was just, those inflicting the pain have a long way to go themselves before they can truly judge what other social media and general internet sites do when it comes to ensuring the safety and privacy of their visitors.

In other words, “Legislator, regulate thyself” before regulating others.

FIN

Apart from some egregiously bad (or benign) examples, I tried not to “name and shame”. I also won’t answer any questions about facets by party since that really doesn’t matter too much: they’re all pretty bad when it comes to understanding and implementing privacy and safety on their sites.

The data set can be found over at Zenodo (alternately, click/tap/select the badge below). I converted the R data frame to ndjson/streaming JSON/jsonlines (however you refer to the format) and tested it out in Apache Drill.

I’ll toss up some R code using data extracts later this week (meaning by April 20th).


The 2018 IEEE Security & Privacy Conference is in May but they’ve posted their full proceedings and it’s better to grab them early than to wait for it to become part of a paid journal offering.

There are a lot of papers. Not all match my interests, but (fortunately?) many did, and I’ve filtered down a list of the more interesting (to me) ones. It’s encouraging to see academic cybersecurity researchers branching out across a whole host of areas.

I can’t promise a “the morning paper”-esque daily treatment of these on the blog, but I’ll likely exposit a few of them over the coming weeks. I’ve emoji’d a few that stood out. Order is the order I read them in (no other meaning to the order).

What’s Up?

The NPR Visuals Team created and maintains a javascript library that makes it super easy to embed iframes on web pages and have said documents still be responsive.

The widgetframe R htmlwidget uses pym.js to bring this (much needed) functionality into widgets and (eventually) shiny apps.
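
If you’re not sure whether that applies to you: any place you wrapped an htmlwidget the way this (hypothetical, minimal) example does pulled a copy of pym.js along for the ride:

library(leaflet)
library(widgetframe)

# any widget wrapped with frameWidget() ships pym.js in its output
l <- leaflet() %>%
  addTiles() %>%
  setView(-93.65, 42.0285, zoom = 4)

frameWidget(l, height = 400)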

NPR reported a critical vulnerability in this library on February 15th, 2018 with no details (said details will be coming next week).

Per NPR’s guidance, any production code using pym.js needs to be pulled or updated to use this new library.

I created an issue & pushed up a PR that incorporates the new version. NOTE that the YAML config file in the existing CRAN package and GitHub dev version incorrectly has 1.3.2 as the version (it’s really the 1.3.1 dev version).

A look at the diff suggests that the library was not performing URL sanitization (and now is).

Watch Out For Standalone Docs

Any R markdown docs compiled in “standalone” mode will need to be recompiled and re-published as the vulnerable pym.js library comes along for the ride in those documents.

Regardless of “standalone mode”, if you used widgetframe in any context, anything created is vulnerable.

FIN

Once the final details are released I’ll update this post and may do a new post. Until then:

  • check if you’ve used widgetframe (directly or indirectly)
  • REMOVE ALL VULNERABLE DOCS from RPubs, GitHub pages, your web site (etc) today
  • regenerate all standalone documents ASAP
  • regenerate your blogs, books, dashboards, etc ASAP with the patched code; DO THIS FOR INTERNAL as well as internet-facing content.
  • monitor this space

NOTE: This is mainly for those of us in the Colonies, but some tips apply globally.

Black Friday / Cyber Monday / Cyber November / Holiday Shopping is upon us. You’re going to buy stuff. You’re going to use digital transactions to do so. Here are some tips in a semi-coherent order:

  • Sign up for a “reputable” credit card (is there such a thing? FinServs are pretty evil) with a low interest rate/cash back, multi-factor authentication on their web/app, a limit on total credit and a per-transaction limit. This card is just for shopping. Pay for petrol and groceries with something else.
  • Assign that to your PayPal, Amazon, Apple Pay, et al accounts and keep that as your only physical & digital card for your shopping sprees until the season ends.
  • Setup multi-factor auth on PayPal, Amazon, Apple Pay and anywhere else you shop. Don’t shop where you can’t do this.
  • Use Amazon or a site that accepts PayPal, Apple Pay, or Amazon payments. Yes, all those orgs are evil. But they do a better job than most when it comes to account security.
  • Use Quantum Firefox or the latest Chrome Betas to shop online. Nothing else. Check for updates daily & apply when they are out.
  • Double-check URLs when shopping. Make sure you’re on the site you want to be on. Let’s Encrypt made it super easy for attackers to pwn you this season. You can afford an extra 5 minutes since that’ll save you years battling identity theft or account bankruptcy.
  • Type all URLs into Google’s safety net — https://transparencyreport.google.com/safe-browsing/search — if at all possible before even considering trusting them.
  • Don’t use any storefront that uses a Let’s Encrypt certificate. Any.
  • Never let sites store your credit card or bank info.
  • Never shop on a site that has any errors associated with their SSL/TLS certificates. Let’s Encrypt killed the integrity of the lock icon and well-resourced adversaries can thwart the encryption, but the opportunistic attackers likely to try to pwn you are going to be stopped.
  • Avoid shopping with Apps. App developers are generally daft and have wretched security practices baked into their apps.
  • Use “Private Browsing” mode to shop if at all possible and start new browser sessions per-site. Your shopping habits and purchase info is as or even more valuable than your card digits, esp to trackers.
  • Use Ublock Origin or other reputable ad-blockers and tracking blockers to prevent orgs from tracking you as you shop. A good hosts file wouldn’t hurt, either.
  • Use Quad9 as your DNS provider starting now.
  • Never shop online from public Wi-Fi.
  • Don’t shop online from your company’s network (even the “guest” network). They track you. They all do or at least send data (whether they know it or not) to security appliance and “cloud” services that will use it against you or to profit off of you.
  • Absolutely do not use a store’s Wi-Fi to shop.
  • If using Amazon, avoid third-party sellers if at all possible. Scammers abound.
  • Never use social networks to share what you just purchased.
  • Never “SQUEEE” on social media that any shipments are “arriving today” and you’re “so excited!”.
  • Don’t use that daft, new Amazon video-delivery-bluetooth-alexa lock thing. Ever.
  • If you can afford it, use an in-home (not cloud-based) security camera pointed at the place where deliveries come and review the footage daily if you are expecting deliveries.
  • In-person/brick-and-mortar shopping should be done at chip+pin establishments or use cash at all others.
  • Review your day’s purchases online at the end of the day or the next morning.
  • Report all issues immediately to authorities then the establishments.

Why this particular slice of advice?

The U.S. moved to chip & signature in October of 2016. This has forced attackers to find different, creative ways to get your credit card info. Yes, there were scads of breaches this year, but a good chunk of digital crime is plain ol’ theft. Web sites make great targets. Public Wi-Fi makes a great target. You need to protect yourself since no store, org, bank, politician or authority really cares that your identity was stolen. If they did, we wouldn’t be in the breach mess we’re in now.

Attackers know you’re in deep “breach fatigue” and figure you’re all in a “Meh. Nothing matters” mood. Don’t be pwnd! A wrong move could put you in identity theft limbo for years.

The Identity Theft Resource Center — http://www.idtheftcenter.org/ — is a great resource and can definitely help you in the right direction if you don’t follow the above advice and run into issues.

Stay safe this shopping season!