Skip navigation

Category Archives: Encryption

Phishing is [still] the primary way attackers either commit a primary criminal act (i.e. phish a target to, say, install ransomware) or is the initial vehicle used to gain a foothold in an organization so they can perform other criminal operations to achieve some goal. As such, security teams, vendors and active members of the cybersecurity community work diligently to neutralize phishing campaigns as quickly as possible.

One popular community tool/resource in this pursuit is PhishTank which is a collaborative clearing house for data and information about phishing on the Internet. Also, PhishTank provides an open API for developers and researchers to integrate anti-phishing data into their applications at no charge.

While the PhishTank API is useful for real-time anti-phishing operations the data is also useful for security researchers as we work to understand the ebb, flow and evolution of these attacks. One avenue of research is to track the various features associated with phishing campaigns which include (amongst many other elements) network (internet) location of the phishing site, industry being targeted, domain names being used, what type of sites are being cloned/copied and a feature we’ll be looking at in this post: what percentage of new phishing sites use SSL encryption and — of these — which type of SSL certificates are “en vogue”.

Phishing sites are increasingly using and relying on SSL certificates because we in the information security industry spent a decade instructing the general internet surfing population to trust sites with the green lock icon near the location bar. Initially, phishers worked to compromise existing, encryption-enabled web properties to install phishing sites/pages since they could leech off of the “trusted” status of the associated SSL certificates. However, the advent of services like Let’s Encrypt have made it possible for attacker to setup their own phishing domains that look legitimate to current-generation internet browsers and prey upon the decade’s old “trust the lock icon” mantra that most internet users still believe. We’ll table that path of discussion (since it’s fraught with peril if you don’t support the internet-do-gooder-consequences-be-darned cabal’s personal agendas) and just focus on how to work with PhishTank data in R and take a look at the most prevalent SSL certs used in the past week (you can extend the provided example to go back as far as you like provided the phishing sites are still online).

Accessing PhishTank From R

You can use the aquarium package [GL|GH] to gain access to the data provided by PhishTank’s API (you need to sign up for access and put you API key into the PHISHTANK_API_KEY environment variable which is best done via your ~/.Renviron file).

Let’s setup all the packages we’ll need and cache a current copy of the PhishTank data. The package forces you to utilize your own caching strategy since it doesn’t make sense for it to decide that for you. I’d suggest either using the time-stamped approach below or using some type of database system (or, say, Apache Drill) to actually manage the data.

Here are the packages we’ll need:

library(psl) # git[la|hu]b/hrbrmstr/psl
library(curlparse) # git[la|hu]b/hrbrmstr/curlparse
library(aquarium) # git[la|hu]b/hrbrmstr/aquarium
library(gt) # github/rstudio/gt
library(furrr)
library(stringi)
library(openssl)
library(tidyverse)

NOTE: The psl and curlparse packages are optional. Windows users will find it difficult to get them working and it may be easier to review the functions provided by the urlparse package and substitute equivalents for the domain() and apex_domain() functions used below. Now, we get a copy of the current PhishTank dataset & cache it:

if (!file.exists("~/Data/2018-12-23-fishtank.rds")) {
  xdf <- pt_read_db()
  saveRDS(xdf, "~/Data/2018-12-23-fishtank.rds")
} else {
  xdf <- readRDS("~/Data/2018-12-23-fishtank.rds")
}

Let’s take a look:

glimpse(xdf)
## Observations: 16,446
## Variables: 9
## $ phish_id          <chr> "5884184", "5884138", "5884136", "5884135", ...
## $ url               <chr> "http://internetbanking-bancointer.com.br/lo...
## $ phish_detail_url  <chr> "http://www.phishtank.com/phish_detail.php?p...
## $ submission_time   <dttm> 2018-12-22 20:45:09, 2018-12-22 18:40:24, 2...
## $ verified          <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ verification_time <dttm> 2018-12-22 20:45:52, 2018-12-22 21:26:49, 2...
## $ online            <chr> "yes", "yes", "yes", "yes", "yes", "yes", "y...
## $ details           <list> [<209.132.252.7, 209.132.252.0/24, 7296 468...
## $ target            <chr> "Other", "Other", "Other", "PayPal", "Other"...

The data is really straightforward. We have unique ids for each site/campaign the URL of the site along with a URL to extra descriptive info PhishTank has on the site/campaign. We also know when the site was submitted/discovered and other details, such as the network/internet space the site is in:

glimpse(xdf$details[1])
## List of 1
##  $ :'data.frame':    1 obs. of  6 variables:
##   ..$ ip_address        : chr "209.132.252.7"
##   ..$ cidr_block        : chr "209.132.252.0/24"
##   ..$ announcing_network: chr "7296 468"
##   ..$ rir               : chr "arin"
##   ..$ country           : chr "US"
##   ..$ detail_time       : chr "2018-12-23T01:46:16+00:00"

We’re going to focus on recent phishing sites (in this case, ones that are less than a week old) and those that use SSL certificates:

filter(xdf, verified == "yes") %>%
  filter(online == "yes") %>%
  mutate(diff = as.numeric(difftime(Sys.Date(), verification_time), "days")) %>%
  filter(diff <= 7) %>%
  { all_ct <<- nrow(.) ; . } %>%
  filter(grepl("^https", url)) %>%
  { ssl_ct <<- nrow(.) ; . } %>%
  mutate(
    domain = domain(url),
    apex = apex_domain(domain)
  ) -> recent

Let’s ee how many are using SSL:

(ssl_ct)
## [1] 383

(pct_ssl <- ssl_ct / all_ct)
## [1] 0.2919207

This percentage is lower than a recent “50% of all phishing sites use encryption” statistic going around of late. There are many reasons for the difference:

  • PhishTank doesn’t have all phishing sites in it
  • We just looked at a week of examples
  • Some sites were offline at the time of access attempt
  • Diverse attacker groups with varying degrees of competence engage in phishing attacks

Despite the 20% deviation, 30% is still a decent percentage, and a green, “everything’s ??” icon is a still a valued prize so we shall pursue our investigation.

Now we need to retrieve all those certs. This can be a slow operation that so we’ll grab them in parallel. It’s also quite possible the “online”status above data frame glimpse is inaccurate (sites can go offline quickly) so we’ll catch certificate request failures with safely() and cache the results:

cert_dl <- purrr::safely(openssl::download_ssl_cert)

plan(multiprocess)

if (!file.exists("~/Data/recent.rds")) {

  recent <- mutate(recent, cert = future_map(domain, cert_dl))
  saveRDS(recent, "~/Data/recent.rds")

} else {
  recent <- readRDS("~/Data/recent.rds")
}

Let see how many request failures we had:

(failed <- sum(map_lgl(recent$cert, ~is.null(.x$result))))
## [1] 25

(failed / nrow(recent))
## [1] 0.06527415

As noted in the introduction to the blog, when attackers want to use SSL for the lock icon ruse they can either try to piggyback off of legitimate domains or rely on Let’s Encrypt to help them commit crimes. Let’s see what the top p”apex” domains](https://help.github.com/articles/about-supported-custom-domains/#apex-domains) were in use in the past week:

count(recent, apex, sort = TRUE)
## # A tibble: 255 x 2
##    apex                              n
##    <chr>                         <int>
##  1 000webhostapp.com                42
##  2 google.com                       17
##  3 umbler.net                        8
##  4 sharepoint.com                    6
##  5 com-fl.cz                         5
##  6 lbcpzonasegurabeta-viabcp.com     4
##  7 windows.net                       4
##  8 ashaaudio.net                     3
##  9 brijprints.com                    3
## 10 portaleisp.com                    3
## # ... with 245 more rows

We can see that a large hosting provider (000webhostapp.com) bore a decent number of these sites, but Google Sites (which is what the full domain represented by the google.com apex domain here is usually pointing to) Microsoft SharePoint (sharepoint.com) and Microsoft forums (windows.net) are in active use as well (which is smart give the pervasive trust associated with those properties). There are 241 distinct apex domains in this 1-week set so what is the SSL cert diversity across these pages/campaigns?

We ultimately used openssl::download_ssl_cert to retrieve the SSL certs of each site that was online, so let’s get the issuer and intermediary certs from them and look at the prevalence of each. We’ll extract the fields from the issuer component returned by openssl::download_ssl_cert then just do some basic maths:

filter(recent, map_lgl(cert, ~!is.null(.x$result))) %>%
  mutate(issuers = map(cert, ~map_chr(.x$result, ~.x$issuer))) %>%
  mutate(
    inter = map_chr(issuers, ~.x[1]), # the order is not guaranteed here but the goal of the exercise is
    root = map_chr(issuers, ~.x[2])   # to get you working with the data vs build a 100% complete solution
  ) %>%
  mutate(
    inter = stri_replace_all_regex(inter, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>% # there are parswers for the cert info fields but this hack is quick and works
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols),
    root = stri_replace_all_regex(root, ",([[:alpha:]])+=", ";;;$1=") %>%
      stri_split_fixed(";;;") %>%
      map(stri_split_fixed, "=", 2, simplify = TRUE) %>%
      map(~setNames(as.list(.x[,2]), .x[,1])) %>%
      map(bind_cols)
  ) -> recent

Let’s take a look at roots:

unnest(recent, root) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN n pct
DST Root CA X3 96 26.82%
COMODO RSA Certification Authority 93 25.98%
DigiCert Global Root G2 45 12.57%
Baltimore CyberTrust Root 30 8.38%
GlobalSign 27 7.54%
DigiCert Global Root CA 15 4.19%
Go Daddy Root Certificate Authority – G2 14 3.91%
COMODO ECC Certification Authority 11 3.07%
Actalis Authentication Root CA 9 2.51%
GlobalSign Root CA 4 1.12%
Amazon Root CA 1 3 0.84%
Let’s Encrypt Authority X3 3 0.84%
AddTrust External CA Root 2 0.56%
DigiCert High Assurance EV Root CA 2 0.56%
USERTrust RSA Certification Authority 2 0.56%
GeoTrust Global CA 1 0.28%
SecureTrust CA 1 0.28%

DST Root CA X3 is (wait for it) Let’s Encrypt! Now, Comodo is not far behind and indeed surpasses LE if we combine the extra-special “enhanced” versions they provide and it’s important for you to read the comments near the lines of code making assumptions about order of returned issuer information above. Now, let’s take a look at intermediaries:

unnest(recent, inter) %>%
  distinct(phish_id, apex, CN) %>%
  count(CN, sort = TRUE) %>%
  mutate(pct = n/sum(n)) %>%
  gt::gt() %>%
  gt::fmt_number("n", decimals = 0) %>%
  gt::fmt_percent("pct")

CN n pct
Let’s Encrypt Authority X3 99 27.65%
cPanel\, Inc. Certification Authority 75 20.95%
RapidSSL TLS RSA CA G1 45 12.57%
Google Internet Authority G3 24 6.70%
COMODO RSA Domain Validation Secure Server CA 20 5.59%
CloudFlare Inc ECC CA-2 18 5.03%
Go Daddy Secure Certificate Authority – G2 14 3.91%
COMODO ECC Domain Validation Secure Server CA 2 11 3.07%
Actalis Domain Validation Server CA G1 9 2.51%
RapidSSL RSA CA 2018 9 2.51%
Microsoft IT TLS CA 1 6 1.68%
Microsoft IT TLS CA 5 6 1.68%
DigiCert SHA2 Secure Server CA 5 1.40%
Amazon 3 0.84%
GlobalSign CloudSSL CA – SHA256 – G3 2 0.56%
GTS CA 1O1 2 0.56%
AlphaSSL CA – SHA256 – G2 1 0.28%
DigiCert SHA2 Extended Validation Server CA 1 0.28%
DigiCert SHA2 High Assurance Server CA 1 0.28%
Don Dominio / MrDomain RSA DV CA 1 0.28%
GlobalSign Extended Validation CA – SHA256 – G3 1 0.28%
GlobalSign Organization Validation CA – SHA256 – G2 1 0.28%
RapidSSL SHA256 CA 1 0.28%
TrustAsia TLS RSA CA 1 0.28%
USERTrust RSA Domain Validation Secure Server CA 1 0.28%
NA 1 0.28%

LE is number one again! But, it’s important to note that these issuer CommonNames can roll up into a single issuing organization given just how messed up integrity and encryption capability is when it comes to web site certs, so the raw results could do with a bit of post-processing for a more complete picture (an exercise left to intrepid readers).

FIN

There are tons of avenues to explore with this data, so I hope this post whet your collective appetites sufficiently for you to dig into it, especially if you have some dowm-time coming.

Let me also take this opportunity to resissue guidance I and many others have uttered this holiday season: be super careful about what you click on, which sites you even just visit, and just how much you really trust the site, provider and entity behind the form about to enter your personal information and credit card info into.

I didn’t read through the Massachusetts 2011 Report on Data Breach Notifications [PDF] until recently, but once I went through the report my brain kept telling me “something is wrong”. Not something earth shattering, but more of a “something is off” signal. This happens more than I’d like as I tend to constantly background process what I intake visually.

As Twitter followers may lament, I have been known to transcribe useful tabular information from reports such as these, especially when I need to communicate them internally and I have done so with this report [gdocs] as well.

After working through the whole document, the last page of data is where I found the “off by one” error (see figure below). Someone performed “head math” vs copying & formatting from a spreadsheet. Never a good idea if you aren’t going to double-check the report thoroughly.

 

Off By One

My transcription (“Lost Stolen Misplaced” tab in the aforelinked workbook) assumes the “5” and “48” are correct and has the correct total (“53”). One of the problems when an error like this crops up is that you do not know where the error occurred, but since the sums of “12” and “277” are both correct in the spreadsheet and in the report, I think I’ve found the culprit. Unfortunately, a computational error such as this does foster suspicion on the accuracy of the rest of the report data.

It’s a lesson report writers should heed well: compute twice, publish once. Errant data can cut as deeply as a saw blade.

While I Have Your Attention

Since there aren’t many visualizations in  Massachusetts 2011 Report on Data Breach Notifications (3D numbers do not count), here are a few I made that I found helpful during my interpretation (2011 data unless otherwise specified):

# Residents Impacted By Breah Org

Number Of Breached By Org

Number of Breaches by Type 2008-2011

Residents Impacted By Breach Type

Lost/Stolen/Misplaced

Malicious/Non-Malicious

 

 

If you are concerned about the Dropbox design flaw exposed by the dbClone attack, then have we got a link for you!

The intrepid DB devs have tossed up a forum release which purports to fix all the thorny security issues. You can no longer just copy a config file to a separate machine to clone a filesystem and the file itself is now also encrypted. (Forum builds do not automagically download like standard Dropbox updates)

Given the fact that Dropbox did not prompt me for any credentials when I started the new version, I’m still a bit skeptical that it has truly fixed the problem. Given my schedule today, I doubt I’ll have time to poke at it before someone else does, but the thoroughness of this fix does need to be independently validated. The local Dropbox client has to be getting the encryption key/passphrase from *somewhere*, and if it’s not prompting me on start, then it’s stored online or locally and that’s a recipe for another hack.

There is nothing overt in the application bundle (looking on OS X) or quickly discernable from a dump of a few of the app’s .pyc files. Granted, a bit of obfuscation will stop the current hack and dissuade some other sophomoric attempts, but I can almost guarantee that the passphrase (or the algorithm one needs to discern the passphrase) will be found by folks.

The new build replaces your local configuration file with a new, encrypted one (now named config.dbx). I didn’t see signs of either SQLiteEncrypt, SEE, SQLCipher or SQLiteCrypt but haven’t had time to dig more thoroughly. It’s completely possible the Dropbox devs just built an encryption layer over the Dropbox calls themselves (which is not a difficult task).

Please note that forum builds are not necessarily stable and that this is a pretty major architecture change. I had no issues on OS X, but I suspect that any micro-errors in your SQLite config.db may cause some heartache if you do attempt the upgrade. Best to wait for a full production release if you do not have your Dropbox backed up somewhere.