
RIPE 76 is going on this week and — as usual — there are scads of great talks. The selected ones below are just my (slightly) thinner slice of what may have broader appeal outside pure networking circles.

Do not read anything more into the order than the end-number of the “Main URL”, since this list was auto-generated by a script that processed my Firefox tab URLs.

Artyom Gavrichenkov – Memcache Amplification DDoS: Lessons Learned

Erik Bais – Why Do We Still See Amplification DDOS Traffic

Jordi Palet Martinez – A New Internet Intro to HTTP/2, QUIC, DOH and DNS over QUIC

Sara Dickinson – DNS Privacy BCP

Jordi Palet Martinez – Email Servers on IPv6

Martin Winter – Real-Time BGP Toolkit: A New BGP Monitor Service

Job Snijders – Practical Data Sources For BGP Routing Security

Charles Eckel – Combining Open Source and Open Standards

Kostas Zorbadelos – Towards IPv6 Only: A large scale lw4o6 deployment (rfc7596) for broadband users @AS6799

Louis Poinsignon – Internet Noise (Announcing 1.1.1.0/24)

Filiz Yilmaz – Current Policy Topics – Global Policy Proposals

Geoff Huston – Measuring ATR

Moritz Muller, SIDN – DNSSEC Rollovers

Anand Buddhdev – DNS Status Report

Victoria Risk – A Survey on DNS Privacy

Baptiste Jonglez – High-Performance DNS over TCP

Sara Dickinson – Latest Measurements on DNS Privacy

Willem Toorop – Sunrise DNS-over-TLS! Sunset DNSSEC – Who Needs Reasons, When You’ve Got Heroes

Laurenz Wagner – A Modern Chatbot Approach for Accessing the RIPE Database

The U.S. FBI Internet Crime Complaint Center was established in 2000 to receive complaints of Internet crime. They produce an annual report, and the 2017 edition was just released. I need the data from it, and since I have to wrangle it out of the PDF anyway, I thought some folks might like to play along at home, especially since it turns out I had to use both tabulizer and pdftools to accomplish my goal.

Concepts presented:

  • PDF scraping (with both tabulizer and pdftools)
  • asciiruler
  • general string manipulation
  • case_when() vs ifelse() for text cleanup
  • reformatting data for ggraph treemaps

Let’s get started!


library(stringi)
library(pdftools)
library(tabulizer)
library(igraph)
library(ggraph) # devtools::install_github("thomasp85/ggraph")
library(hrbrthemes)
library(tidyverse)

ic3_file <- "~/Data/2017-ic3-report.pdf" # change "~/Data" for your system

if (!file.exists(ic3_file)) { # don't waste anyone's bandwidth
  download.file("https://pdf.ic3.gov/2017_IC3Report.pdf", ic3_file)
}

Let's try pdftools first, since I like text wrangling:


cat(pdftools::pdf_text(ic3_file)[[20]])
##                                                             2017 Internet Crime Report         20
## 2017 Crime Types
##                                  By Victim Count
## Crime Type                         Victims     Crime Type                            Victims
## Non-Payment/Non-Delivery           84,079      Misrepresentation                        5,437
## Personal Data Breach               30,904      Corporate Data Breach                    3,785
## Phishing/Vishing/Smishing/Pharming 25,344      Investment                               3,089
## Overpayment                        23,135      Malware/Scareware/Virus                  3,089
## No Lead Value                      20,241      Lottery/Sweepstakes                      3,012
## Identity Theft                     17,636      IPR/Copyright and                        2,644
##                                                Counterfeit
## Advanced Fee                       16,368      Ransomware                               1,783
## Harassment/Threats of Violence     16,194      Crimes Against Children                  1,300
## Employment                         15,784      Denial of Service/TDoS                   1,201
## BEC/EAC                            15,690      Civil Matter                             1,057
## Confidence Fraud/Romance           15,372      Re-shipping                              1,025
## Credit Card Fraud                  15,220      Charity                                    436
## Extortion                          14,938      Health Care Related                        406
## Other                              14,023      Gambling                                   203
## Tech Support                       10,949      Terrorism                                  177
## Real Estate/Rental                  9,645      Hacktivist                                 158
## Government Impersonation            9,149
## Descriptors*
## Social Media                       19,986      *These descriptors relate to the medium or
## Virtual Currency                    4,139      tool used to facilitate the crime, and are used
##                                                by the IC3 for tracking purposes only. They
##                                                are available only after another crime type
##                                                has been selected.

OK, I don't like text wrangling that much. How about tabulizer?


tabulizer::extract_tables(ic3_file, pages = 20)

## list()

Well, that's disappointing. Perhaps if we target the tables on the PDF pages we'll have better luck. You can find them on pages 20 and 21 if you downloaded your own copy.

I can't show the tabulizer pane (well I could if I had time to screen capture and make an animated gif) but run this to get the areas:


areas <- tabulizer::locate_areas(ic3_file, pages = 20:21)

# this is what ^^ produces for my rectangles:

list(
  c(top = 137.11911357341, left = 66.864265927978, bottom = 413.5512465374, right = 519.90581717452),
  c(top = 134.92520775623, left = 64.670360110803, bottom = 458.52631578947, right = 529.7783933518)
) -> areas

Now, see if tabulizer can do a better job. We'll start with the first page:


tab <- tabulizer::extract_tables(ic3_file, pages = 20, area = areas[1])

tab
## [[1]]
##       [,1]                                 [,2]              
##  [1,] ""                                   "By Victim Cou nt"
##  [2,] "Crime Type"                         "Victims"         
##  [3,] "Non-Payment/Non-Delivery"           "84,079"          
##  [4,] "Personal Data Breach"               "30,904"          
##  [5,] "Phishing/Vishing/Smishing/Pharming" "25,344"          
##  [6,] "Overpayment"                        "23,135"          
##  [7,] "No Lead Value"                      "20,241"          
##  [8,] "Identity Theft"                     "17,636"          
##  [9,] ""                                   ""                
## [10,] "Advanced Fee"                       "16,368"          
## [11,] "Harassment/Threats of Violence"     "16,194"          
## [12,] "Employment"                         "15,784"          
## [13,] "BEC/EAC"                            "15,690"          
## [14,] "Confidence Fraud/Romance"           "15,372"          
## [15,] "Credit Card Fraud"                  "15,220"          
## [16,] "Extortion"                          "14,938"          
## [17,] "Other"                              "14,023"          
## [18,] "Tech Support"                       "10,949"          
## [19,] "Real Estate/Rental"                 "9,645"           
## [20,] "G overnment Impersonation"          "9,149"           
## [21,] ""                                   ""                
## [22,] "Descriptors*"                       ""                
##       [,3]                      [,4]     
##  [1,] ""                        ""       
##  [2,] "Crime Type"              "Victims"
##  [3,] "Misrepresentation"       "5,437"  
##  [4,] "Corporate Data Breach"   "3,785"  
##  [5,] "Investment"              "3,089"  
##  [6,] "Malware/Scareware/Virus" "3,089"  
##  [7,] "Lottery/Sweepstakes"     "3,012"  
##  [8,] "IPR/Copyright and"       "2,644"  
##  [9,] "Counterfeit"             ""       
## [10,] "Ransomware"              "1,783"  
## [11,] "Crimes Against Children" "1,300"  
## [12,] "Denial of Service/TDoS"  "1,201"  
## [13,] "Civil Matter"            "1,057"  
## [14,] "Re-shipping"             "1,025"  
## [15,] "Charity"                 "436"    
## [16,] "Health Care Related"     "406"    
## [17,] "Gambling"                "203"    
## [18,] "Terrorism"               "177"    
## [19,] "Hacktivist"              "158"    
## [20,] ""                        ""       
## [21,] ""                        ""       
## [22,] ""                        ""

Looking good. How does it look data-frame'd?


tab <- as_data_frame(tab[[1]])

print(tab, n=50)
## # A tibble: 22 x 4
##    V1                                 V2               V3            V4   
##  1 ""                                 By Victim Cou nt ""            ""   
##  2 Crime Type                         Victims          Crime Type    Vict…
##  3 Non-Payment/Non-Delivery           84,079           Misrepresent… 5,437
##  4 Personal Data Breach               30,904           Corporate Da… 3,785
##  5 Phishing/Vishing/Smishing/Pharming 25,344           Investment    3,089
##  6 Overpayment                        23,135           Malware/Scar… 3,089
##  7 No Lead Value                      20,241           Lottery/Swee… 3,012
##  8 Identity Theft                     17,636           IPR/Copyrigh… 2,644
##  9 ""                                 ""               Counterfeit   ""   
## 10 Advanced Fee                       16,368           Ransomware    1,783
## 11 Harassment/Threats of Violence     16,194           Crimes Again… 1,300
## 12 Employment                         15,784           Denial of Se… 1,201
## 13 BEC/EAC                            15,690           Civil Matter  1,057
## 14 Confidence Fraud/Romance           15,372           Re-shipping   1,025
## 15 Credit Card Fraud                  15,220           Charity       436  
## 16 Extortion                          14,938           Health Care … 406  
## 17 Other                              14,023           Gambling      203  
## 18 Tech Support                       10,949           Terrorism     177  
## 19 Real Estate/Rental                 9,645            Hacktivist    158  
## 20 G overnment Impersonation          9,149            ""            ""   
## 21 ""                                 ""               ""            ""   
## 22 Descriptors*                       ""               ""            ""

Still pretty good. Cleaning it up is pretty simple from here. Just filter out some rows, parse some numbers, fix some chopped labels and boom - done:


tab <- filter(tab[3:21,], !V2 == "")

bind_rows(
  select(tab, crime = V1, n_victims = V2),
  select(tab, crime = V3, n_victims = V4)
) %>%
  filter(crime != "") %>%
  mutate(n_victims = readr::parse_number(n_victims)) %>%
  mutate(crime = case_when(
    stri_detect_fixed(crime, "G o") ~ "Government Impersonation",
    stri_detect_fixed(crime, "IPR/C") ~ "IPR/Copyright and Counterfeit",
    TRUE ~ crime
  )) %>%
  print(n=50) -> ic3_2017_crimes_victim_ct
## # A tibble: 33 x 2
##    crime                              n_victims
##    <chr>                                  <dbl>
##  1 Non-Payment/Non-Delivery              84079.
##  2 Personal Data Breach                  30904.
##  3 Phishing/Vishing/Smishing/Pharming    25344.
##  4 Overpayment                           23135.
##  5 No Lead Value                         20241.
##  6 Identity Theft                        17636.
##  7 Advanced Fee                          16368.
##  8 Harassment/Threats of Violence        16194.
##  9 Employment                            15784.
## 10 BEC/EAC                               15690.
## 11 Confidence Fraud/Romance              15372.
## 12 Credit Card Fraud                     15220.
## 13 Extortion                             14938.
## 14 Other                                 14023.
## 15 Tech Support                          10949.
## 16 Real Estate/Rental                     9645.
## 17 Government Impersonation               9149.
## 18 Misrepresentation                      5437.
## 19 Corporate Data Breach                  3785.
## 20 Investment                             3089.
## 21 Malware/Scareware/Virus                3089.
## 22 Lottery/Sweepstakes                    3012.
## 23 IPR/Copyright and Counterfeit          2644.
## 24 Ransomware                             1783.
## 25 Crimes Against Children                1300.
## 26 Denial of Service/TDoS                 1201.
## 27 Civil Matter                           1057.
## 28 Re-shipping                            1025.
## 29 Charity                                 436.
## 30 Health Care Related                     406.
## 31 Gambling                                203.
## 32 Terrorism                               177.
## 33 Hacktivist                              158.

Now, on to the second page!


tab <- tabulizer::extract_tables(ic3_file, pages = 21, area = areas[2])

tab
## [[1]]
##       [,1]                         [,2]                                
##  [1,] ""                           "By Victim Lo ss"                   
##  [2,] "Crime Type"                 "Loss  Crime Type"                  
##  [3,] "BEC/EAC"                    "$676,151,185 Misrepresentation"    
##  [4,] "Confidence Fraud/Romance"   "$211,382,989 Harassment/Threats"   
##  [5,] ""                           "of Violence"                       
##  [6,] "Non-Payment/Non-Delivery"   "$141,110,441 Government"           
##  [7,] ""                           "Impersonation"                     
##  [8,] "Investment"                 "$96,844,144 Civil Matter"          
##  [9,] "Personal Data Breach"       "$77,134,865 IPR/Copyright and"     
## [10,] ""                           "Counterfeit"                       
## [11,] "Identity Theft"             "$66,815,298 Malware/Scareware/"    
## [12,] ""                           "Virus"                             
## [13,] "Corporate Data Breach"      "$60,942,306 Ransomware"            
## [14,] "Advanced Fee"               "$57,861,324 Denial of Service/TDoS"
## [15,] "Credit Card Fraud"          "$57,207,248 Charity"               
## [16,] "Real Estate/Rental"         "$56,231,333 Health Care Related"   
## [17,] "Overpayment"                "$53,450,830 Re-Shipping"           
## [18,] "Employment"                 "$38,883,616 Gambling"              
## [19,] "Phishing/Vishing/Smishing/" "$29,703,421 Crimes Against"        
## [20,] "Pharming"                   "Children"                          
## [21,] "Other"                      "$23,853,704 Hacktivist"            
## [22,] "Lottery/Sweepstakes"        "$16,835,001 Terrorism"             
## [23,] "Extortion"                  "$15,302,792 N o Lead Value"        
## [24,] "Tech Support"               "$14,810,080"                       
## [25,] ""                           ""                                  
## [26,] ""                           ""                                  
##       [,3]          
##  [1,] ""            
##  [2,] "Loss"        
##  [3,] "$14,580,907" 
##  [4,] "$12,569,185" 
##  [5,] ""            
##  [6,] "$12,467,380" 
##  [7,] ""            
##  [8,] "$5,766,550"  
##  [9,] "$5,536,912"  
## [10,] ""            
## [11,] "$5,003,434"  
## [12,] ""            
## [13,] "$2,344,365"  
## [14,] "$1,466,195"  
## [15,] "$1,405,460"  
## [16,] "$925,849"    
## [17,] "$809,746"    
## [18,] "$598,853"    
## [19,] "$46,411"     
## [20,] ""            
## [21,] "$20,147"     
## [22,] "$18,926"     
## [23,] "$0"          
## [24,] ""            
## [25,] ""            
## [26,] "Descriptors*"

:facepalm: That's disappointing. Way too much scrambled content. So, back to pdftools!


cat(pg21 <- pdftools::pdf_text(ic3_file)[[21]])
##                                                    Internet Crime Complaint Center         21
## 2017 Crime Types Continued
##                             By Victim Loss
## Crime Type                 Loss            Crime Type                      Loss
## BEC/EAC                    $676,151,185    Misrepresentation               $14,580,907
## Confidence Fraud/Romance   $211,382,989    Harassment/Threats              $12,569,185
##                                            of Violence
## Non-Payment/Non-Delivery   $141,110,441    Government                      $12,467,380
##                                            Impersonation
## Investment                  $96,844,144    Civil Matter                      $5,766,550
## Personal Data Breach        $77,134,865    IPR/Copyright and                 $5,536,912
##                                            Counterfeit
## Identity Theft              $66,815,298    Malware/Scareware/                $5,003,434
##                                            Virus
## Corporate Data Breach       $60,942,306    Ransomware                        $2,344,365
## Advanced Fee                $57,861,324    Denial of Service/TDoS            $1,466,195
## Credit Card Fraud           $57,207,248    Charity                           $1,405,460
## Real Estate/Rental          $56,231,333    Health Care Related                 $925,849
## Overpayment                 $53,450,830    Re-Shipping                         $809,746
## Employment                  $38,883,616    Gambling                            $598,853
## Phishing/Vishing/Smishing/  $29,703,421    Crimes Against                        $46,411
## Pharming                                   Children
## Other                       $23,853,704    Hacktivist                            $20,147
## Lottery/Sweepstakes         $16,835,001    Terrorism                             $18,926
## Extortion                   $15,302,792    No Lead Value                                $0
## Tech Support                $14,810,080
##                                                                            Descriptors*
## Social Media                $56,478,483    *These descriptors relate to the medium or
## Virtual Currency            $58,391,810    tool used to facilitate the crime, and are used
##                                            by the IC3 for tracking purposes only. They
##                                            are available only after another crime type
##                                            has been selected.

This is (truthfully) not too bad. Just make columns from substring ranges and do some cleanup. The asciiruler package can definitely help here since it makes it much easier to see start/stop points (I used a new editor pane and copied some lines into it).
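Here's the general idea as a minimal sketch (asciiruler() with low/high bounds is the only thing assumed from the package): print a ruler above a sample line and read the stri_sub() start/stop columns right off it.

library(asciiruler)

# show a column ruler over one line of the page text so the substring
# boundaries used in the pipeline below can be read off visually
sample_line <- stri_split_lines(pg21)[[1]][5]
print(asciiruler(low = 1, high = nchar(sample_line)))
cat(sample_line, "\n")

With the start/stop points identified, the full extraction pipeline: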


stri_split_lines(pg21)[[1]] %>%
  .[-(1:4)] %>% # remove header & bits above header
  .[-(26:30)] %>% # remove trailing bits
  map_df(~{
    list(
      crime = stri_trim_both(c(stri_sub(.x, 1, 25), stri_sub(.x, 43, 73))),
      cost = stri_trim_both(c(stri_sub(.x, 27, 39), stri_sub(.x, 74))) # no length/to in the last one so it goes until eol
    )
  }) %>%
  filter(!(crime == "" | cost == "")) %>% # get rid of blank rows
  mutate(cost = suppressWarnings(readr::parse_number(cost))) %>% # we can use NAs generated to remove non-data rows
  filter(!is.na(cost)) %>%
  mutate(crime = case_when(
    stri_detect_fixed(crime, "Phish") ~ "Phishing/Vishing/Smishing/Pharming",
    stri_detect_fixed(crime, "Malware") ~ "Malware/Scareware/Virus",
    stri_detect_fixed(crime, "IPR") ~ "IPR/Copyright and Counterfeit",
    stri_detect_fixed(crime, "Harassment") ~ "Harassment/Threats of Violence",
    stri_detect_fixed(crime, "Government") ~ "Government Impersonation", # also wrapped in the PDF
    stri_detect_fixed(crime, "Crimes Against") ~ "Crimes Against Children", # ditto; both need to match the victim-count table for the join later
    TRUE ~ crime
  )) %>%
  print(n=50) -> ic3_2017_crimes_cost
## # A tibble: 35 x 2
##    crime                                    cost
##  1 BEC/EAC                            676151185.
##  2 Misrepresentation                   14580907.
##  3 Confidence Fraud/Romance           211382989.
##  4 Harassment/Threats of Violence      12569185.
##  5 Non-Payment/Non-Delivery           141110441.
##  6 Government Impersonation            12467380.
##  7 Investment                          96844144.
##  8 Civil Matter                         5766550.
##  9 Personal Data Breach                77134865.
## 10 IPR/Copyright and Counterfeit        5536912.
## 11 Identity Theft                      66815298.
## 12 Malware/Scareware/Virus              5003434.
## 13 Corporate Data Breach               60942306.
## 14 Ransomware                           2344365.
## 15 Advanced Fee                        57861324.
## 16 Denial of Service/TDoS               1466195.
## 17 Credit Card Fraud                   57207248.
## 18 Charity                              1405460.
## 19 Real Estate/Rental                  56231333.
## 20 Health Care Related                   925849.
## 21 Overpayment                         53450830.
## 22 Re-Shipping                           809746.
## 23 Employment                          38883616.
## 24 Gambling                              598853.
## 25 Phishing/Vishing/Smishing/Pharming  29703421.
## 26 Crimes Against Children                46411.
## 27 Other                               23853704.
## 28 Hacktivist                             20147.
## 29 Lottery/Sweepstakes                 16835001.
## 30 Terrorism                              18926.
## 31 Extortion                           15302792.
## 32 No Lead Value                              0.
## 33 Tech Support                        14810080.
## 34 Social Media                        56478483.
## 35 Virtual Currency                    58391810.

Now that we have real data, we can take a look at the IC3 crimes by loss and victims.

We'll use treemaps first then take a quick look at the relationship between counts and losses.

Just need to do some data wrangling for ggraph, starting with victims:


ic3_2017_crimes_victim_ct %>%
  mutate(crime = case_when(
    crime == "Government Impersonation" ~ "Government\nImpersonation",
    crime == "Corporate Data Breach" ~ "Corporate\nData\nBreach",
    TRUE ~ crime
  )) %>%
  mutate(crime = stri_replace_all_fixed(crime, "/", "/\n")) %>%
  mutate(grp = "ROOT") %>%
  add_row(grp = "ROOT", crime="ROOT", n_victims=0) %>%
  select(grp, crime, n_victims) %>%
  arrange(desc(n_victims)) -> gdf

select(gdf, -grp) %>%
  mutate(lab = sprintf("%s\n(%s)", crime, scales::comma(n_victims))) %>%
  mutate(lab = ifelse(n_victims > 1300, lab, "")) %>% # don't show a label when blocks are small
  mutate(lab_col = ifelse(n_victims > 40000, "#2b2b2b", "#cccccc")) -> vdf # change up colors when blocks are lighter

g <- graph_from_data_frame(gdf, vertices=vdf)

ggraph(g, "treemap", weight=n_victims) +
  geom_node_tile(aes(fill=n_victims), size=0.25) +
  geom_text(
    aes(x, y, label=lab, size=n_victims, color = I(lab_col)),
    family=font_ps, lineheight=0.875
  ) +
  scale_x_reverse(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  scale_size_continuous(trans = "sqrt", range = c(0.5, 8)) +
  labs(title = "FBI 2017 Internet Crime Report — Crimes By Victim Count") +
  ggraph::theme_graph(base_family = font_ps) +
  theme(plot.title = element_text(color="#cccccc", family = "IBMPlexSans-Bold")) +
  theme(panel.background = element_rect(fill="black")) +
  theme(plot.background = element_rect(fill="black")) +
  theme(legend.position="none")


Now, do the same for losses:

ic3_2017_crimes_cost %>%
  mutate(crime = case_when(
    crime == "Phishing/Vishing/Smishing/Pharming" ~ "Phishing/Vishing/\nSmishing/Pharming",
    crime == "Harassment/Threats of Violence" ~ "Harassment/\nThreats of Violence",
    crime == "Lottery/Sweepstakes" ~ "Lottery\nSweepstakes",
    TRUE ~ crime
  )) %>%
  filter(cost > 0) %>%
  mutate(cost = cost / 1000000) %>%
  mutate(grp = "ROOT") %>%
  add_row(grp = "ROOT", crime="ROOT", cost=0) %>%
  select(grp, crime, cost) %>%
  arrange(desc(cost)) -> gdf_cost

select(gdf_cost, -grp) %>%
  mutate(lab = sprintf("%s\n($%s M)", crime, prettyNum(cost, digits=2))) %>%
  mutate(lab = ifelse(cost > 5.6, lab, "")) %>%
  mutate(lab_col = ifelse(cost > 600, "#2b2b2b", "#cccccc")) -> vdf_cost

g_cost <- graph_from_data_frame(gdf_cost, vertices=vdf_cost)

ggraph(g_cost, "treemap", weight=cost) +
  geom_node_tile(aes(fill=cost), size=0.25) +
  geom_text(
    aes(x, y, label=lab, size=cost, color=I(lab_col)),
    family=font_ps, lineheight=0.875
  ) +
  scale_x_reverse(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  scale_size_continuous(trans = "sqrt", range = c(0.5, 8)) +
  labs(title = "FBI 2017 Internet Crime Report — Crime Loss By Category") +
  ggraph::theme_graph(base_family = font_ps) +
  theme(plot.title = element_text(color="#cccccc", family = "IBMPlexSans-Bold")) +
  theme(panel.background = element_rect(fill="black")) +
  theme(plot.background = element_rect(fill="black")) +
  theme(legend.position="none")

Let's plot victim counts vs losses to see what stands out:


left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>%
  filter(!is.na(cost)) %>%
  ggplot(aes(n_victims, cost)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) +
  scale_x_comma() +
  scale_y_continuous(labels=scales::dollar) +
  labs(
    x = "# of Victims", y = "Loss magnitude",
    title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category"
  ) +
  theme_ipsum_ps(grid="XY")

BEC stands for "business email compromise" and it's definitely a major problem, but those two count/loss outliers are not helping us see the rest of the data. Let's zoom in:


left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>%
  filter(!is.na(cost)) %>%
  filter(cost < 300000000) %>%
  filter(n_victims < 40000) %>%
  ggplot(aes(n_victims, cost)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) +
  scale_x_comma() +
  scale_y_continuous(labels=scales::dollar) +
  labs(
    x = "# of Victims", y = "Loss magnitude",
    title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category",
    subtitle = "NOTE: BEC/EAC and Non-payment/Non-delivery removed"
  ) +
  theme_ipsum_ps(grid="XY")

Better, but let's zoom in a bit more:


left_join(ic3_2017_crimes_victim_ct, ic3_2017_crimes_cost) %>%
  filter(!is.na(cost)) %>%
  filter(cost < 50000000) %>%
  filter(n_victims < 10000) %>%
  ggplot(aes(n_victims, cost)) +
  geom_point() +
  ggrepel::geom_label_repel(aes(label = crime), family=font_ps, size=3) +
  scale_x_comma() +
  scale_y_continuous(labels=scales::dollar) +
  labs(
    x = "# of Victims", y = "Loss magnitude",
    title = "FBI 2017 Internet Crime Report — Crime Loss By Victim Count ~ Category",
    subtitle = "NOTE: Only includes losses between $0-50 M USD & victim counts <= 10,000 "
  ) +
  theme_ipsum_ps(grid="XY")

Looks like the ransomware folks have quite a bit of catching up to do (at least when it comes to crimes reported to the IC3).

Earlier today, @noamross posted to Twitter:

The answer was a 1:1 “file upload” curl to httr translation:

httr::POST(
  url = "https://file.io",
  encode = "multipart",
  body = list(file = httr::upload_file("/some/path/to/file"))
)

but I wanted to do more than that since Noam took 20 minutes out of his day this week (with no advance warning) to speak to my intro-to-stats class about his work and R.

The Twitter request was (ultimately) a question on how to use R to post content to https://file.io. They have a really simple API, and the timespan from Noam’s request to the initial commit of a fully functional package was roughly 17 minutes. The end product included the ability to post files, strings and R data (something that seemed like a good thing to add).
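Posting a string (rather than a file) against the raw API is just as simple. Here's a sketch based on file.io's own curl examples, which use a text form field; presumably this is close to what the package's text-posting helper wraps:

res <- httr::POST(
  url = "https://file.io",
  body = list(text = "something ephemeral"), # file.io's documented 'text' field
  encode = "form"
)

httr::content(res, as = "parsed") # returns success/key/link/expiry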

Not too long after came a v0.1.0 release complete with tests and passing CRAN checks on all platforms.

Noam also suggested I do a screencast.

I don’t normally do screencasts, but I had some conference call time, so folks can follow along at home.

That’s not the best screencast in the world, but it’s very representative of the workflow I used. A great deal of boilerplate package machinations is accomplished with this bash script.

I wasn’t happy with the hurried function name choices I made nor was I thrilled with the package title, description, tests and basic docs, so I revamped all those into another release. That took a while, mostly due to constantly triggering API warnings about being rate-limited.

So, if you have a file of 5 GB or less, a character vector or in-memory R data you’d like to ephemerally share with others, take the fileio package for a spin:

devtools::install_github("hrbrmstr/fileio")

fileio::fi_post_text("TWFrZSBzdXJlIHRvIEAgbWUgb24gVHdpdHRlciBpZiB5b3UgZGVjb2RlIHRoaXM=")
##   success    key                   link  expiry
## 1    TRUE n18ZSB https://file.io/n18ZSB 14 days

(bonus points if you can figure out what that seemingly random sequence of characters says).

I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis package, which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and to dealing with Java separately from R, so I’m comfortable with Java, JDBC and keeping R working with Java. I notified the Twitterverse about it and it started a thread.

If you do scroll through the thread you’ll see @hadleywickham suggested using the odbc package with the ODBC driver for Athena.

I, and others, have noted that ODBC and macOS (and — for me, at least — Linux) never really played well together. Given that I’m familiar with JDBC, I just gravitated towards using it after trying it out with raw Java, and it worked fine in R.

Never one to discount advice from Hadley, I quickly grabbed the Athena ODBC driver and installed it and wired up an odbc + dplyr connection almost instantly:

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib", 
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AuthenticationType = "Default Credentials",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

some_tbl <- tbl(con, "elb_logs")

some_tbl
## # Source:   table<elb_logs> [?? x 16]
## # Database: Amazon Athena 01.00.0000[@Amazon Athena/AwsDataCatalog]
##    timestamp    elbname requestip  requestport backendip backendport
##    <chr>        <chr>   <chr>            <int> <chr>           <int>
##  1 2014-09-26T… lb-demo 249.6.80.…        5123 249.6.80…        8888
##  2 2014-09-26T… lb-demo 246.22.15…        5123 248.178.…        8888
##  3 2014-09-26T… lb-demo 248.179.3…       45667 254.70.2…         443
##  4 2014-09-26T… lb-demo 243.2.127…       14496 248.178.…          80
##  5 2014-09-26T… lb-demo 247.76.18…        6887 252.0.81…        8888
##  6 2014-09-26T… lb-demo 254.110.3…       22052 248.178.…        8888
##  7 2014-09-26T… lb-demo 249.113.2…       24902 245.241.…        8888
##  8 2014-09-26T… lb-demo 246.128.7…        5123 244.202.…        8888
##  9 2014-09-26T… lb-demo 249.6.80.…       24902 255.226.…        8888
## 10 2014-09-26T… lb-demo 253.102.6…        6887 246.22.1…        8888
## # ... with more rows, and 10 more variables:
## #   requestprocessingtime <dbl>, backendprocessingtime <dbl>,
## #   clientresponsetime <dbl>, elbresponsecode <chr>,
## #   backendresponsecode <chr>, receivedbytes <S3: integer64>,
## #   sentbytes <S3: integer64>, requestverb <chr>, url <chr>,
## #   protocol <chr>

The TLDR is that I can now use 100% dplyr idioms with Athena instead of having to bolt them onto the RJDBC driver I made for metis. The metis package will still be around to support JDBC on systems that do have issues with ODBC and to add other methods that work with the AWS Athena API (managing Athena vs the interactive queries part).
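To make that concrete, a pipeline like this (a quick sketch against the elb_logs table from above) gets translated to Athena SQL lazily, and nothing actually runs until collect():

some_tbl %>%
  filter(elbresponsecode == "200") %>%       # becomes a WHERE clause
  count(elbname, requestip, sort = TRUE) %>% # GROUP BY + ORDER BY
  head(10) %>%                               # LIMIT 10
  collect()                                  # only now does Athena execute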

The downside is that I’m now even more likely to run up the AWS bill ;-)

What About Drill?

I also maintain the sergeant package, which provides REST API and REST query access to Apache Drill along with a REST API DBI driver and an RJDBC interface for Drill. I remember trying (and failing) to get the MapR ODBC client working with R a few years ago, which is why I made the package (it was also a great learning experience).

I noticed there was a very recent MapR Drill ODBC driver released. Since I was on a roll, I figured why not try it one more time, especially since the RStudio team has made it dead simple to work with ODBC from R.

library(odbc)
library(tidyverse)

DBI::dbConnect(
  odbc::odbc(), 
  driver = "/Library/mapr/drill/lib/libdrillodbc_sbu.dylib",
  ConnectionType = "Zookeeper",
  AuthenticationType = "No Authentication",
  ZKClusterID = "CLUSTERID",
  ZkQuorum = "HOST:2181",
  AdvancedProperties = "CastAnyToVarchar=true;HandshakeTimeout=30;QueryTimeout=180;TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;"
) -> drill_con

(employee <- tbl(drill_con, sql("SELECT * FROM cp.`employee.json`")))
## # Source:   SQL [?? x 16]
## # Database: Drill 01.13.0000[@Apache Drill Server/DRILL]
##    employee_id   full_name    first_name last_name position_id   position_title   store_id  
##    <S3: integer> <chr>        <chr>      <chr>     <S3: integer> <chr>            <S3: inte>
##  1 1             Sheri Nowmer Sheri      Nowmer    1             President        0         
##  2 2             Derrick Whe… Derrick    Whelply   2             VP Country Mana… 0         
##  3 4             Michael Spe… Michael    Spence    2             VP Country Mana… 0         
##  4 5             Maya Gutier… Maya       Gutierrez 2             VP Country Mana… 0         
##  5 6             Roberta Dam… Roberta    Damstra   3             VP Information … 0         
##  6 7             Rebecca Kan… Rebecca    Kanagaki  4             VP Human Resour… 0         
##  7 8             Kim Brunner  Kim        Brunner   11            Store Manager    9         
##  8 9             Brenda Blum… Brenda     Blumberg  11            Store Manager    21        
##  9 10            Darren Stanz Darren     Stanz     5             VP Finance       0         
## 10 11            Jonathan Mu… Jonathan   Murraiin  11            Store Manager    1         
## # ... with more rows, and 9 more variables: department_id <S3: integer64>, birth_date <chr>,
## #   hire_date <chr>, salary <dbl>, supervisor_id <S3: integer64>, education_level <chr>,
## #   marital_status <chr>, gender <chr>, management_role <chr>

count(employee, position_title, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   Drill 01.13.0000[@Apache Drill Server/DRILL]
## # Ordered by: desc(n)
##    position_title            n              
##    <chr>                     <S3: integer64>
##  1 Store Temporary Checker   268            
##  2 Store Temporary Stocker   264            
##  3 Store Permanent Checker   226            
##  4 Store Permanent Stocker   222            
##  5 Store Shift Supervisor    52             
##  6 Store Permanent Butcher   32             
##  7 Store Manager             24             
##  8 Store Assistant Manager   24             
##  9 Store Information Systems 16             
## 10 HQ Finance and Accounting 8              
## # ... with more rows

Apart from having to do that sql(…) to make the table connection work, it was pretty painless and I had both Athena and Drill working with dplyr verbs in under ten minutes (total).

You can head on over to the main Apache Drill site to learn all about the ODBC driver configuration parameters, and I’ve updated my ongoing Using Apache Drill with R e-book to include this information. I will keep maintaining the existing sergeant package, but will also be including some additional methods to provide ODBC usage guidance and potentially other helpers if there are any “gotchas” that arise.

FIN

The odbc package is super-slick and it’s refreshing to be able to use dplyr verbs with Athena vs gosh-awful SQL. However, for some of our needs the hand-crafted queries will still be necessary, as they are far more optimized than what would likely get pieced together via the dplyr verbs. That said, those queries can also be put right into sql() with the Athena ODBC driver connection and used via the same dplyr verb magic afterwards.
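In other words, something along these lines (a sketch; Athena/Presto's approx_distinct() is used purely for illustration) drops a hand-tuned query into the same pipeline:

tbl(con, sql("
  SELECT elbname, approx_distinct(requestip) AS uniq_ips
  FROM elb_logs
  GROUP BY elbname
")) %>%
  arrange(desc(uniq_ips)) %>% # dbplyr wraps the raw SQL in a subquery
  collect()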

Today is, indeed, a good day to query!

This week’s edition of Data is Plural had two really fun data sets. One is serious fun (the first comprehensive data set on U.S. evictions), and the other I knew about but had forgotten: the Federal Register Executive Order (EO) data set(s).

The EO data is also comprehensive as the summary JSON (or CSV) files have links to more metadata and even more links to the full-text in various formats.

What follows is a quick post to help bootstrap folks who may want to do some tidy text mining on this data. We’ll look at EOs-per-year (per-POTUS) and also take a look at the “top 5 ‘first words’” in the titles of the EOs (also by POTUS).

Ingesting the Data

The EO main page has a list of EO JSON files by POTUS. We’re going to scrape this so we can classify the EOs by POTUS (we could also just use the Federal Register API since @thosjleeper wrote a spiffy package to access it):

library(rvest)
library(stringi)
library(pluralize) # devtools::install_github("hrbrmstr/pluralize")
library(hrbrthemes)
library(tidyverse)

#' Retrieve the Federal Register main EO page so we can get the links for each POTUS
pg <- read_html("https://www.federalregister.gov/executive-orders") 

#' Find the POTUS EO data nodes, excluding the one for "All"
html_nodes(pg, "ul.bulk-files") %>% 
  html_nodes(xpath = ".//li[span[a[contains(@href, 'json')]] and 
                            not(span[contains(., 'All')])]") -> potus_nodes

#' Turn the POTUS info into a data frame with the POTUS name and EO JSON link,
#' then retrieve the JSON file and make a data frame of individual data elements
data_frame(
  potus = html_nodes(potus_nodes, "span:nth-of-type(1)") %>% html_text(),
  eo_link = html_nodes(potus_nodes, "a[href *= 'json']") %>% 
    html_attr("href") %>% 
    sprintf("https://www.federalregister.gov%s", .)
) %>% 
  mutate(eo = map(eo_link, jsonlite::fromJSON)) %>% 
  mutate(eo = map(eo, "results")) %>% 
  unnest() -> eo_df

glimpse(eo_df)
## Observations: 887
## Variables: 16
## $ potus                  <chr> "Donald Trump", "Donald Trump", "Donald Trump", "Donald Trump", "Donald Trump", "D...
## $ eo_link                <chr> "https://www.federalregister.gov/documents/search.json?conditions%5Bcorrection%5D=...
## $ citation               <chr> "82 FR 8351", "82 FR 8657", "82 FR 8793", "82 FR 8799", "82 FR 8977", "82 FR 9333"...
## $ document_number        <chr> "2017-01799", "2017-02029", "2017-02095", "2017-02102", "2017-02281", "2017-02450"...
## $ end_page               <int> 8352, 8658, 8797, 8803, 8982, 9338, 9341, 9966, 10693, 10696, 10698, 10700, 12287,...
## $ executive_order_notes  <chr> NA, "See: EO 13807, August 15, 2017", NA, NA, "See: EO 13780, March 6, 2017", "Sup...
## $ executive_order_number <int> 13765, 13766, 13767, 13768, 13769, 13770, 13771, 13772, 13773, 13774, 13775, 13776...
## $ html_url               <chr> "https://www.federalregister.gov/documents/2017/01/24/2017-01799/minimizing-the-ec...
## $ pdf_url                <chr> "https://www.gpo.gov/fdsys/pkg/FR-2017-01-24/pdf/2017-01799.pdf", "https://www.gpo...
## $ publication_date       <chr> "2017-01-24", "2017-01-30", "2017-01-30", "2017-01-30", "2017-02-01", "2017-02-03"...
## $ signing_date           <chr> "2017-01-20", "2017-01-24", "2017-01-25", "2017-01-25", "2017-01-27", "2017-01-28"...
## $ start_page             <int> 8351, 8657, 8793, 8799, 8977, 9333, 9339, 9965, 10691, 10695, 10697, 10699, 12285,...
## $ title                  <chr> "Minimizing the Economic Burden of the Patient Protection and Affordable Care Act ...
## $ full_text_xml_url      <chr> "https://www.federalregister.gov/documents/full_text/xml/2017/01/24/2017-01799.xml...
## $ body_html_url          <chr> "https://www.federalregister.gov/documents/full_text/html/2017/01/24/2017-01799.ht...
## $ json_url               <chr> "https://www.federalregister.gov/api/v1/documents/2017-01799.json", "https://www.f...
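Since each record carries a full_text_xml_url, the complete text of any EO is one call away for deeper tidy text mining (a sketch; xml2 does the parsing, and the first row is the ACA order per the glimpse above):

library(xml2)

# grab and flatten the full text of the first EO retrieved above
eo_full <- read_xml(eo_df$full_text_xml_url[1])
substr(xml_text(eo_full), 1, 300)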

EOs By Year

To see how many EOs were signed per-year, per-POTUS, we’ll convert the signing_date into a year (and return it back to a Date object so we get spiffier plot labels), factor order the POTUS names and mark the start of each POTUS term. I’m not usually a fan of stacked bar charts, but since there will only be — at most — two segments, I think they work well and it also shows just how many EOs are established in year one of a POTUS term:

mutate(eo_df, year = lubridate::year(signing_date)) %>% 
  mutate(year = as.Date(sprintf("%s-01-01", year))) %>% 
  count(year, potus) %>%
  mutate(
    potus = factor(
      potus, 
      levels = c("Donald Trump", "Barack Obama", "George W. Bush", "William J. Clinton")
    )
  ) %>%
  ggplot(aes(year, n, group=potus)) +
  geom_col(position = "stack", aes(fill = potus)) +
  scale_x_date(
    name = NULL,
    expand = c(0,0),
    breaks = as.Date(c("1993-01-01", "2001-01-01", "2009-01-01", "2017-01-01")),
    date_labels = "%Y",
    limits = as.Date(c("1992-01-01", "2020-12-31"))
  ) +
  scale_y_comma(name = "# EOs") +
  scale_fill_ipsum(name = NULL) +
  guides(fill = guide_legend(reverse=TRUE)) +
  labs(
    title = "Number of Executive Orders Signed Per-Year, Per-POTUS",
    subtitle = "1993-Present",
    caption = "Source: Federal Register <https://www.federalregister.gov/executive-orders>"
  ) +
  theme_ipsum_rc(grid = "Y") +
  theme(legend.position = "bottom")

Favourite First (Title) Words

I’ll let some eager tidy text miners go-to-town on the full text links and just focus on one aspect of the EO titles: the “first” words. These are generally words like “Amending”, “Establishing”, “Promoting”, etc. to give citizens a quick idea of what the order is supposed to be doing. We’ll remove common words, turn plurals into singulars and also get rid of years/dates to make the data a bit more useful, then focus on the “top 5” first words used by each POTUS (and show all the first words across each POTUS). I’m using raw counts here (since this is a quick post) but another view normalized by percent of all POTUS EOs might prove more interesting/valuable.
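First we need titles_df, the per-EO table of title first words. That step is reconstructed here as a minimal sketch consistent with the description above (the regexes and column choices are my own):

# strip any leading dates/years from the titles, then grab the first word
mutate(eo_df, title = stri_replace_first_regex(title, "^[0-9][0-9/,. -]*", "")) %>%
  mutate(first_word = stri_extract_first_regex(title, "[A-Za-z-]+")) %>%
  select(potus, title, first_word) -> titles_df

Now singularize, count and trim the first words: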

mutate(titles_df, first_word = singularize(first_word)) %>% 
  count(potus, first_word, sort=TRUE) %>% 
  filter(!stri_detect_regex(first_word, "President|Federal|National")) %>%
  mutate(first_word = stri_replace_all_fixed(first_word, "Establishment", "Establishing")) %>% 
  mutate(first_word = stri_replace_all_fixed(first_word, "Amendment", "Amending")) -> first_words

group_by(first_words, potus) %>%
  top_n(5) %>%
  ungroup() %>%
  distinct(first_word) %>%
  pull(first_word) -> all_first_words

filter(first_words, first_word %in% all_first_words) %>% 
  mutate(
    potus = factor(
      potus, 
      levels = c("Donald Trump", "Barack Obama", "George W. Bush", "William J. Clinton")
    )
  ) %>% 
  mutate(
    first_word = factor(
      first_word, 
      levels = rev(sort(unique(first_word)))
    )
  ) -> first_df

ggplot(first_df, aes(n, first_word)) +
  geom_segment(aes(xend=0, yend=first_word, color=potus), size=4) +
  scale_x_comma(limits=c(0,40)) +
  scale_y_discrete(limits = sort(unique(first_df$first_word))) +
  facet_wrap(~potus, scales = "free", ncol = 2) +
  labs(
    x = "# EOs",
    y = NULL,
    title = "Top 5 Executive Order 'First Words' by POTUS",
    subtitle = "1993-Present",
    caption = "Source: Federal Register <https://www.federalregister.gov/executive-orders>"
  ) +
  theme_ipsum_rc(grid="X", strip_text_face = "bold") +
  theme(panel.spacing.x = unit(5, "lines")) +
  theme(legend.position="none")

FWIW I expected more “Revocation”/”Removing” from the current tangerine-in-chief, but there’s plenty “Enforcing” and “Blocking” to make up for it (being the “tough guy” that he likes to pretend he is).

FIN

There’s way more that can be done with this data set and hopefully folks will take it for a spin and come up with their own interesting views. If you do, drop a note in the comments with a link to your creation(s)!

The code blocks are all combined into this gist.

If you come here often you’ve noticed that I’ve been writing a semi-frequent series on using the Feedly API with R.

A recent post was created to help someone use the API. It worked for them but — as you can see in the comment — an assertion was made that these items were “locked away”. This is far from the case.

Feedly lets you hook up Dropbox to Feedly. That does a bunch of things, the first of which is that your Dropbox folder (i.e. ~/Dropbox) now has a ~/Dropbox/Apps/Feedly Vault directory where Feedly will store all sorts of wonderful items:

.
├── OPML Backup
├── Saved For Later
└── Tags

Copies of your OPML file (the XML container that has the references to all the RSS feeds you subscribe to) are backed up in OPML Backup every time there is a change to them. I’ve made 127 changes to my RSS feeds since 2014 and they’re all backed up in OPML Backup, ready to be processed with R or some other, inferior programming language.
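For instance, counting the feeds in the most recent backup is nearly a one-liner (a sketch; the file layout is assumed from the directory listing above, and xml2 handles the OPML parsing):

library(xml2)

# find the newest OPML backup and count the subscribed feeds in it
opml_backups <- list.files("~/Dropbox/Apps/Feedly Vault/OPML Backup", full.names = TRUE)
latest <- opml_backups[which.max(file.mtime(opml_backups))]

length(xml_find_all(read_xml(latest), ".//outline[@xmlUrl]"))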

The Saved for Later folder has a set of sub-directories by year:

Saved For Later/
├── 2011
├── 2012
├── 2013
├── 2014
├── 2015
├── 2016
├── 2017
└── 2018

Inside each of those annums are HTML files for all the posts you’ve, well, saved for later. The HTML contains the view you saw in the Feedly reader pane.
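That makes them trivially minable. Here's a hedged sketch (rvest + purrr; paths per the tree above) that pulls the <title> of everything saved in a given year:

library(rvest)
library(purrr)

# read each saved-for-later HTML snapshot and extract its page title
list.files(
  "~/Dropbox/Apps/Feedly Vault/Saved For Later/2018",
  pattern = "\\.html$", full.names = TRUE
) %>%
  map_chr(~html_text(html_node(read_html(.x), "title")))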

Astute readers will notice directories for 2011, 2012 and 2013. Feedly was not around back then. So, what are they? They are the “saved posts” you had when/if you used Google Reader (back in the day) and did an initial import from GReader to Feedly to begin your new RSS journey. (Feedly devs are 100% awesome).

Similarly, the Tags folder has copies of the HTML for anything you’ve filed under a tag/board.

So, if you’re not keen on using the Feedly API but want direct or programmatic access to your OPML file and saved content, look no further than a simple Dropbox directory traversal.

@mkjcktzn asked if one can access Feedly “Saved for Later” items via the API. The answer is “Yes!”, and it builds off of that previous post. You’ll need to read it and get your authentication key (still no package) before continuing.

We’ll use most (I think “all”) of the code from the previous post, so let’s bring that over here:

library(httr)
library(tidyverse)

.pkgenv <- new.env(parent=emptyenv())
.pkgenv$token <- Sys.getenv("FEEDLY_ACCESS_TOKEN")

.feedly_token <- function() return(.pkgenv$token)

feedly_stream <- function(stream_id, ct=100L, continuation=NULL) {
  
  ct <- as.integer(ct)
  
  if (!is.null(continuation)) ct <- 1000L # when paging, grab the max allowed per call
  
  httr::GET(
    url = "https://cloud.feedly.com/v3/streams/contents",
    httr::add_headers(
      `Authorization` = sprintf("OAuth %s", .feedly_token())
    ),
    query = list(
      streamId = stream_id,
      count = ct,
      continuation = continuation
    )
  ) -> res
  
  httr::stop_for_status(res)
  
  res <- httr::content(res, as="text")
  res <- jsonlite::fromJSON(res)
  
  res
  
}

According to the Feedly API Overview there is a “global resource id” which is formatted like user/:userId/tag/global.saved and defined as “Users can save articles for later. Equivalent of starring articles in Google Reader.”.

The “Saved for Later” feature is quite handy and all we need to do to get access to it is substitute our user id for :userId. To do that, we’ll build a helper function:

feedly_profile <- function() {
  
  httr::GET(
    url = "https://cloud.feedly.com/v3/profile",
    httr::add_headers(
      `Authorization` = sprintf("OAuth %s", .feedly_token())
    )
  ) -> res
  
  httr::stop_for_status(res)
  
  res <- httr::content(res, as="text")
  res <- jsonlite::fromJSON(res)
  
  class(res) <- c("feedly_profile")
  
  res
  
}

When that function is called, it returns a ton of user profile information in a list, including the id that we need:

me <- feedly_profile()

str(me, 1)
## List of 46
##  $ id                          : chr "9b61e777-6ee2-476d-a158-03050694896a"
##  $ client                      : chr "feedly"
##  $ email                       : chr "...@example.com"
##  $ wave                        : chr "2013.26"
##  $ logins                      :'data.frame': 4 obs. of  6 variables:
##  $ product                     : chr "Feedly..."
##  $ picture                     : chr "https://..."
##  $ twitter                     : chr "hrbrmstr"
##  $ givenName                   : chr "..."
##  $ evernoteUserId              : chr "112233"
##  $ familyName                  : chr "..."
##  $ google                      : chr "1100199130101939"
##  $ gender                      : chr "..."
##  $ windowsLiveId               : chr "1020d010389281e3"
##  $ twitterUserId               : chr "99119939"
##  $ twitterProfileBannerImageUrl: chr "https://..."
##  $ evernoteStoreUrl            : chr "https://..."
##  $ evernoteWebApiPrefix        : chr "https://..."
##  $ evernotePartialOAuth        : logi ...
##  $ dropboxUid                  : chr "54555"
##  $ subscriptionPaymentProvider : chr "......"
##  $ productExpiration           : num 2.65e+12
##  $ subscriptionRenewalDate     : num 2.65e+12
##  $ subscriptionStatus          : chr "Active"
##  $ upgradeDate                 : num 2.5e+12
##  $ backupTags                  : logi TRUE
##  $ backupOpml                  : logi TRUE
##  $ dropboxConnected            : logi TRUE
##  $ twitterConnected            : logi TRUE
##  $ customGivenName             : chr "..."
##  $ customFamilyName            : chr "..."
##  $ customEmail                 : chr "...@example.com"
##  $ pocketUsername              : chr "...@example.com"
##  $ windowsLivePartialOAuth     : logi TRUE
##  $ facebookConnected           : logi FALSE
##  $ productRenewalAmount        : int 1111
##  $ evernoteConnected           : logi TRUE
##  $ pocketConnected             : logi TRUE
##  $ wordPressConnected          : logi FALSE
##  $ windowsLiveConnected        : logi TRUE
##  $ dropboxOpmlBackup           : logi TRUE
##  $ dropboxTagBackup            : logi TRUE
##  $ backupPageFormat            : chr "Html"
##  $ dropboxFormat               : chr "Html"
##  $ locale                      : chr "en_US"
##  $ fullName                    : chr "..."
##  - attr(*, "class")= chr "feedly_profile"

(You didn’t think I wouldn’t redact that, did you? Note that I made up a unique id as well.)

Now we can call our stream function and get the results:

entries <- feedly_stream(sprintf("user/%s/tag/global.saved", me$id))

str(entries$items, 1)
# output not shown as you don't really need to see what I've Saved for Later

The structure is the same as in the previous post.

Now, you can go to town and programmatically access your Feedly “Saved for Later” entries.
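If you have more saved items than a single call returns, the continuation token pages through them. A sketch, assuming the same continuation behaviour described in the previous post:

saved_id <- sprintf("user/%s/tag/global.saved", me$id)

res <- feedly_stream(saved_id)
items <- list(res$items)

while (!is.null(res$continuation)) { # absent once the last page is reached
  res <- feedly_stream(saved_id, continuation = res$continuation)
  items[[length(items) + 1]] <- res$items
}

saved_df <- bind_rows(items) # may need jsonlite's flatten=TRUE upstream for clean binding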

You can also find more “Resource Ids” and “Global Resource Ids” formats on the API Overview page.

Since I just railed against Congress for being a bit two-faced about privacy I thought some rud.is site disclosure would be in order.

At present, third-party tracking is limited to:

  • Something in my WordPress configuration adding a DNS pre-fetch for fonts.googleapis.com. There are a few other DNS pre-fetches that I’m also going to try to eradicate (but that aren’t showing up in my uBlock Origin report, likely due to /etc/hosts blocks);
  • Gravatar (which displays avatars near comment author names). I’m torn on this one but Gravatar is owned by Automattic (who owns WordPress). See the next bullet on that;
  • WordPress. Vain site stats tracking, Jetpack uptime warnings and some other WordPress pings happen (including some automatic short-linking), as well as the previous bullet’s bits. I’m not likely going to do the site surgery necessary to stop this, but you have full disclosure and can easily avoid pings to those sites via uBlock Origin site-specific rules;
  • SendPulse; I’m running an experiment on user behaviours when it comes to authorizing web notifications (and I just kinda ruined said experiment). I’ll be disabling it later this year (after a full year of it being on so I can have more than just a few sentences to say).

The above came from an in-browser uBlock Origin report.

I ran a splashr::render_har() — which is how I measured things for the Congressional privacy post — on one of my pages and this is the result:

            tld  n
1        rud.is 67
2        wp.com 21
3  gravatar.com  6
4 wordpress.com  3
5         w.org  3
6 sendpulse.com  2
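If you want to reproduce that table for your own site, the recipe is roughly this (a sketch: it assumes a running Splash instance for splashr and the standard HAR entry structure, and leans on urltools for apex-domain munging):

library(splashr)  # requires a running Splash (docker) instance
library(urltools)
library(tidyverse)

har <- render_har(splash_local, "https://rud.is/b/")

# pull every requested URL out of the HAR log and tally apex domains
data_frame(url = map_chr(har$log$entries, ~.x$request$url)) %>%
  mutate(host = domain(url)) %>%
  mutate(tld = {
    sx <- suffix_extract(host)
    sprintf("%s.%s", sx$domain, sx$suffix)
  }) %>%
  count(tld, sort = TRUE)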

Props on WordPress capturing w.org! I’m still ticked Microsoft stole bob.com from me ages ago.

As you can see, most resources load from my site and none come from Twitter, Facebook or Google Plus.

I run WordPress for a ton of reasons too long to go into for this post, so I’m likely not going to change anything about that list (apart from the DNS pre-fetching).

Hopefully that will abate any concerns visitors might have, especially after reading the post about Congress.