Skip navigation

It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hie, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also means that I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. However, it further means that when support for a new file type is added, I get that same functionality without any extra effort.

Why am I calling this out?

Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0 and one feature they’ve added is the ability to query individual and entire directories full of PCAP files from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR on them is that it’s an optimized binary file format for recording network activity. If you’re on macOS or a linux-ish system go do something like this:

sudo tcpdump -ni en0 -s0 -w capture01.pcap

And, wait a bit.

NOTE: Some of you may have to change the en0 to your main network interface name (a quick google for that for your platform should get you to the right one to use).

That command will passively record all network activity on your system until you ctrl-c it. The longer it goes the larger it gets.

When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump or Wireshark and there are many C[++] libraries and other utilities that can read them. You can even convert them to CSV or XML, but the PCAP itself requires custom tools to work with them effectively. I had started creating crafter to work with these files but my use case/project dried up and haven’t gone back to it.

Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:

library(sergeant)
library(iptools)
library(tidyverse)
library(cymruservices)

db <- src_drill("localhost")

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip          <chr> "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port        <int> 60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session     <dbl> -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length   <int> 129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data            <chr> "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address <chr> "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port        <int> 443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type            <chr> "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip          <chr> "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address <chr> "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp       <dttm> 2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...

summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>% 
  collect() %>% 
  summarise(max - min)
## # A tibble: 1 x 1
##     `max - min`
##          <time>
## 1 1.924583 mins

count(my_pcaps, type)
## # Source:   lazy query [?? x 2]
## # Database: DrillConnection
##    type     n
##   <chr> <int>
## 1   TCP  4974
## 2   UDP   774

filter(my_pcaps, type=="TCP") %>% 
  count(dst_port, sort=TRUE)
## # Source:     lazy query [?? x 2]
## # Database:   DrillConnection
## # Ordered by: desc(n)
##    dst_port     n
##       <int> <int>
##  1      443  2580
##  2    56202   476
##  3    56229   226
##  4    56147   169
##  5    56215   103
##  6    56143    94
##  7    56085    89
##  8    56203    56
##  9    56205    39
## 10    56209    39
## # ... with more rows

filter(my_pcaps, type=="TCP") %>% 
  count(dst_ip, sort=TRUE) %>% 
  collect() -> dst_ips

filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip"="ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all=TRUE), c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
##            dst_ip     n                              as_name
##             <chr> <int>                                <chr>
##  1   104.244.42.2   862           TWITTER - Twitter Inc., US
##  2 104.244.46.103   556           TWITTER - Twitter Inc., US
##  3  104.20.60.241   183 CLOUDFLARENET - CloudFlare, Inc., US
##  4     31.13.80.8   160        FACEBOOK - Facebook, Inc., US
##  5  52.218.160.76   100     AMAZON-02 - Amazon.com, Inc., US
##  6  104.20.59.241    79 CLOUDFLARENET - CloudFlare, Inc., US
##  7  52.218.160.92    66     AMAZON-02 - Amazon.com, Inc., US
##  8  199.16.156.81    58           TWITTER - Twitter Inc., US
##  9 104.244.42.193    47           TWITTER - Twitter Inc., US
## 10  52.86.113.212    42    AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows

No custom R code. No modification to the sergeant package. Just query it like any other data source.

One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.

I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out) and, I setup a directory that defaults to the pcap type. But after that, it just works.

Well, kinda.

The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And, my sergeant package (for now) can’t do much with the data component (neither can Drill-proper, either). But, it’s a great start and I can use it to do bulk parquet file creation of basic protocols & connection information or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.

Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked. Some fixes may have to wait for 1.12.0. And, much work needs to be done on the UDF-side and sergeant side to help make the data element more useful.

Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool and much thanks goes out to the Drill dev team for sneaking this in to 1.11.0.

If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away into a javascript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there vs re-scrape Wikipedia (like Matt did) thanks to the V8 package by @opencpu.

The following code:

  • grabs the vis iframe
  • extracts and evaluates the target javascript to get a nice data frame
  • performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
  • plots the distributions using the beeswarm quasirandom alogrithm
library(V8)
library(rvest)
library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

pg <- read_html("http://mattstiles.org/dailygraphics/graphics/attorney-general-tenure-20172517/child.html?initialWidth=840&childId=pym_0&parentTitle=Chart%3A%20If%20Ousted%2C%20Jeff%20Sessions%20Would%20Have%20a%20Historically%20Short%20Tenure%20%7C%20The%20Daily%20Viz&parentUrl=http%3A%2F%2Fthedailyviz.com%2F2017%2F07%2F25%2Fchart-if-ousted-jeff-sessions-would-have-a-historically-short-tenure%2F")

ctx <- v8()
ctx$eval(html_nodes(pg, xpath=".//script[contains(., 'DATA')]") %>% html_text())

ctx$get("DATA") %>% 
  as_tibble() %>% 
  readr::type_convert() %>% 
  mutate(party = ifelse(is.na(party), "Other", party)) %>% 
  mutate(party = fct_lump(party)) %>% 
  mutate(color1 = case_when(
    party == "Democratic" ~ "#313695",
    party == "Republican" ~ "#a50026",
    party == "Other" ~ "#4d4d4d")
  ) %>% 
  mutate(color2 = ifelse(grepl("Sessions", label), "#2b2b2b", "#00000000")) -> ags

ggplot() + 
  geom_quasirandom(data = ags, aes(party, amt, color = color1)) +
  geom_quasirandom(data = ags, aes(party, amt, color = color2), 
                   fill = "#ffffff00", size = 4, stroke = 0.25, shape = 21) +
  geom_text(data = data_frame(), aes(x = "Republican", y = 100, label = "Jeff Sessions"), 
            nudge_x = -0.15, family = font_rc, size = 3, hjust = 1) +
  scale_color_identity() +
  scale_y_comma(limits = c(0, 4200)) +
  labs(x = "Party", y = "Tenure (days)", 
       title = "U.S. Attorneys General",
       subtitle = "Distribution of tenure in office, by days & party: 1789-2017",
       caption = "Source data/idea: Matt Stiles <bit.ly/2vXAHTM>") +
  theme_ipsum_rc(grid = "XY")

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.

I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon.

sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill and if you have Drill UDFs that don’t work “out of the box” with sergeant‘s dplyr interface, file an issue and I’ll make a special one for it in the package.

I’ve written about drill on the blog before so check out those posts for some history and stay tuned for more examples. The README should get you started using sergeant and/or Drill (if you aren’t running Drill now, take a look and you’ll likely get hooked).

I’d like to take a moment to call out special thanks to Edward Visel for bootstrapping the dbplyr update to sergeant when the dplyr/dbplyr interfaces split. It saved me loads of time and really helped the progress of this package move faster towards a CRAN release.

I couldn’t let this stand unchallenged:

Rasmussen makes their Presidential polling data available for both ? & O. Why not compare their ratings from day 1 in office (skipping days that Rasmussen doesn’t poll)?


library(hrbrthemes)
library(rvest)
library(tidyverse)

list(
  Obama="http://m.rasmussenreports.com/public_content/politics/obama_administration/obama_approval_index_history",
  Trump="http://m.rasmussenreports.com/public_content/politics/trump_administration/trump_approval_index_history"
) %>%
  map_df(~{
    read_html(.x) %>%
      html_table() %>%
      .[[1]] %>%
      tbl_df() %>%
      select(date=Date, approve=`Total Approve`, disapprove=`Total Disapprove`)
  }, .id="who") -> ratings

mutate_at(ratings, c("approve", "disapprove"), function(x) as.numeric(gsub("%", "", x, fixed=TRUE))/100) %>%
  mutate(date = lubridate::dmy(date)) %>%
  filter(!is.na(approve)) %>%
  group_by(who) %>%
  arrange(date) %>%
  mutate(dnum = 1:n()) %>%
  ungroup() %>%
  ggplot(aes(dnum, approve, color=who)) +
  geom_hline(yintercept = 0.5, size=0.5) +
  geom_point(size=0.25) +
  scale_y_percent(limits=c(0,1)) +
  scale_color_manual(name=NULL, values=c("Obama"="#313695", "Trump"="#a50026")) +
  labs(x="Day in office", y="Approval Rating",
       title="Presidential approval ratings from day 1 in office",
       subtitle="For fairness, data was taken solely from Trump's favorite polling site (Rasmussen)",
       caption="Data Source: <rasmussenreports.com>\nCode: <https://gist.github.com/hrbrmstr/a7310e1b64d0797401d01d0c6195bd8b>") +
  theme_ipsum_rc(grid="XY", base_size = 16) +
  theme(legend.direction = "horizontal") +
  theme(legend.position=c(0.8, 1.05))

I’ll make a new post occasionally throughout ?’s term.

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.

If you have enough memory and wanted to work with “flattened” versions of the files in R you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using) and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.

The main reasons for replicating their Yelp example was to both have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then use dplyr verbs to manipulate.

The full tutorial replication is at https://rud.is/rpubs/yelp.html but also iframe’d below.

I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap.

The project goal is noble: crowdsource and make a repository of open speech data for researchers to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words.

As I meandered over the blog post I looked in horror on the URL for the application that did the recording: https://open-speech-commands.appspot.com/.

Why would the goal of the project combined with that URL give pause? Read on!

You’ve Got Scams!

Picking up the phone and saying something as simple as ‘Yes’ has been a major scam this year. By recording your voice, attackers can replay it on phone prompts and because it’s your voice it makes it harder to refute the evidence and can foil recognition systems that look for your actual voice.

As the chart above shows, the Better Business Bureau has logged over 5,000 of these scams this year (searching for ‘phishing’ and ‘yes’). You can play with the data (a bit — the package needs work) in R with scamtracker.

Now, these are “analog” attacks (i.e. a human spends time socially engineering a human). Bookmark this as you peruse section 2.

Integrity Challenges in 2017

I “trust” Pete’s intentions, but I sure don’t trust open-speech-commands.appspot.com (and, you shouldn’t either). Why? Go visit https://totally-harmless-app.appspot.com. It’s a Google App Engine app I made for this post. Anyone can make an appspot app and the https is meaningless as far as integrity & authenticity goes since I’m running on google’s infrastructure but I’m not google.

You can’t really trust most SSL/TLS sessions as far as site integrity goes anyway. Let’s Encrypt put the final nail in the coffin with their Certs Gone Wild! initiative. With super-recent browser updates you can almost trust your eyes again when it comes to URLs, but you should be very wary of entering your info — especially uploading voice, prints or eye/face images — into any input box on any site if you aren’t 100% sure it’s a legit site that you trust.

Tracking the Trackers

If you don’t know that you’re being tracked 100% of the time on the internet then you really need to read up on the modern internet.

In many cases your IP address can directly identify you. In most cases your device & browser profile — which most commercial sites log — can directly identify you. So, just visiting a web site means that it’s highly likely that web site can know that you are both not a dog and are in fact you.

Still Waiting for the “So, What?”

Many states and municipalities have engaged in awareness campaigns to warn citizens about the “Say ‘Yes'” scam. Asking someone to record themselves saying ‘Yes’ into a random web site pretty much negates that advice.

Folks like me regularly warn about trust on the internet. I could have cloned the functionality of the original site to open-speech-commmands.appspot.com. Did you even catch the 3rd ‘m’ there? Even without that, it’s an appspot.com domain. Anyone can set one up.

Even if the site doesn’t ask for your name or other info and just asks for your ‘Yes’, it can know who you are. In fact, when you’re enabling the microphone to do the recording, it could even take a picture of you if it wanted to (and you’d likely not know or not object since it’s for SCIENCE!).

So, in the worst case scenario a malicious entity could be asking you for your ‘Yes’, tying it right to you and then executing the post-scam attacks that were being performed in the analog version.

But, go so far as to assume this is a legit site with good intentions. Do you really know what’s being logged when you commit your voice info? If the data was mishandled, it would be just as easy to tie the voice files back to you (assuming a certain level of data logging).

The “so what” is not really a warning to users but a message to researchers: You need to threat model your experiments and research initiatives, especially when innocent end users are potentially being put at risk. Data is the new gold, diamonds and other precious bits that attackers are after. You may think you’re not putting folks at risk and aren’t even a hacker target, but how you design data gathering can reinforce good or bad behaviour on the part of users. It can solidify solid security messages or tear them down. And, you and your data may be more of a target than you really know.

Reach out to interdisciplinary colleagues to help threat model your data collection, storage and dissemination methods to ensure you aren’t putting yourself or others at risk.

FIN

Pete did the right thing:

and, I’m sure the site will be on a “proper” domain soon. When it is, I’ll be one of the first in line to help make a much-needed open data set for research purposes.

I caught the 0.7.0 release of dplyr on my home CRAN server early Friday morning and immediately set out to install it since I’m eager to finish up my sergeant package and get it on CRAN. “Tidyverse” upgrades aren’t trivial for me as I tinker quite a bit with the tidyverse and create packages that depend on various components. The sergeant package provides — amongst other things — a dplyr back-end for Apache Drill, so it has more tidyverse tendrils than other bits of code I maintain.

macOS binaries weren’t available yet (it generally takes 24-48 hrs for that) so I did an install.packages("dplyr", type="source") and was immediately hit with gcc 7 compilation errors. This seemed odd, but switching back to clang worked fine.

I, then, proceeded to run chunks in an Rmd I’m working on and hit “Encoding” errors on mutate() calls. Not having time to debug further I reverted to 0.5.0 of dplyr and went about my day and promised the tidyverse maintainers that I’d work on a reproducible example after work.

I made R data files from the data frames that were tossing errors and extracted & tweaked a code snippet that consistently generated the error and created a rocker container on one of my linux boxes to validate that this was an error and a cross-platform one. The rocker container used a full fresh-from-source copy of the tidyverse including dplyr 0.7.0. The code worked and no error was generated, so I immediately suspected package rot on my main dev macOS box.

Now, my situation is complicated by an insanely hasty migration to macOS 10.13β1 (I refuse to use the Apple macOS catchy names anymore since the most recent one is just silly) and a move to the gcc 7 toolchain (initially prompted to both get rJava working nicely and reproduce some CRAN noted errors with some packages). Further complications were also created by many invocations of install_github() of various packages regularly overwriting bits of the tidyverse over the past few weeks since the R 3.4.0 release. In other words, the integrity of the “tidyverse” was in serious question on my system and it was time for the Clean Slate Protocol.

Rather than itemize package versions and surgically nipping and tucking, I opted to use packrat to get to my desired end-state of a full-integrity tidyverse install. There are many ways to do this. Feel free to “one-up” me and show your l33t method in the comments. This one will likely be accessible to most — if not all — R users.

I started a new RStudio project in a new session and told it to use packrat. In the new project console, I did install.packages("tidyverse", type="source") and let it go for many minutes. I, then, navigated to the packrat subdirectory where the 3.4 package binaries are housed (just follow the project packrat tree down to the R version directory) and moved all 51 packages (yes, 51 O_o) to the main R library path (which you can figure out by running .libPaths() in any non-packrat-maintained project).

After doing that, I fired up the originally failing Rmd and everything worked fine. ?

I don’t do the Clean Slate Protocol too often (we all get to for new R dot-releases) but it came in handy this time. If you run into errors when trying to get the new dplyr working, you may benefit from the Clean Slate Protocol as well.

If you haven’t seen the changes in 0.6.0/0.7.0 you should check them out and give it a go.

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that contain all the Radio Shack impending store closings. It’s the most comprehensive list I’ve found, but the format is terrible and there’s no easy, in-browser way to download them all.

CNN has ToS that prevent automated data gathering from CNN-proper. But, they used Adobe Document Cloud for these images which has no similar restrictions from a quick glance at their ToS. That means you get an R⁶ post on how to grab the individual 38 images and combine them into one PDF. I did this all with the hopes of OCRing the text, which has not panned out too well since the image quality and font was likely deliberately set to make it hard to do precisely what I’m trying to do.

If you work through the example, you’ll get a feel for:

  • using sprintf() to take a template and build a vector of URLs
  • use dplyr progress bars
  • customize httr verb options to ensure you can get to content
  • use purrr to iterate through a process of turning raw image bytes into image content (via magick) and turn a list of images into a PDF
library(httr)
library(magick)
library(tidyverse)

url_template <- "https://assets.documentcloud.org/documents/1657793/pages/radioshack-convert-p%s-large.gif"

pb <- progress_estimated(38)

sprintf(url_template, 1:38) %>% 
  map(~{
    pb$tick()$print()
    GET(url = .x, 
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8", 
          referer = "http://money.cnn.com/interactive/technology/radio-shack-closure-list/index.html", 
          authority = "assets.documentcloud.org"))    
  }) -> store_list_pages

map(store_list_pages, content) %>% 
  map(image_read) %>% 
  reduce(image_join) %>% 
  image_write("combined_pages.pdf", format = "pdf")

I figured out the Document Cloud links and necessary httr::GET() options by using Chrome Developer Tools and my curlconverter package.

If any academic-y folks have a test subjectsummer intern with a free hour and would be willing to have them transcribe this list and stick it on GitHub, you’d have my eternal thanks.