It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes access to diverse data sources down to ANSI SQL calls, which means I can pull data from parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. It also means I get access to all those platforms in R centrally through the sergeant package that rests atop d[b]plyr. And it means that when support for a new file type is added, I get that same functionality without any extra effort.
Why am I calling this out?
Well, the intrepid Drill developers are in the process of finalizing the release candidate for version 1.11.0, and one feature they’ve added is the ability to query individual PCAP files, and entire directories full of them, from within Drill. While I provided a link to the Wikipedia article on PCAP files, the TL;DR is that PCAP is an optimized binary file format for recording network activity. If you’re on macOS or a linux-ish system, go do something like this:
sudo tcpdump -ni en0 -s0 -w capture01.pcap
And, wait a bit.
NOTE: Some of you may have to change en0 to your main network interface name (a quick google for your platform should get you to the right one, or see the snippet below).
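tcpdump itself can also list the interfaces it is able to capture from, which helps pick the right name:
sudo tcpdump -D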
That command will passively record all network activity on your system until you ctrl-c it. The longer it runs, the larger the capture file gets.
When you’ve recorded a minute or two of packets, ctrl-c the program and then try to look at the PCAP file. It’s a binary mess. You can re-read it with tcpdump (a quick example follows) or Wireshark, and there are many C[++] libraries and other utilities that can read them. You can even convert them to CSV or XML, but the PCAP itself requires custom tools to work with effectively. I had started creating crafter to work with these files, but my use case/project dried up and I haven’t gone back to it.
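For that quick, human-readable peek at what you captured, you can have tcpdump read the file back instead of an interface (using the filename from the capture above):
sudo tcpdump -nr capture01.pcap | head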
Adding the capability into Drill means I don’t really have to work any further on that specialized package as I can do this:
library(sergeant)      # Drill <-> dplyr
library(iptools)       # IP address helpers
library(tidyverse)
library(cymruservices) # Team Cymru bulk origin/ASN lookups

db <- src_drill("localhost")  # connect to the local drillbit

my_pcaps <- tbl(db, "dfs.caps.`/capture02.pcap`")  # the PCAP file is now "just a table"

glimpse(my_pcaps)
## Observations: 25
## Variables: 12
## $ src_ip <chr> "192.168.10.100", "54.159.166.81", "192.168.10...
## $ src_port <int> 60025, 443, 60025, 443, 60025, 58976, 443, 535...
## $ tcp_session <dbl> -2.082796e+17, -2.082796e+17, -2.082796e+17, -...
## $ packet_length <int> 129, 129, 66, 703, 66, 65, 75, 364, 65, 65, 75...
## $ data <chr> "...g9B..c.<..O..@=,0R.`........K..EzYd=.........
## $ src_mac_address <chr> "78:4F:43:77:02:00", "D4:8C:B5:C9:6C:1B", "78:...
## $ dst_port <int> 443, 60025, 443, 60025, 443, 443, 58976, 5353,...
## $ type <chr> "TCP", "TCP", "TCP", "TCP", "TCP", "UDP", "UDP...
## $ dst_ip <chr> "54.159.166.81", "192.168.10.100", "54.159.166...
## $ dst_mac_address <chr> "D4:8C:B5:C9:6C:1B", "78:4F:43:77:02:00", "D4:...
## $ network <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ timestamp <dttm> 2017-07-27 23:54:58, 2017-07-27 23:54:59, 201...
summarise(my_pcaps, max = max(timestamp), min = min(timestamp)) %>%
collect() %>%
summarise(max - min)
## # A tibble: 1 x 1
## `max - min`
## <time>
## 1 1.924583 mins
count(my_pcaps, type)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
## type n
## <chr> <int>
## 1 TCP 4974
## 2 UDP 774
filter(my_pcaps, type=="TCP") %>%
count(dst_port, sort=TRUE)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
## # Ordered by: desc(n)
## dst_port n
## <int> <int>
## 1 443 2580
## 2 56202 476
## 3 56229 226
## 4 56147 169
## 5 56215 103
## 6 56143 94
## 7 56085 89
## 8 56203 56
## 9 56205 39
## 10 56209 39
## # ... with more rows
# pull the TCP destination counts back into R
filter(my_pcaps, type == "TCP") %>%
  count(dst_ip, sort = TRUE) %>%
  collect() -> dst_ips

# drop NA & RFC 1918 (internal) destinations, then annotate
# the rest with origin ASN info from Team Cymru
filter(dst_ips, !is.na(dst_ip)) %>%
  left_join(ips_in_cidrs(.$dst_ip, c("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")),
            by = c("dst_ip" = "ips")) %>%
  filter(!in_cidr) %>%
  left_join(distinct(bulk_origin(.$dst_ip), ip, .keep_all = TRUE), c("dst_ip" = "ip")) %>%
  select(dst_ip, n, as_name)
## # A tibble: 37 x 3
## dst_ip n as_name
## <chr> <int> <chr>
## 1 104.244.42.2 862 TWITTER - Twitter Inc., US
## 2 104.244.46.103 556 TWITTER - Twitter Inc., US
## 3 104.20.60.241 183 CLOUDFLARENET - CloudFlare, Inc., US
## 4 31.13.80.8 160 FACEBOOK - Facebook, Inc., US
## 5 52.218.160.76 100 AMAZON-02 - Amazon.com, Inc., US
## 6 104.20.59.241 79 CLOUDFLARENET - CloudFlare, Inc., US
## 7 52.218.160.92 66 AMAZON-02 - Amazon.com, Inc., US
## 8 199.16.156.81 58 TWITTER - Twitter Inc., US
## 9 104.244.42.193 47 TWITTER - Twitter Inc., US
## 10 52.86.113.212 42 AMAZON-AES - Amazon.com, Inc., US
## # ... with 27 more rows
No custom R code. No modification to the sergeant package. Just query it like any other data source.
One really cool part of this is that — while similar functionality has been available in various Hadoop contexts for a few years — we’re doing this query from a local file system outside of a Hadoop context.
I had to add "pcap": { "type": "pcap" } to the formats section of the dfs storage configuration (#ty to the Drill community for helping me figure that out) and set up a directory that defaults to the pcap type; a sketch of that config is below. But after that, it just works.
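For reference, the relevant chunk of that dfs storage plugin config ends up looking something like this (the caps workspace name matches the query above, but the path here is just for illustration; your dfs config will differ):
"formats": {
  "pcap": { "type": "pcap" }
},
"workspaces": {
  "caps": {
    "location": "/path/to/pcaps",
    "writable": false,
    "defaultInputFormat": "pcap"
  }
}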
Well, kinda.
The Java code that the plugin is based on doesn’t like busted PCAP files (which we get quite a bit of in infosec- & honeypot-lands) and it seems to bork on IPv6 packets a bit. And my sergeant package (for now) can’t do much with the data component (neither can Drill proper). But it’s a great start: I can use it to do bulk parquet file creation of basic protocol & connection information or take a quick look at some honeypot captures whenever I need to, right from R, without converting them first.
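That bulk conversion can be little more than a CTAS statement handed to Drill over its REST interface (a sketch, assuming dfs.tmp is a writable workspace and using a made-up output table name; Drill’s default store.format writes parquet):
drill_connection("localhost") %>%
  drill_query("
    CREATE TABLE dfs.tmp.`capture02_summary` AS
    SELECT `timestamp`, src_ip, src_port, dst_ip, dst_port, `type`, packet_length
    FROM dfs.caps.`/capture02.pcap`
  ")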
Drill 1.11.0 is only at RC0 right now, so some of these issues may be gone by the time the full release is baked, and some fixes may have to wait for 1.12.0. Much work also needs to be done on the UDF side and in the sergeant package to make the data element more useful.
Even with the issues and limitations, this is an amazing new feature that’s been added to an incredibly useful tool, and much thanks goes out to the Drill dev team for sneaking this into 1.11.0.
If you have cause to work with PCAP files, give this a go and see if it helps speed up parts of your workflow.
6 Comments
Seeing all your drill + sergeant posts, I’ve been meaning to ask.
Does drill have the capability to read in a user specified subset of rows?
I figure dbplyr + sergeant (with this row-subset feature) + collect() would be a nice in-memory bean-counting feature for data analysts who are ‘prototyping’ their script on a small sample of the larger data file.
What would the SQL look like for that? I’ll give some row_number() queries a go in the AM to see if that works & matches the idiom you’re looking for.

1) I’ll skip the what-would-the-SQL-look-like question, since I’m hoping that dplyr handles that SQL translation for me:
https://rdrr.io/cran/dplyr/man/slice.html
but yes, I think it would involve dplyr::row_number()
2) I’m hoping the dbplyr step handles connecting to a spreadsheet (Drill) / database (MariaDB) and keeps the data ‘out-of-memory’ for me
https://stackoverflow.com/questions/33658711/selecting-a-subset-of-a-sqlite-database-with-dplyr
note: I know that dplyr has trouble with dplyr::row_number() when connecting to SQLite
3) Then, when I hit collect(), that’s when I hope to bring the ‘pre-specified subsampled rows’ efficiently into memory.
The reason to do this: the user can then interactively prototype/explore the columns of the pre-specified sub-sampled rows in-memory.
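For what it’s worth, a minimal sketch of that idiom with the my_pcaps tbl from above, using head() rather than row_number() (dbplyr should translate head() into a LIMIT clause, so the subsetting stays in Drill and only the sampled rows come back at collect() time):
my_pcaps %>%
  head(1000) %>%   # becomes LIMIT 1000 in the generated SQL
  collect() -> pcap_sample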
Thanks Bob, this is really interesting (although I don’t quite follow all the technical details). Would it be possible to use sergeant to query Hive tables on a remote cluster?
Pretty sure you’re looking for this, Jacob: https://drill.apache.org/docs/hive-storage-plugin/#connect-drill-to-the-hive-remote-metastore-directly
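Once that plugin is configured, the Hive tables should show up as just another schema to query from R (a rough sketch; the database/table names here are made up and assume the hive storage plugin is pointed at the remote metastore per that link):
my_hive_tbl <- tbl(db, "hive.`default`.`my_table`")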
Thanks Bob, that’s great!