Chapter 3 Turning PCAPs into Data
Technically, PCAP files are, in fact, data files. We could try to operate on them directly with some hacky R packages this author might have made, or via the {reticulate} R package, which makes the Python universe accessible directly in R (and said universe has many modules to work with PCAPs). However, zeek and tshark do a significant amount of heavy lifting so we do not have to, and in a “real life” incident response situation, time can be paramount.
3.1 PCAP Metadata
We can get an overview of the PCAP file contents with the capinfos utility that comes along for the ride with tshark:
cat(system("capinfos maze/maze.pcapng", intern=TRUE), sep="\n")
## File name: maze/maze.pcapng
## File type: Wireshark/... - pcapng
## File encapsulation: Ethernet
## File timestamp precision: nanoseconds (9)
## Packet size limit: file hdr: (not set)
## Number of packets: 45k
## File size: 38MB
## Data size: 36MB
## Capture duration: 450.466647147 seconds
## First packet time: 2021-04-29 21:00:51.031550031
## Last packet time: 2021-04-29 21:08:21.498197178
## Data byte rate: 82kBps
## Data bit rate: 656kbps
## Average packet size: 821.02 bytes
## Average packet rate: 99 packets/s
## SHA256: 3cc2061959afb116aeedce2736809f28236b96e20b89b4199194f4a30a0802ba
## RIPEMD160: e278720f93a6d58b45d1614877eb92b91ef0a684
## SHA1: 0a40bcf2c2329ddf72bd864f05855314cb76514d
## Strict time order: False
## Capture hardware: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz (with SSE4.2)
## Capture oper-sys: Linux 5.4.0-72-generic
## Capture application: Dumpcap (Wireshark) 3.2.3 (Git v3.2.3 packaged as 3.2.3-1)
## Number of interfaces in file: 1
## Interface #0 info:
## Name = wlo1
## Encapsulation = Ethernet (1 - ether)
## Capture length = 262144
## Time precision = nanoseconds (9)
## Time ticks per second = 1000000000
## Time resolution = 0x09
## Operating system = Linux 5.4.0-72-generic
## Number of stat entries = 1
## Number of packets = 45024
For ~8 minutes of activity we have a 38MB PCAP file with ~45 thousand packets captured in April. “Real life” PCAPs of target user activity would likely not be so diminutive. Thankfully, the maze runners were kind to us.
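If these values are needed programmatically rather than just read off the screen, the "Key: value" lines can be split into a small data frame. The following is only a sketch: `parse_capinfos()` is a hypothetical helper (not part of any package), the sample lines are copied from the output above, and the exact column spacing that capinfos uses is an assumption that `trimws()` is meant to absorb.

```r
# Hypothetical helper: split capinfos "Key: value" lines on the FIRST colon only,
# since values such as timestamps contain colons themselves.
parse_capinfos <- function(lines) {
  # keep only lines with a colon (drops interface lines like "Name = wlo1")
  lines <- lines[grepl(":", lines, fixed = TRUE)]
  key   <- trimws(sub(":.*$", "", lines))      # text before the first colon
  value <- trimws(sub("^[^:]*:", "", lines))   # text after the first colon
  data.frame(key = key, value = value, stringsAsFactors = FALSE)
}

# A few lines taken from the capinfos output shown above
sample_lines <- c(
  "File name:           maze/maze.pcapng",
  "Number of packets:   45k",
  "First packet time:   2021-04-29 21:00:51.031550031",
  "Interface #0 info:",
  "                     Name = wlo1"
)

info <- parse_capinfos(sample_lines)
info$value[info$key == "First packet time"]
```

In practice the input would be the character vector returned by `system("capinfos maze/maze.pcapng", intern=TRUE)`.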
3.2 Processing PCAPs with Zeek
We’ll first generate a series of standard Zeek “log” files that are packet-capture feature-specific structured files. We’ve enabled the mac-logging rules so certain log files will also contain MAC addresses of the nodes (since some questions ask about those).
wd <- getwd() # remember the starting directory
setwd("maze")
system("/opt/zeek/bin/zeek --no-checksums --readfile maze.pcapng policy/protocols/conn/mac-logging")
We can see if that worked by getting a directory listing:
list.files() # we're still in the `maze` directory
## [1] "conn.log" "dns.log" "files.log"
## [4] "ftp-q.json" "ftp-q1.json" "ftp.log"
## [7] "hosts.txt" "http.log" "maze.pcapng"
## [10] "maze.txt" "packet_filter.log" "proton-q.json"
## [13] "ssl.log" "tunnel.log" "weird.log"
## [16] "x509.log"
Each log file has different information based upon what was contained in the PCAP. For our example, these are the logs that were generated. Follow the links to learn more about what is in each of them.
- conn.log: TCP/UDP/ICMP connections
- dns.log: DNS activity
- files.log: File analysis results
- ftp.log: FTP activity
- http.log: HTTP requests and replies
- packet_filter.log: List of packet filters that were applied
- ssl.log: SSL/TLS handshake info
- tunnel.log: Tunneling protocol events
- weird.log: Unexpected network-level activity
- x509.log: X.509 certificate info
3.2.1 Zeek Log File Helper Function
Zeek logs are well-structured files that, by default, have a very informative header:
read_lines("conn.log", n_max = 9) # still in the `maze` dir
## [1] "#separator \\x09"
## [2] "#set_separator\t,"
## [3] "#empty_field\t(empty)"
## [4] "#unset_field\t-"
## [5] "#path\tconn"
## [6] "#open\t2021-07-19-16-04-28"
## [7] "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\tservice\tduration\torig_bytes\tresp_bytes\tconn_state\tlocal_orig\tlocal_resp\tmissed_bytes\thistory\torig_pkts\torig_ip_bytes\tresp_pkts\tresp_ip_bytes\ttunnel_parents\torig_l2_addr\tresp_l2_addr"
## [8] "#types\ttime\tstring\taddr\tport\taddr\tport\tenum\tstring\tinterval\tcount\tcount\tstring\tbool\tbool\tcount\tstring\tcount\tcount\tcount\tcount\tset[string]\tstring\tstring"
## [9] "1619744451.035623\tC27owk3xZcWuCyTBL7\t192.168.1.26\t51754\t173.223.18.66\t80\ttcp\t-\t0.000607\t0\t0\tSF\t-\t-\t0\t^fFa\t1\t52\t2\t104\t-\tc8:09:a8:57:47:93\tca:0b:ad:ad:20:ba"
As such, having a small helper function to deal with assigning valid column names and skipping past the header will be helpful:
# needs readr and dplyr (both loaded by library(tidyverse))
read_zeek_log <- function(path) {
  # get column names from the "#fields" header line (line 7)
  read_lines(path[1], n_max = 7) %>%
    last() %>%
    strsplit("\t") %>%
    unlist() %>%
    tail(-1) -> cols # drop the leading "#fields" token
  # read the data, skipping the "#"-prefixed header lines
  suppressMessages(
    read_tsv(
      file = path[1],
      col_names = cols,
      comment = "#"
    )
  )
}
A more robust version would examine all the header parameters and use them accordingly, but we don’t need something that sophisticated to deal with this packet maze challenge.
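For the curious, here is a rough sketch of what such a header-aware reader might look like. It is not the helper used in the rest of this book: `read_zeek_log_hdr()` is a hypothetical name, it uses base R only, and it honors just the `#separator`, `#fields`, `#unset_field` and `#empty_field` header lines (the `#types` line is ignored and column types are guessed by `read.table()`).

```r
# Hypothetical header-aware Zeek log reader (sketch; base R only)
read_zeek_log_hdr <- function(path) {
  lines  <- readLines(path[1], warn = FALSE)
  is_hdr <- startsWith(lines, "#")
  hdr    <- lines[is_hdr]

  # "#separator \x09" is space-delimited and hex-escaped; fall back to a tab
  sep <- "\t"
  sep_line <- hdr[startsWith(hdr, "#separator")]
  if (length(sep_line) > 0) {
    hex <- sub("^#separator[[:space:]]+\\\\x", "", sep_line[1])
    sep <- rawToChar(as.raw(strtoi(hex, 16L)))
  }

  # return the values of a "#name<sep>value..." header line (empty if absent)
  hdr_value <- function(name) {
    hit <- hdr[startsWith(hdr, paste0("#", name, sep))]
    if (length(hit) == 0) return(character())
    strsplit(hit[1], sep, fixed = TRUE)[[1]][-1]
  }

  read.table(
    text = lines[!is_hdr],                 # data rows only ("#close" is dropped too)
    sep = sep,
    col.names = hdr_value("fields"),
    na.strings = c(hdr_value("unset_field"), hdr_value("empty_field")),
    quote = "", comment.char = "",         # keep any quote or "#" inside a field
    stringsAsFactors = FALSE
  )
}

# Tiny made-up log, modeled on the conn.log header shown above
tmp <- tempfile(fileext = ".log")
writeLines(c(
  "#separator \\x09",
  "#set_separator\t,",
  "#empty_field\t(empty)",
  "#unset_field\t-",
  "#path\tconn",
  "#fields\tts\tuid\tservice",
  "#types\ttime\tstring\tstring",
  "1619744451.035623\tC27owk3xZcWuCyTBL7\t-",
  "#close\t2021-07-19-16-04-28"
), tmp)

conn_sample <- read_zeek_log_hdr(tmp)
str(conn_sample)
```

One difference from the simpler helper: header lines are filtered by prefix rather than with a comment character, so a `#` appearing inside a data field is left alone, and unset (`-`) and empty (`(empty)`) fields come back as `NA`.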
3.3 Processing PCAPs with tshark
Zeek is great, but some questions ask about packet numbers (and, it’s often helpful to have packet-level information available in general). For this, we’ll turn to tshark to generate a lightweight delimited text file with basic, per-packet metadata:
system("tshark -T tabs -r maze.pcapng > maze.txt") # still in the `maze` dir
Let’s take a look at the first few lines:
library(tidyverse)
read_lines("maze.txt", n_max = 10) # still in the `maze` dir
## [1] " 1\t0.000000000\t192.168.1.26\t→\t13.107.21.200\tTCP\t74\t33066 → 443 [SYN] Seq=0 Win=65330 Len=0 MSS=1390 SACK_PERM=1 TSval=1979376479 TSecr=0 WS=128"
## [2] " 2\t0.004073155\t173.223.18.66\t→\t192.168.1.26\tTCP\t66\t80 → 51754 [FIN, ACK] Seq=1 Ack=1 Win=17 Len=0 TSval=4167144401 TSecr=3610316592"
## [3] " 3\t0.004680329\t192.168.1.26\t→\t173.223.18.66\tTCP\t66\t51754 → 80 [FIN, ACK] Seq=1 Ack=2 Win=502 Len=0 TSval=3610347326 TSecr=4167144401"
## [4] " 4\t0.068286008\t192.168.1.26\t→\t13.107.21.200\tTLSv1.2\t525\tApplication Data"
## [5] " 5\t0.068326815\t192.168.1.26\t→\t13.107.21.200\tTLSv1.2\t624\tApplication Data"
## [6] " 6\t0.140702265\t192.168.1.26\t→\t13.107.21.200\tTLSv1.2\t104\tApplication Data"
## [7] " 7\t0.143929532\t13.107.21.200\t→\t192.168.1.26\tTCP\t74\t443 → 33066 [SYN, ACK] Seq=0 Ack=1 Win=14600 Len=0 MSS=1380 SACK_PERM=1 TSval=4167144541 TSecr=1979376479 WS=1024"
## [8] " 8\t0.144015466\t192.168.1.26\t→\t13.107.21.200\tTCP\t66\t33066 → 443 [ACK] Seq=1 Ack=1 Win=65408 Len=0 TSval=1979376623 TSecr=4167144541"
## [9] " 9\t0.145324645\t192.168.1.26\t→\t13.107.21.200\tTLSv1\t191\tClient Hello"
## [10] " 10\t0.148841745\t173.223.18.66\t→\t192.168.1.26\tTCP\t66\t80 → 51754 [ACK] Seq=2 Ack=2 Win=17 Len=0 TSval=4167144541 TSecr=3610347326"
This is a straightforward tab-separated values (TSV) file without a header, which means using something like readr::read_tsv() will work fine, but column names will be X1, X2, etc. We could leave them like that since this is just a small exercise and we won’t be using this packet information much, but it’s nicer to work with column names that mean something, so we’ll assign the following names when we read in the file:
- packet_num: Packet number
- ts: Time (relative to the start of the capture) the packet was seen
- src: Source address
- Kinda useless arrow that we’ll leave out of the data frame
- dst: Destination address
- proto: Protocol
- length: Packet length (bytes)
- info: General information about the packet
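As a rough illustration of that naming scheme, the sketch below reads a two-line sample (copied from the tshark output above) and drops the arrow column. It uses base R's `read.delim()` so that it has no dependencies; `read_packets()` and the temporary `arrow` column name are hypothetical, and passing the same names to `readr::read_tsv()` via `col_names` is the equivalent tidyverse route.

```r
# Column names from the list above, plus a throwaway name for the arrow column
packet_cols <- c("packet_num", "ts", "src", "arrow", "dst", "proto", "length", "info")

# Hypothetical reader for the tshark `-T tabs` output
read_packets <- function(path) {
  pkts <- read.delim(
    path,
    header = FALSE,            # the file has no header line
    col.names = packet_cols,
    quote = "",                # packet info text may contain quote characters
    strip.white = TRUE,        # packet numbers are left-padded with spaces
    stringsAsFactors = FALSE,
    encoding = "UTF-8"
  )
  pkts$arrow <- NULL           # leave the arrow out of the data frame
  pkts
}

# Two sample rows taken from the output above ("\u2192" is the arrow character)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "    1\t0.000000000\t192.168.1.26\t\u2192\t13.107.21.200\tTCP\t74\t33066 \u2192 443 [SYN] Seq=0 Win=65330 Len=0",
  "    4\t0.068286008\t192.168.1.26\t\u2192\t13.107.21.200\tTLSv1.2\t525\tApplication Data"
), tmp, useBytes = TRUE)

packets <- read_packets(tmp)
names(packets)
```

For the real capture, the call would be `read_packets("maze.txt")` from inside the `maze` directory.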
We can squeeze out a bit more up-front metadata that may come in handy later on using the tshark -z option, which lets us gather different statistics. Specifically, we’ll generate a list of IP address → host mappings (from the DNS queries that were performed during the session):
system("tshark -q -z hosts -r maze.pcapng > hosts.txt") # still in the `maze` dir
This is yet another plain-text, tab-separated file with comments and no header line (we’ll read this in and look at it in a future chapter).
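Ahead of that chapter, here is a minimal sketch of how a file with that shape (comment lines, tab-separated fields, no header) could be read. The `read_hosts()` helper, the `ip` and `host` column names, and the sample contents are all made up for illustration; the addresses are documentation-range placeholders, not values from this capture.

```r
# Hypothetical reader for a "hosts"-style file: "#" comments, tab-separated, no header
read_hosts <- function(path) {
  read.delim(
    path,
    header = FALSE,
    comment.char = "#",              # skip comment lines
    col.names = c("ip", "host"),     # assumed column names
    stringsAsFactors = FALSE
  )
}

# Made-up sample with the same general shape
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "# example comment line",
  "#",
  "",
  "192.0.2.10\texample.com",
  "198.51.100.7\texample.org"
), tmp)

hosts <- read_hosts(tmp)
hosts$host[hosts$ip == "192.0.2.10"]
```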