Chapter 3 Turning PCAPs into Data

Technically, PCAP files are, in fact, data files. We could try to operate on them directly with some hacky R packages this author might have made, or via the {reticulate} R package, which makes the Python universe accessible directly in R (and said universe has many modules for working with PCAPs). However, zeek and tshark do a significant amount of heavy lifting so we do not have to, and in a “real life” incident response situation, time can be paramount.

3.1 PCAP Metadata

We can get an overview of the PCAP file contents with the capinfos utility that comes along for the ride with tshark:

cat(system("capinfos maze/maze.pcapng", intern=TRUE), sep="\n")
## File name:           maze/maze.pcapng
## File type:           Wireshark/... - pcapng
## File encapsulation:  Ethernet
## File timestamp precision:  nanoseconds (9)
## Packet size limit:   file hdr: (not set)
## Number of packets:   45k
## File size:           38MB
## Data size:           36MB
## Capture duration:    450.466647147 seconds
## First packet time:   2021-04-29 21:00:51.031550031
## Last packet time:    2021-04-29 21:08:21.498197178
## Data byte rate:      82kBps
## Data bit rate:       656kbps
## Average packet size: 821.02 bytes
## Average packet rate: 99 packets/s
## SHA256:              3cc2061959afb116aeedce2736809f28236b96e20b89b4199194f4a30a0802ba
## RIPEMD160:           e278720f93a6d58b45d1614877eb92b91ef0a684
## SHA1:                0a40bcf2c2329ddf72bd864f05855314cb76514d
## Strict time order:   False
## Capture hardware:    Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz (with SSE4.2)
## Capture oper-sys:    Linux 5.4.0-72-generic
## Capture application: Dumpcap (Wireshark) 3.2.3 (Git v3.2.3 packaged as 3.2.3-1)
## Number of interfaces in file: 1
## Interface #0 info:
##                      Name = wlo1
##                      Encapsulation = Ethernet (1 - ether)
##                      Capture length = 262144
##                      Time precision = nanoseconds (9)
##                      Time ticks per second = 1000000000
##                      Time resolution = 0x09
##                      Operating system = Linux 5.4.0-72-generic
##                      Number of stat entries = 1
##                      Number of packets = 45024

For ~8 minutes of activity we have a 38MB PCAP file with ~45 thousand packets captured in April. “Real life” PCAPs of target user activity would likely not be so diminutive. Thankfully, the maze runners were kind to us.
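
If we want these values as data rather than printed text, the long-format report is easy enough to split into key/value pairs. The following is a minimal base-R sketch, not part of the original workflow: `parse_capinfos()` is a hypothetical helper name, and the input is a few lines copied from the output above so the example runs without the PCAP on hand.

```r
# A few lines copied from the capinfos report above; a real run would use
# system("capinfos maze/maze.pcapng", intern = TRUE) instead.
capinfos_lines <- c(
  "File name:           maze/maze.pcapng",
  "Number of packets:   45k",
  "Capture duration:    450.466647147 seconds",
  "First packet time:   2021-04-29 21:00:51.031550031"
)

# hypothetical helper: turn "key: value" report lines into a data frame
parse_capinfos <- function(lines) {
  # keep only "key: value" lines; this skips the "Name = wlo1" style
  # interface details and bare section labels such as "Interface #0 info:"
  lines <- lines[grepl("^[^:=]+:\\s+\\S", lines)]
  data.frame(
    # split on the first colon only, since timestamps contain colons too
    key = trimws(sub(":.*$", "", lines)),
    value = trimws(sub("^[^:]+:", "", lines)),
    stringsAsFactors = FALSE
  )
}

meta <- parse_capinfos(capinfos_lines)
meta$value[meta$key == "Capture duration"]
```

The values stay as character strings here; converting, say, the duration to a number is left to whoever needs it.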

3.2 Processing PCAPs with Zeek

We’ll first generate a series of standard Zeek “log” files that are packet-capture feature-specific structured files. We’ve enabled the mac-logging policy script so certain log files will also contain MAC addresses of the nodes (since some questions ask about those).

wd <- getwd()
setwd("maze")

system("/opt/zeek/bin/zeek --no-checksums --readfile maze.pcapng policy/protocols/conn/mac-logging")

We can see if that worked by getting a directory listing:

list.files() # we're still in the `maze` directory
##  [1] "conn.log"          "dns.log"           "files.log"        
##  [4] "ftp-q.json"        "ftp-q1.json"       "ftp.log"          
##  [7] "hosts.txt"         "http.log"          "maze.pcapng"      
## [10] "maze.txt"          "packet_filter.log" "proton-q.json"    
## [13] "ssl.log"           "tunnel.log"        "weird.log"        
## [16] "x509.log"
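
Eyeballing the listing is fine for one capture, but the same check can be scripted. This is a small sketch under assumptions: `missing_zeek_logs()` is a hypothetical helper, and the default set of "expected" logs is only an illustration, since Zeek writes a given log only when the PCAP contains matching traffic.

```r
# hypothetical helper: report which expected Zeek logs are absent from `dir`
missing_zeek_logs <- function(dir = ".",
                              expected = c("conn.log", "dns.log", "http.log")) {
  found <- list.files(dir, pattern = "\\.log$") # only the *.log files
  setdiff(expected, found)                       # expected but not found
}

# returns character(0) when every expected log is present
missing_zeek_logs(".")
```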

Each log file has different information based upon what was contained in the PCAP. For our example, these are the logs that were generated. Follow the links to learn more about what is in each of them.

3.2.1 Zeek Log File Helper Function

Zeek logs are well-structured files that, by default, have a very informative header:

read_lines("conn.log", n_max = 9) # still in the `maze` dir
## [1] "#separator \\x09"                                                                                                                                                                                                                                                         
## [2] "#set_separator\t,"                                                                                                                                                                                                                                                        
## [3] "#empty_field\t(empty)"                                                                                                                                                                                                                                                    
## [4] "#unset_field\t-"                                                                                                                                                                                                                                                          
## [5] "#path\tconn"                                                                                                                                                                                                                                                              
## [6] "#open\t2021-07-19-16-04-28"                                                                                                                                                                                                                                               
## [7] "#fields\tts\tuid\tid.orig_h\tid.orig_p\tid.resp_h\tid.resp_p\tproto\tservice\tduration\torig_bytes\tresp_bytes\tconn_state\tlocal_orig\tlocal_resp\tmissed_bytes\thistory\torig_pkts\torig_ip_bytes\tresp_pkts\tresp_ip_bytes\ttunnel_parents\torig_l2_addr\tresp_l2_addr"
## [8] "#types\ttime\tstring\taddr\tport\taddr\tport\tenum\tstring\tinterval\tcount\tcount\tstring\tbool\tbool\tcount\tstring\tcount\tcount\tcount\tcount\tset[string]\tstring\tstring"                                                                                           
## [9] "1619744451.035623\tC27owk3xZcWuCyTBL7\t192.168.1.26\t51754\t173.223.18.66\t80\ttcp\t-\t0.000607\t0\t0\tSF\t-\t-\t0\t^fFa\t1\t52\t2\t104\t-\tc8:09:a8:57:47:93\tca:0b:ad:ad:20:ba"

As such, having a small helper function to deal with assigning valid column names and skipping past the header will be helpful:

read_zeek_log <- function(path) {
  
  # get column names from the "#fields" line (7th line of the header)
  read_lines(path[1], n_max = 7) %>% 
    last() %>% 
    strsplit("\t") %>% 
    unlist() %>% 
    tail(-1) -> cols # drop the leading "#fields" token
  
  # read the records; every header (and the trailing "#close") line starts with "#"
  suppressMessages(
    read_tsv(
      file = path[1], 
      col_names = cols, 
      comment = "#" # read, skipping header
    )
  )
  
}

A more robust version would examine all the header parameters and use them accordingly, but we don’t need something that sophisticated to deal with this packet maze challenge.
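
For the curious, here is a rough sketch of what "examining the header" might look like. It is not the helper used in this book: `read_zeek_log_full()` is a hypothetical name, it uses base R rather than {readr} so it runs without extra packages, and the mapping of Zeek types to numeric columns is an assumption that covers only the simple scalar types. The sample log is a trimmed-down version of the header shown above with just four fields.

```r
# hypothetical helper: read a Zeek TSV log, honoring its header directives
read_zeek_log_full <- function(path) {
  lines <- readLines(path, warn = FALSE)
  hdr <- lines[startsWith(lines, "#")]

  # "#separator \x09" is the one header line delimited by a space; its value
  # is an escaped hex byte, so decode it into the actual character
  sep_hex <- sub("^#separator \\\\x", "", hdr[startsWith(hdr, "#separator")])
  sep <- rawToChar(as.raw(strtoi(sep_hex, base = 16L)))

  # every other header line is "#name<sep>value[<sep>value...]"
  header_values <- function(name) {
    line <- hdr[startsWith(hdr, paste0("#", name, sep))]
    strsplit(line, sep, fixed = TRUE)[[1]][-1]
  }

  df <- read.table(
    text = lines[!startsWith(lines, "#")], # records only
    sep = sep, quote = "", comment.char = "",
    col.names = header_values("fields"),
    na.strings = c(header_values("unset_field"), header_values("empty_field")),
    colClasses = "character", stringsAsFactors = FALSE
  )

  # assumption: these Zeek types can safely be treated as plain numbers
  numeric_types <- c("time", "interval", "count", "int", "double", "port")
  is_num <- header_values("types") %in% numeric_types
  df[is_num] <- lapply(df[is_num], as.numeric)
  df
}

# trimmed sample: same header layout as conn.log, but only four fields
sample_log <- c(
  "#separator \\x09",
  "#set_separator\t,",
  "#empty_field\t(empty)",
  "#unset_field\t-",
  "#path\tconn",
  "#fields\tts\tuid\tid.orig_h\tservice",
  "#types\ttime\tstring\taddr\tstring",
  "1619744451.035623\tC27owk3xZcWuCyTBL7\t192.168.1.26\t-"
)
tmp <- tempfile(fileext = ".log")
writeLines(sample_log, tmp)

conn <- read_zeek_log_full(tmp)
str(conn)
```

Set-valued columns (such as `tunnel_parents`) are left as plain strings in this sketch; splitting them on the `#set_separator` value would be the next step.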

3.3 Processing PCAPs with tshark

Zeek is great, but some questions ask about packet numbers (and, it’s often helpful to have packet-level information available in general). For this, we’ll turn to tshark to generate a lightweight delimited text file with basic, per-packet metadata:

system("tshark -T tabs -r maze.pcapng > maze.txt") # still in the `maze` dir

Let’s take a look at the first few lines:

library(tidyverse)

read_lines("maze.txt", n_max = 10) # still in the `maze` dir
##  [1] "    1\t0.000000000\t192.168.1.26\t→\t13.107.21.200\tTCP\t74\t33066 → 443 [SYN] Seq=0 Win=65330 Len=0 MSS=1390 SACK_PERM=1 TSval=1979376479 TSecr=0 WS=128"                     
##  [2] "    2\t0.004073155\t173.223.18.66\t→\t192.168.1.26\tTCP\t66\t80 → 51754 [FIN, ACK] Seq=1 Ack=1 Win=17 Len=0 TSval=4167144401 TSecr=3610316592"                                 
##  [3] "    3\t0.004680329\t192.168.1.26\t→\t173.223.18.66\tTCP\t66\t51754 → 80 [FIN, ACK] Seq=1 Ack=2 Win=502 Len=0 TSval=3610347326 TSecr=4167144401"                                
##  [4] "    4\t0.068286008\t192.168.1.26\t→\t13.107.21.200\tTLSv1.2\t525\tApplication Data"                                                                                            
##  [5] "    5\t0.068326815\t192.168.1.26\t→\t13.107.21.200\tTLSv1.2\t624\tApplication Data"                                                                                            
##  [6] "    6\t0.140702265\t192.168.1.26\t→\t13.107.21.200\tTLSv1.2\t104\tApplication Data"                                                                                            
##  [7] "    7\t0.143929532\t13.107.21.200\t→\t192.168.1.26\tTCP\t74\t443 → 33066 [SYN, ACK] Seq=0 Ack=1 Win=14600 Len=0 MSS=1380 SACK_PERM=1 TSval=4167144541 TSecr=1979376479 WS=1024"
##  [8] "    8\t0.144015466\t192.168.1.26\t→\t13.107.21.200\tTCP\t66\t33066 → 443 [ACK] Seq=1 Ack=1 Win=65408 Len=0 TSval=1979376623 TSecr=4167144541"                                  
##  [9] "    9\t0.145324645\t192.168.1.26\t→\t13.107.21.200\tTLSv1\t191\tClient Hello"                                                                                                  
## [10] "   10\t0.148841745\t173.223.18.66\t→\t192.168.1.26\tTCP\t66\t80 → 51754 [ACK] Seq=2 Ack=2 Win=17 Len=0 TSval=4167144541 TSecr=3610347326"

This is a straightforward tab-separated values (TSV) file without a header, which means using something like readr::read_tsv() will work fine, but column names will be X1, X2, etc. We could leave them like that since this is just a small exercise and we won’t be using this packet information much, but it’s nicer to work with column names that mean something, so we’ll assign the following names when we read in the file:

  • packet_num: Packet number
  • ts: Time (relative to the start of the capture) the packet was seen
  • src: Source address
  • Kinda useless arrow that we’ll leave out of the data frame
  • dst: Destination address
  • proto: Protocol
  • length: Packet length (bytes)
  • info: General information about the packet
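
To make the naming concrete, here is a minimal base-R sketch of reading rows in this layout with the names listed above. It is an illustration rather than the code used later on: the two rows are copied from the output shown earlier (with the arrow written as "\u2192"), and `arrow` is just a placeholder name for the column we throw away.

```r
# Two rows in the same layout as maze.txt; a real run would pass
# file = "maze.txt" to read.delim() instead of text = sample_rows.
sample_rows <- c(
  "    4\t0.068286008\t192.168.1.26\t\u2192\t13.107.21.200\tTLSv1.2\t525\tApplication Data",
  "    5\t0.068326815\t192.168.1.26\t\u2192\t13.107.21.200\tTLSv1.2\t624\tApplication Data"
)

packet_cols <- c("packet_num", "ts", "src", "arrow", "dst", "proto", "length", "info")

packets <- read.delim(
  text = sample_rows,
  header = FALSE,          # the file has no header line
  col.names = packet_cols,
  quote = "",              # info text may contain quote characters
  strip.white = TRUE,      # packet numbers are left-padded with spaces
  stringsAsFactors = FALSE
)

packets$arrow <- NULL      # drop the arrow column
str(packets)
```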

We can squeeze out a bit more up-front metadata that may come in handy later on using the tshark -z option, which lets us gather different statistics. Specifically, we’ll generate a list of IP address → host mappings (from the DNS queries that were performed during the session):

system("tshark -q -z hosts -r maze.pcapng > hosts.txt") # still in the `maze` dir

This is yet another plain-text, tab-separated file with comments and no header line (we’ll read this in and look at it in a future chapter).
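
As a quick preview, a file shaped like that (comment lines starting with `#`, then one address/name pair per line) can be read with base R as sketched below. The sample content is made up for illustration: the addresses and names are documentation placeholders, not values from the maze capture, and the column names `ip` and `host` are assumptions.

```r
# Made-up sample in the described shape; a real run would pass
# file = "hosts.txt" instead of text = sample_hosts.
sample_hosts <- c(
  "# TShark hosts output",
  "#",
  "",
  "192.0.2.10\texample.com",
  "192.0.2.20\texample.org"
)

hosts <- read.delim(
  text = sample_hosts,
  header = FALSE,            # no header line
  comment.char = "#",        # skip the comment lines
  col.names = c("ip", "host"),
  stringsAsFactors = FALSE
)

hosts
```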