Skip navigation

Category Archives: duckdb

2024-08-30 UPDATE:
Binary versions of this extension are available for amd64 Linux (linux_amd64 & linux_amd64_gcc4) and Apple Silicon. (osx_arm64).

$ duckdb -unsigned
v1.0.0 1f98600c2c
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SET custom_extension_repository='https://w3c2.c20.e2-5.dev/ppcap/latest';
D INSTALL ppcap;
D LOAD ppcap;

2024-08-29 UPDATE: The Apple Silicon macOS and Linux AMD64 versions of the plugin now work with PCAP files that are “Raw IP” vs. just “Ethernet

We generate a ton of PCAP files at $DAYJOB. Since I do not always have to work directly with them, I regularly mix up or forget the various tshark, tcpdump, etc., filters and CLI parameters. While this is less of an issue in the age of LLM/GPTs (just ask local ollama to gen the CLI incantation, and it usually does a good job), each failed command makes me miss Apache Drill just a tad, since it had/has a decent, albeit basic, PCAP reading capability.

For the past few months, I’ve had an “I should build a DuckDB extension to read PCAP files” idea floating in the back of my mind. Thanks to lingering issues from long covid, I’m back in the “let’s wake him up at 0-dark-30 and not let him get back to sleep” routine, so I decided to try to scratch this itch (I was actually hoping super focused work would engender slumber, but that, too, was a big fail).

The DuckDB folks have a spiffy extension template that you can use/fork to get started. It’s been a minute since I’ve had to work in C++ land, and I’m also used to working with system-level, or vendored libraries when doing said work. So, first I had to figure out vcpkg — a C/C++ dependency manager from (ugh) Microsoft — as the DuckDB folks strongly encourage using it (and they use it). You likely do not have to get in the weeds, since there are three lines in the extension template that are (pretty much) all you really need to know/do.

Once that was done, I added libpcap to the DuckDB vcpkg deps. Then, a review of the structure of the example extension and the JSON, CSV, and Parquet reader extensions was in order to get a feel for how to add new functions, and return rectangular data from an entirely new file type.

To get started, I focused on some easy fields: source/destination IPs, timestamp, and payload length and had some oddly great success. So, of course, I had to start a Mastodon thread.

The brilliant minds at DuckDB truly made it pretty straightforward to work with list/array columns, and write new utility functions, so I just kept adding fields and functionality until time ran out (adulting is hard).

At present, the extension exposes the following fields from a PCAP file:

  • timestamp
  • source_ip
  • dest_ip
  • source_port
  • dest_port
  • length
  • tcp_session
  • source_mac
  • dest_mac
  • protocols
  • payload
  • tcp_flags
  • tcp_seq_num

It also has a read_pcap function that supports wildcards or an array of filenames. And, there are three utility functions, one that does a naive test for whether a payload is an HTTP request or response, another that extracts HTTP request headers (if present), and one more that extracts some info from ICMP packets.

Stop Telling Me And Show Me

Fine.

Here’s an incantation that naively converts all HTTP request and response packets to Parquet, since it will always be faster to use Parquet than it will be to use PCAPs:

duckdb -unsigned <<EOF
LOAD ppcap;

COPY (
  FROM 
    read_pcap('scans.pcap')
  SELECT
    *,
    is_http(payload) AS is_http,
    extract_http_request_headers(payload) AS req
) TO 'scans.parquet' (FORMAT PARQUET);
EOF

duckdb -json -s "FROM read_parquet('scans.parquet') WHERE is_http LIMIT 2" | jq
[
  {
    "timestamp": "2024-07-23 16:31:06",
    "source_ip": "94.156.71.207",
    "dest_ip": "203.161.44.208",
    "source_port": 49678,
    "dest_port": 80,
    "length": 154,
    "tcp_session": "94.156.71.207:49678-203.161.44.208:80",
    "source_mac": "64:64:9b:4f:37:00",
    "dest_mac": "00:16:3c:cb:72:42",
    "protocols": "[Ethernet, IP, TCP]",
    "payload": "GET /_profiler/phpinfo HTTP/1.1\\x0D\\x0AHost: 203.161.44.208\\x0D\\x0AUser-Agent: Web Downloader/6.9\\x0D\\x0AAccept-Charset: utf-8\\x0D\\x0AAccept-Encoding: gzip\\x0D\\x0AConnection: close\\x0D\\x0A\\x0D\\x0A",
    "tcp_flags": "[ACK, PSH]",
    "tcp_seq_num": "2072884123",
    "is_http": true,
    "req": "[{'key': Host, 'value': 203.161.44.208}, {'key': User-Agent, 'value': Web Downloader/6.9}, {'key': Accept-Charset, 'value': utf-8}, {'key': Accept-Encoding, 'value': gzip}, {'key': Connection, 'value': close}]"
  },
  {
    "timestamp": "2024-07-23 16:31:06",
    "source_ip": "203.161.44.208",
    "dest_ip": "94.156.71.207",
    "source_port": 80,
    "dest_port": 49678,
    "length": 456,
    "tcp_session": "203.161.44.208:80-94.156.71.207:49678",
    "source_mac": "00:16:3c:cb:72:42",
    "dest_mac": "64:64:9b:4f:37:00",
    "protocols": "[Ethernet, IP, TCP]",
    "payload": "HTTP/1.1 404 Not Found\\x0D\\x0ADate: Tue, 23 Jul 2024 16:31:06 GMT\\x0D\\x0AServer: Apache/2.4.52 (Ubuntu)\\x0D\\x0AContent-Length: 276\\x0D\\x0AConnection: close\\x0D\\x0AContent-Type: text/html; charset=iso-8859-1\\x0D\\x0A\\x0D\\x0A<!DOCTYPE HTML PUBLIC \\x22-//IETF//DTD HTML 2.0//EN\\x22>\\x0A<html><head>\\x0A<title>404 Not Found</title>\\x0A</head><body>\\x0A<h1>Not Found</h1>\\x0A<p>The requested URL was not found on this server.</p>\\x0A<hr>\\x0A<address>Apache/2.4.52 (Ubuntu) Server at 203.161.44.208 Port 80</address>\\x0A</body></html>\\x0A",
    "tcp_flags": "[ACK, PSH]",
    "tcp_seq_num": "2821588265",
    "is_http": true,
    "req": null
  }
]

The reason for ppcap is that I was too lazy to deal with some symbol name collisions (between the extension and libpcap) in a more fancy manner. I’ll eventually figure out how to make it just pcap. PRs welcome.

How Do I Get This?

Well, for now, it’s a bit more complex than an INSTALL ppcap. My extension is not ready for prime time, so it won’t be in the DuckDB community extensions for a while. Which means, you’ll need to install them manually, and also get used to using the -unsigned CLI flag (I’ve aliased that to duckdbu).

NOTE: you need to be running v1.0.0+ of DuckDB for this extension to work.

Here’s how to install it on macOS + Apple Silicon and test to see if it worked:

# where extensions live on macOS + Apple Silicon
mkdir -p ~/.duckdb/extensions/v1.0.0/osx_arm64

# grab and "install" the extension
curl --output ~/.duckdb/extensions/v1.0.0/osx_arm64/ppcap.duckdb_extension https://rud.is/dl/pcap/darwin-arm64/ppcap.duckdb_extension

# this should not output anyting if it worked
duckdb -unsigned -s "load ppcap"

Linux folks can sub out osx_arm64 and darwin-arm64 with linux_amd64 or linux_amd64_gcc4, depending on your system architecture, which you can find via duckdb -s "PRAGMA platform". linux_amd64_gcc4 is the architecture of the Linux amd64/x86_64 binary offered for download from DuckDB-proper.

Source is, sadly, on GitHub: https://github.com/hrbrmstr/duckdb-pcap.