![](https://i0.wp.com/rud.is/b/wp-content/uploads/2024/08/vincent-van-zalinge-MM2bY4W7G2Y-unsplash.jpg?resize=510%2C295&ssl=1)
Photo by Vincent van Zalinge on Unsplash
2024-08-30 UPDATE:
Binary versions of this extension are available for amd64 Linux (linux_amd64 & linux_amd64_gcc4) and Apple Silicon (osx_arm64).
$ duckdb -unsigned
v1.0.0 1f98600c2c
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SET custom_extension_repository='https://w3c2.c20.e2-5.dev/ppcap/latest';
D INSTALL ppcap;
D LOAD ppcap;
2024-08-29 UPDATE: The Apple Silicon macOS and Linux AMD64 versions of the plugin now work with PCAP files that are “Raw IP” vs. just “Ethernet”.
We generate a ton of PCAP files at $DAYJOB. Since I do not always have to work directly with them, I regularly mix up or forget the various tshark, tcpdump, etc., filters and CLI parameters. While this is less of an issue in the age of LLM/GPTs (just ask local ollama to gen the CLI incantation, and it usually does a good job), each failed command makes me miss Apache Drill just a tad, since it had/has a decent, albeit basic, PCAP reading capability.
For the past few months, I’ve had an “I should build a DuckDB extension to read PCAP files” idea floating in the back of my mind. Thanks to lingering issues from long covid, I’m back in the “let’s wake him up at 0-dark-30 and not let him get back to sleep” routine, so I decided to try to scratch this itch (I was actually hoping super focused work would engender slumber, but that, too, was a big fail).
The DuckDB folks have a spiffy extension template that you can use/fork to get started. It’s been a minute since I’ve had to work in C++ land, and I’m also used to working with system-level, or vendored libraries when doing said work. So, first I had to figure out vcpkg — a C/C++ dependency manager from (ugh) Microsoft — as the DuckDB folks strongly encourage using it (and they use it). You likely do not have to get in the weeds, since there are three lines in the extension template that are (pretty much) all you really need to know/do.
Once that was done, I added libpcap to the DuckDB vcpkg deps. Then, a review of the structure of the example extension and the JSON, CSV, and Parquet reader extensions was in order to get a feel for how to add new functions, and return rectangular data from an entirely new file type.
To get started, I focused on some easy fields: source/destination IPs, timestamp, and payload length and had some oddly great success. So, of course, I had to start a Mastodon thread.
The brilliant minds at DuckDB truly made it pretty straightforward to work with list/array columns, and write new utility functions, so I just kept adding fields and functionality until time ran out (adulting is hard).
At present, the extension exposes the following fields from a PCAP file:
timestamp
source_ip
dest_ip
source_port
dest_port
length
tcp_session
source_mac
dest_mac
protocols
payload
tcp_flags
tcp_seq_num
It also has a read_pcap function that supports wildcards or an array of filenames. And, there are three utility functions: one that does a naive test for whether a payload is an HTTP request or response, another that extracts HTTP request headers (if present), and one more that extracts some info from ICMP packets.
Stop Telling Me And Show Me
Fine.
Here’s an incantation that naively converts all HTTP request and response packets to Parquet, since it will always be faster to use Parquet than it will be to use PCAPs:
duckdb -unsigned <<EOF
LOAD ppcap;
COPY (
  FROM
    read_pcap('scans.pcap')
  SELECT
    *,
    is_http(payload) AS is_http,
    extract_http_request_headers(payload) AS req
) TO 'scans.parquet' (FORMAT PARQUET);
EOF
duckdb -json -s "FROM read_parquet('scans.parquet') WHERE is_http LIMIT 2" | jq
[
{
"timestamp": "2024-07-23 16:31:06",
"source_ip": "94.156.71.207",
"dest_ip": "203.161.44.208",
"source_port": 49678,
"dest_port": 80,
"length": 154,
"tcp_session": "94.156.71.207:49678-203.161.44.208:80",
"source_mac": "64:64:9b:4f:37:00",
"dest_mac": "00:16:3c:cb:72:42",
"protocols": "[Ethernet, IP, TCP]",
"payload": "GET /_profiler/phpinfo HTTP/1.1\\x0D\\x0AHost: 203.161.44.208\\x0D\\x0AUser-Agent: Web Downloader/6.9\\x0D\\x0AAccept-Charset: utf-8\\x0D\\x0AAccept-Encoding: gzip\\x0D\\x0AConnection: close\\x0D\\x0A\\x0D\\x0A",
"tcp_flags": "[ACK, PSH]",
"tcp_seq_num": "2072884123",
"is_http": true,
"req": "[{'key': Host, 'value': 203.161.44.208}, {'key': User-Agent, 'value': Web Downloader/6.9}, {'key': Accept-Charset, 'value': utf-8}, {'key': Accept-Encoding, 'value': gzip}, {'key': Connection, 'value': close}]"
},
{
"timestamp": "2024-07-23 16:31:06",
"source_ip": "203.161.44.208",
"dest_ip": "94.156.71.207",
"source_port": 80,
"dest_port": 49678,
"length": 456,
"tcp_session": "203.161.44.208:80-94.156.71.207:49678",
"source_mac": "00:16:3c:cb:72:42",
"dest_mac": "64:64:9b:4f:37:00",
"protocols": "[Ethernet, IP, TCP]",
"payload": "HTTP/1.1 404 Not Found\\x0D\\x0ADate: Tue, 23 Jul 2024 16:31:06 GMT\\x0D\\x0AServer: Apache/2.4.52 (Ubuntu)\\x0D\\x0AContent-Length: 276\\x0D\\x0AConnection: close\\x0D\\x0AContent-Type: text/html; charset=iso-8859-1\\x0D\\x0A\\x0D\\x0A<!DOCTYPE HTML PUBLIC \\x22-//IETF//DTD HTML 2.0//EN\\x22>\\x0A<html><head>\\x0A<title>404 Not Found</title>\\x0A</head><body>\\x0A<h1>Not Found</h1>\\x0A<p>The requested URL was not found on this server.</p>\\x0A<hr>\\x0A<address>Apache/2.4.52 (Ubuntu) Server at 203.161.44.208 Port 80</address>\\x0A</body></html>\\x0A",
"tcp_flags": "[ACK, PSH]",
"tcp_seq_num": "2821588265",
"is_http": true,
"req": null
}
]
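Since read_pcap also accepts wildcards or an array of filenames, the same conversion can presumably cover several captures at once. Here is a minimal sketch, assuming the glob and list syntax mirror DuckDB's built-in file readers. The capture filenames, the output names, and the pcaps_to_parquet helper are all hypothetical:

```shell
# Two ways to point read_pcap at more than one capture.
# File names are hypothetical; swap in your own.
glob_source="read_pcap('captures/*.pcap')"
list_source="read_pcap(['mon.pcap', 'tue.pcap'])"

pcaps_to_parquet() {
  # $1 = a read_pcap(...) expression, $2 = output Parquet file
  # -unsigned is needed because the extension is not signed
  duckdb -unsigned <<EOF
LOAD ppcap;
COPY (
  FROM $1
  SELECT *, is_http(payload) AS is_http
) TO '$2' (FORMAT PARQUET);
EOF
}

# Usage (requires duckdb with the ppcap extension installed):
#   pcaps_to_parquet "$glob_source" all_scans.parquet
#   pcaps_to_parquet "$list_source" two_days.parquet
```

Nothing runs until pcaps_to_parquet is called, so the block is safe to paste into a shell profile next to an alias for duckdb -unsigned.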
The reason for ppcap is that I was too lazy to deal with some symbol name collisions (between the extension and libpcap) in a more fancy manner. I’ll eventually figure out how to make it just pcap. PRs welcome.
How Do I Get This?
Well, for now, it’s a bit more complex than an INSTALL ppcap. My extension is not ready for prime time, so it won’t be in the DuckDB community extensions for a while. Which means you’ll need to install it manually, and also get used to using the -unsigned CLI flag (I’ve aliased that to duckdbu).
NOTE: you need to be running v1.0.0+ of DuckDB for this extension to work.
Here’s how to install it on macOS + Apple Silicon and test to see if it worked:
# where extensions live on macOS + Apple Silicon
mkdir -p ~/.duckdb/extensions/v1.0.0/osx_arm64
# grab and "install" the extension
curl --output ~/.duckdb/extensions/v1.0.0/osx_arm64/ppcap.duckdb_extension https://rud.is/dl/pcap/darwin-arm64/ppcap.duckdb_extension
# this should not output anything if it worked
duckdb -unsigned -s "load ppcap"
Linux folks can sub out osx_arm64 and darwin-arm64 with linux_amd64 or linux_amd64_gcc4, depending on your system architecture, which you can find via duckdb -s "PRAGMA platform". linux_amd64_gcc4 is the architecture of the Linux amd64/x86_64 binary offered for download from DuckDB-proper.
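The substitution described above can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, and the Linux download paths are assumed to follow the pattern implied by swapping darwin-arm64 for the platform name.

```shell
# Platform strings are the ones reported by: duckdb -s "PRAGMA platform"

ppcap_ext_dir() {
  # $1 = platform (e.g. osx_arm64), $2 = DuckDB version (defaults to v1.0.0)
  printf '%s/.duckdb/extensions/%s/%s\n' "$HOME" "${2:-v1.0.0}" "$1"
}

ppcap_url() {
  # The macOS build lives under darwin-arm64; the Linux builds are
  # assumed to live under a directory named after the platform.
  case "$1" in
    osx_arm64) slug="darwin-arm64" ;;
    linux_amd64|linux_amd64_gcc4) slug="$1" ;;
    *) echo "no prebuilt binary for $1" >&2; return 1 ;;
  esac
  printf 'https://rud.is/dl/pcap/%s/ppcap.duckdb_extension\n' "$slug"
}

install_ppcap() {
  # $1 = platform; creates the extension directory and downloads the binary
  dir="$(ppcap_ext_dir "$1")" && url="$(ppcap_url "$1")" || return 1
  mkdir -p "$dir" && curl --output "$dir/ppcap.duckdb_extension" "$url"
}

# Usage:
#   install_ppcap linux_amd64_gcc4
#   duckdb -unsigned -s "load ppcap"   # silent if it worked
```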
Source is, sadly, on GitHub: https://github.com/hrbrmstr/duckdb-pcap.
Where Do We Go From Here?
Christian nationalists, the GOP, clueless Dems, & SCOTUS have done quite a bit of real, serious damage to the fragile state of democracy & discourse in the U.S. They’ve also set back 70+ years of hard-fought advancements at record speed.
It’s a big, complex problem to solve, and, if it all feels overwhelming, asymmetric & unfair, well, it is. Demoralizing us into anger-fueled apathy is one of the goals.
In one side of my profession, a process called “decomposition” is used to break a complex problem into smaller subproblems that are easier to solve. The subproblems are solved recursively or iteratively. The solutions to the subproblems are then combined to solve the original larger problem. Some subproblems can be solved in parallel, with different folks working on different subproblems; others need to wait for some subproblems to be solved first.
The big list of components of this present concern/danger needs to be enumerated & documented, so the actual subproblems can be identified. Nobody is going to do that for you/us. It’s a cognitively & emotionally painful, but important, step. But, it will also help folks explain the big picture to friends/family. The massive scope has to be understood, if only to help others grok that there is no quick, magic solution. It’s going to be a long, hard slog.
Once that settles in, the most essential thing happens next: what is the first subproblem that needs to be solved? It provides focus and — more importantly — an accomplishable goal.
I think it’s fair to say that there is one bigger subproblem that fits into the next category: working to ensure no member of the GOP gets elected to any office in any capacity anywhere in the U.S. for the next forty years. They’ve demonstrated they can’t be trusted with power, and that they have no integrity or shame.
That’s a big subproblem.
Decomposition ultimately brings that down to what an individual can do in the place they live, which means you & I need to prevent GOP-aligned folks from being elected to:
Organizations like Indivisible can help provide tangible, accomplishable tasks to complete in order to make that reality. Also: it kind of doesn’t matter “who” is opposite a GOP contender. What matters is that they’re neither in the GOP nor a fringe third-party with no chance of being elected. We’re all going to have to put away our pet desires/agendas for a few decades if we really want a foundation for change that can be built on.
A parallel, bigger subproblem, is watching what local, state, and federal legislation is being floated/worked on, and pushing back hard. GovTrack, POGO, Common Cause, Public Citizen and others can help with that, but the onus is on us to do the actual challenging.
And, another, parallel subproblem to work on is building community resilience and mutual aid networks because we’re not going to fix every instance of every subproblem (e.g., despite solid efforts, GOP folks are almost certainly going to get elected to positions of power, and, thus, pose a real threat). GWU’s Center for Community Resilience is not a bad place to get started learning how to do that.
Those are three, tangible, accomplishable subproblems anyone can work on (even if you just focus on one of them).
It’s going to take a very long time to course correct. It will be very painful. And, far too many folks will get hurt along the way until things get better. But, giving in and doing nothing aren’t options.