
Category Archives: data science

At our previous employer, the global deception and detection infrastructure generated tons of events that eventually made their way into an ever-growing data lake holding (as of February 2026) 22 TB of PCAPs and 32 TB of session protocol data. When trying to find novel and truly dangerous attacker behavior, the bottleneck isn’t the data — it’s the analyst trying to hold it all in their head while toggling between Arkime, Censys, VirusTotal, and five other tabs.

Glenn Thorpe and I built Orbie to attack that problem. It’s a prompt-engineered analytical system running in Claude Code that coordinates 16 data source integrations, 8 investigation skills, and 2 background enrichment agents across structured, reproducible workflows — with one rule we never bent on: never assume, always query, show your work.

The full architecture, the failure modes, and where it’s going are in the talk we gave at the February 2026 installment of [un]prompted, above, and you can get some more info and freebies at https://github.com/GreyNoise-Intelligence/2026-labs-unprompted.

There will likely be another [un]prompted later this year, and I highly recommend attending and — if you have some of your own accomplishments to share — presenting. It was an incredible experience.

In the past ~4 weeks I have personally observed some irrefutable things in “AI” that are very likely going to cause massive shocks to employment models in IT, software development, systems administration, and cybersecurity. I know some have already seen minor shocks. They are nothing compared to what is very likely ahead.

Nobody likely wants to hear this, but you absolutely need to make or take time this year to identify what you can do that AI cannot do — and develop some of those capabilities if your list is short or empty.

The weavers in the 1800s used violence to get a 20-year pseudo-reprieve before they were pushed into obsolescence. We’ve got maybe 18 months. I’m as pushback-on-this-“AI”-thing as makes sense. I’d like for the bubble to burst. Even if it does, the rulers of our clicktatorship will just fuel a quick rebuild.

Four human-only capabilities in security

In my (broad) field, I think there are some things that make humans 110% necessary. Here’s my list — and it’d be great if folks in very subdomain-specific parts of cyber would provide similar ones. I try to stay in my lane.

1. Judgment under uncertainty with real consequences

These new “AI” systems can use tools to analyze a gazillion sessions and cluster payloads, but they do not (or absolutely should not) bear responsibility for the “we’re pulling the plug on production” decision at 3am. This “weight of consequence” shapes human expertise in ways that inform intuition, risk tolerance, and the ability to act decisively with incomplete information.

Organizations will continue needing people who can own outcomes, not just produce analysis.

2. Adversarial creativity and novel problem framing

The more recent “AI” systems are actually darn good at pattern matching against known patterns and recombining existing approaches. They absolutely suck at the “genuinely novel” — the attack vector nobody has documented, the defensive technique that requires understanding how a specific organization actually operates versus how it should operate.

The best security practitioners think like attackers in ways that go beyond “here are common TTPs.”

3. Institutional knowledge and relationship capital

A yuge one.

Understanding that the finance team always ignores security warnings — especially Dave — during quarter-close. That the legacy SCADA system can’t be patched because the vendor went bankrupt in 2019. That the CISO and CTO have a long-running disagreement about cloud migration.

This context shapes what recommendations are actually actionable. Many technically correct analyses are organizationally useless.

4. The ability to build and maintain trust

The biggest one.

When a breach happens, executives don’t want a report from an “AI”. They want someone who can look them in the eye, explain what happened, and take ownership of the path forward. The human element of security leadership is absolutely not going away.

How to develop these capabilities

Develop depth in areas that require your presence or legal accountability — disciplines such as incident response, compliance attestation, or security architecture for air-gapped or classified environments. These have regulatory and practical barriers to full automation.

Build expertise in the seams between systems. Understanding how a given combination of legacy mainframe, cloud services, and OT environment actually interconnects requires the kind of institutional archaeology (or the powers of a sexton) that doesn’t exist in training data.

Get comfortable being the human in the loop. I know saying this will earn me a lot of mute and block taps, but you’re going to need to get comfortable being the human in the loop for “AI”-augmented workflows. The analyst who can effectively direct tools, validate outputs (b/c these things will always make stuff up), and translate findings for different audiences has a different job than before, but still a necessary one.

Learn to ask better questions. Bring your hypotheses, domain expertise, and knowing which threads are worth pulling to the table. That editorial judgment about what matters is undervalued, and is going to take a while to infuse into “AI” systems.

We’re all John Henry now

A year ago, even with long covid brain fog, I could out-“John Henry” all of the commercial AI models at programming, cyber, and writing tasks — both in speed and quality.

Now, with the fog gone, I’m likely ~3 months away from being slower than “AI” on a substantial number of core tasks that it can absolutely do. I’ve seen it. I’ve validated the outputs. It sucks. It really really sucks. And it’s not because I’m feeble or have some other undisclosed brain condition (unlike 47). These systems are being curated to do exactly that: erase all of us John Henrys.

The folks who thrive will be those who can figure out what “AI” capabilities aren’t complete garbage and wield them with uniquely human judgment rather than competing on tasks where “AI” has clear advantages.

The pipeline problem

The very uncomfortable truth: there will be fewer entry-level positions that consist primarily of “look at alerts and escalate.” That pipeline into the field is narrowing at a frightening pace.

What concerns me most isn’t the senior practitioners. We’ll adapt and likely become that much more effective. It’s the junior folks who won’t get the years of pattern exposure that built our intuition in the first place.

That’s a pipeline problem the industry hasn’t seriously grappled with yet — and isn’t likely to b/c of the hot, thin air in the offices and boardrooms of myopic and greedy senior executives.

It was a rainy weekend in southern Maine and I really didn’t feel like doing chores, so I was skimming through RSS feeds and noticed a link to a PacketMaze challenge in the latest This Week In 4n6.

Since it’s also been a while since I’ve done any serious content delivery (on the personal side, anyway), I thought it’d be fun to solve the challenge with some tools I like — namely Zeek, tshark, and R (links to those in the e-book I’m linking to below), craft some real exposition around each solution, and bundle it all up into an e-book and lighter-weight GitHub repo.

There are 11 “quests” in the challenge, requiring sifting through a packet capture (PCAP) and looking for various odds and ends (some are very winding maze passages). The challenge ranges from extracting images and image metadata from FTP sessions, to pulling out precise elements in TLS sessions, to dealing with IPv6.

This is far from an expert challenge, and anyone can likely work through it with a little bit of elbow grease.

As it says on the tin, not all data is ‘big’ nor do all data-driven cybersecurity projects require advanced modeling capabilities. Sometimes you just need to dissect some network packet capture (PCAP) data and don’t want to click through a GUI to get the job done. This short book works through the questions in CyberDefenders Lab #68 to show how you can get the Zeek open source network security tool, tshark command-line PCAP analysis Swiss army knife, and R (via RStudio) working together.
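
If you want a taste of the workflow before diving into the book, here’s a minimal sketch of the Zeek-to-R half of the pipeline. The PCAP file name is a placeholder, and the column names are Zeek’s default `conn.log` fields:

library(readr)
library(dplyr)

# Zeek has already been run against the capture, e.g.:
#   zeek -C -r challenge.pcap   # "challenge.pcap" is a placeholder name
# which drops conn.log, dns.log, files.log, etc. into the working directory.

# Zeek TSV logs carry '#'-prefixed metadata lines; skip them and supply
# the default conn.log field names ourselves
conn_cols <- c(
  "ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p",
  "proto", "service", "duration", "orig_bytes", "resp_bytes",
  "conn_state", "local_orig", "local_resp", "missed_bytes",
  "history", "orig_pkts", "orig_ip_bytes", "resp_pkts",
  "resp_ip_bytes", "tunnel_parents"
)

conn <- read_tsv("conn.log", comment = "#", col_names = conn_cols)

# a first orienting question for any of the quests: which services
# showed up in the capture, and how often?
count(conn, service, sort = TRUE)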

FIN

If you find the resource helpful or have other feedback, drop a note on Twitter (@hrbrmstr), in a comment here, or as a GitHub issue.

The 2018 IEEE Security & Privacy Conference is in May, but they’ve posted their full proceedings and it’s better to grab them early than to wait for them to become part of a paid journal offering.

There are a lot of papers. Not all match my interests, but (fortunately?) many do, and I’ve filtered down a list of the more interesting (to me) ones. It’s encouraging to see academic cybersecurity researchers branching out across a whole host of areas.

I can’t promise a “the morning paper”-esque daily treatment of these on the blog, but I’ll likely exposit a few of them over the coming weeks. I’ve emoji’d a few that stood out. Order is the order I read them in (no other meaning to the order).

I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap.

The project goal is noble: crowdsource and make a repository of open speech data for researchers to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words.

As I meandered through the blog post I looked in horror at the URL for the application that did the recording: https://open-speech-commands.appspot.com/.

Why would the goal of the project combined with that URL give pause? Read on!

You’ve Got Scams!

Getting people to pick up the phone and say something as simple as ‘Yes’ has been a major scam this year. By recording your voice, attackers can replay it at phone prompts, and because it’s your voice it’s harder for you to refute the “evidence” and it can foil recognition systems that check for your actual voice.

The Better Business Bureau has logged over 5,000 of these scams this year (searching for ‘phishing’ and ‘yes’). You can play with the data (a bit — the package needs work) in R with scamtracker.

Now, these are “analog” attacks (i.e. a human spends time socially engineering a human). Bookmark this as you peruse section 2.

Integrity Challenges in 2017

I “trust” Pete’s intentions, but I sure don’t trust open-speech-commands.appspot.com (and, you shouldn’t either). Why? Go visit https://totally-harmless-app.appspot.com. It’s a Google App Engine app I made for this post. Anyone can make an appspot app, and the https is meaningless as far as integrity & authenticity go, since I’m running on Google’s infrastructure but I’m not Google.

You can’t really trust most SSL/TLS sessions as far as site integrity goes anyway. Let’s Encrypt put the final nail in the coffin with their Certs Gone Wild! initiative. With super-recent browser updates you can almost trust your eyes again when it comes to URLs, but you should be very wary of entering your info — especially uploading voice, prints or eye/face images — into any input box on any site if you aren’t 100% sure it’s a legit site that you trust.

Tracking the Trackers

If you don’t know that you’re being tracked 100% of the time on the internet then you really need to read up on the modern internet.

In many cases your IP address can directly identify you. In most cases your device & browser profile — which most commercial sites log — can directly identify you. So, just visiting a web site means that it’s highly likely that web site can know that you are both not a dog and are in fact you.
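
To make that concrete, here’s a minimal sketch (with made-up attribute values) of why a device & browser profile works as an identifier — a handful of attributes any site can read hash down to a stable, near-unique key:

library(digest)

# attributes of the kind sites routinely collect; values are made up
profile <- list(
  user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) ...",
  screen     = "2560x1600x24",
  timezone   = "America/New_York",
  language   = "en-US",
  fonts      = c("Helvetica Neue", "Menlo", "Avenir")
)

# the same profile hashes to the same key on every visit, so a site
# can recognize a returning visitor without setting a single cookie
digest(profile, algo = "sha256")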

Still Waiting for the “So, What?”

Many states and municipalities have engaged in awareness campaigns to warn citizens about the “Say ‘Yes'” scam. Asking someone to record themselves saying ‘Yes’ into a random web site pretty much negates that advice.

Folks like me regularly warn about trust on the internet. I could have cloned the functionality of the original site to open-speech-commmands.appspot.com. Did you even catch the 3rd ‘m’ there? Even without that, it’s an appspot.com domain. Anyone can set one up.
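
Here’s a quick base R illustration of just how subtle that kind of look-alike is — `adist()` shows the clone is a single edit away from the real thing:

legit <- "open-speech-commands.appspot.com"
seen  <- c("open-speech-commands.appspot.com",   # the real one
           "open-speech-commmands.appspot.com")  # the 3rd-'m' clone

# Levenshtein distance: 0 = exact match, 1 = one sneaky character away
adist(legit, seen)
##      [,1] [,2]
## [1,]    0    1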

Even if the site doesn’t ask for your name or other info and just asks for your ‘Yes’, it can know who you are. In fact, when you’re enabling the microphone to do the recording, it could even take a picture of you if it wanted to (and you’d likely not know or not object since it’s for SCIENCE!).

So, in the worst case scenario a malicious entity could be asking you for your ‘Yes’, tying it right to you and then executing the post-scam attacks that were being performed in the analog version.

But go ahead and assume this is a legit site with good intentions. Do you really know what’s being logged when you commit your voice info? If the data was mishandled, it would be just as easy to tie the voice files back to you (assuming a certain level of data logging).

The “so what” is not really a warning to users but a message to researchers: you need to threat model your experiments and research initiatives, especially when innocent end users are potentially being put at risk. Data is the new gold, diamonds, and other precious bits that attackers are after. You may think you’re not putting folks at risk and aren’t even a hacker target, but how you design data gathering can reinforce good or bad behaviour on the part of users. It can solidify sound security messages or tear them down. And, you and your data may be more of a target than you really know.

Reach out to interdisciplinary colleagues to help threat model your data collection, storage and dissemination methods to ensure you aren’t putting yourself or others at risk.

FIN

Pete did the right thing, and I’m sure the site will be on a “proper” domain soon. When it is, I’ll be one of the first in line to help make a much-needed open data set for research purposes.

The ASA (American Statistical Association) has been working in collaboration with the ACM (Association for Computing Machinery) on developing a data science curriculum for two-year colleges. Part of this development is the need to understand the private-sector demand for two-year college data science graduates and how often employers need to invest in continuing education for existing employees to fill critical skills gaps in “knowledge workers”.

By taking this survey, participating in a workshop/summit, or — especially — submitting a letter of collaboration, you will be helping to prepare your organization and our combined workforce to meet the challenges of succeeding in an increasingly data-driven world.

Take The ASA/ACM Data Science Survey

Optionally (or additionally), you may contact Mary Rudis directly (mrudis@ccsnh.edu) for more information on this joint ASA/ACM initiative and how you can help.

The insanely productive elf-lord, @quominus put together a small package ([`triebeard`](https://github.com/ironholds/triebeard)) that exposes an API for [radix/prefix tries](https://en.wikipedia.org/wiki/Trie) at both the R and Rcpp levels. I know he had some personal needs for this and we both kinda need these to augment some functions in our `iptools` package. Despite `triebeard` having both a vignette and function-level examples, I thought it might be good to show a real-world use of the package (at least in the cyber real world): fast determination of which [autonomous system](https://en.wikipedia.org/wiki/Autonomous_system_(Internet)) an IPv4 address is in (if it’s in one at all).

I’m not going to delve too deep into routing (you can find a good primer [here](http://www.kixtart.org/forums/ubbthreads.php?ubb=showflat&Number=81619&site_id=1#import) and one that puts routing in the context of radix tries [here](http://www.juniper.net/documentation/en_US/junos14.1/topics/usage-guidelines/policy-configuring-route-lists-for-use-in-routing-policy-match-conditions.html)) but there exist, essentially, abbreviated tables of which IP addresses belong to a particular network. These tables are in routers on your local networks and across the internet. Groups of these networks (on the internet) are composed into those autonomous systems I mentioned earlier and these tables are used to get the packets that make up the cat videos you watch routed to you as efficiently as possible.

When dealing with cybersecurity data science, it’s often useful to know which autonomous system an IP address belongs in. The world is indeed full of peril and in it there are many dark places. It’s a dangerous business, going out on the internet and we sometimes find it possible to identify unusually malicious autonomous systems by looking up suspicious IP addresses en masse. These mappings look something like this:

CIDR            ASN
1.0.0.0/24      47872
1.0.4.0/24      56203
1.0.5.0/24      56203
1.0.6.0/24      56203
1.0.7.0/24      38803
1.0.48.0/20     49597
1.0.64.0/18     18144

Each CIDR has a start and end IP address which can ultimately be converted to integers. Now, one _could_ just sequentially compare start and end ranges to see which CIDR an IP address belongs in, but there are (as of the day of this post) `647,563` CIDRs to compare against, which—in the worst case—would mean having to traverse through the entire list to find the match (or discover there is no match). There are some trivial ways to slightly optimize this, but the search times could still be fairly long, especially when you’re trying to match a billion IPv4 addresses to ASNs.
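
For contrast, here’s what the naïve approach looks like — a sketch only (it leans on the `rip_df` table we build a bit further down), but it makes the worst case plain:

library(iptools)

# naïve linear scan: compute each CIDR's numeric start/end once, then
# check an address against every range until something matches
cidr_start <- ip_to_numeric(rip_df$ip)
cidr_end   <- cidr_start + 2^(32 - as.integer(rip_df$mask)) - 1

naive_asn_lookup <- function(ip) {
  ip_num <- ip_to_numeric(ip)
  hit <- which(cidr_start <= ip_num & ip_num <= cidr_end)  # scans all ~647K ranges
  if (length(hit) == 0) return(NA_integer_)
  # if several CIDRs contain the address, take the most-specific (longest mask)
  rip_df$asn[hit[which.max(as.integer(rip_df$mask[hit]))]]
}

Fine for one address; painful for a billion.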

By storing each CIDR’s binary prefix — the first `mask` bits of the network address, where `mask` is the number after the `/`, as a string of 1’s and 0’s — as keys for the trie, we get much faster lookups (only a few comparisons at worst-case vs 647,563).

I made an initial, naïve, mostly-straight-R implementation as a precursor to a more low-level implementation in Rcpp in our `iptools` package and to illustrate this use of the `triebeard` package.

One thing we’ll need is a function to convert an IPv4 address (in long integer form) into a binary character string. We _could_ do this with base R, but it’ll be super-slow and it doesn’t take much effort to create it with an Rcpp inline function:

library(Rcpp)
library(inline)

ip_to_binary_string <- rcpp(signature(x="integer"), "
  NumericVector xx(x);

  // copy the R vector into a std::vector of doubles (IPv4 integers can
  // exceed R's 32-bit signed ints, so they arrive as doubles)
  std::vector<double> X(xx.begin(),xx.end());
  std::vector<std::string> output(X.size());

  for (unsigned int i=0; i<X.size(); i++){

    if ((i % 10000) == 0) Rcpp::checkUserInterrupt();

    // std::bitset<32> renders each address as a 32-character 0/1 string
    output[i] = std::bitset<32>(X[i]).to_string();

  }

  return(Rcpp::wrap(output));
")

ip_to_binary_string(ip_to_numeric("192.168.1.1"))
## [1] "11000000101010000000000100000001"

We take a vector from R and use some C++ standard library functions to convert each element to a bit string. I vectorized this in C++ for speed (which is just a fancy way to say I used a `for` loop). In this case, our short cut will not make for a long delay.

Now, we’ll need a CIDR file. There are [historical ones](http://data.4tu.nl/repository/uuid:d4d23b8e-2077-4592-8b47-cb476ad16e12) available, and I use one that I generated the day of this post (and, referenced in the code block below). You can use [`pyasn`](https://github.com/hadiasghari/pyasn) to make new ones daily (relegating mindless, automated, menial data retrieval tasks to the python goblins, like one should).

library(iptools)
library(stringi)
library(dplyr)
library(purrr)
library(readr)
library(tidyr)

asn_dat_url <- "http://rud.is/dl/asn-20160712.1600.dat.gz"
asn_dat_fil <- basename(asn_dat_url)
if (!file.exists(asn_dat_fil)) download.file(asn_dat_url, asn_dat_fil)

rip <- read_tsv(asn_dat_fil, comment=";", col_names=c("cidr", "asn"))
rip %>%
  separate(cidr, c("ip", "mask"), "/") %>%
  mutate(prefix=stri_sub(ip_to_binary_string(ip_to_numeric(ip)), 1, mask)) -> rip_df

rip_df
## # A tibble: 647,557 x 4
##           ip  mask   asn                   prefix
##        <chr> <chr> <int>                    <chr>
## 1    1.0.0.0    24 47872 000000010000000000000000
## 2    1.0.4.0    24 56203 000000010000000000000100
## 3    1.0.5.0    24 56203 000000010000000000000101
## 4    1.0.6.0    24 56203 000000010000000000000110
## 5    1.0.7.0    24 38803 000000010000000000000111
## 6   1.0.48.0    20 49597     00000001000000000011
## 7   1.0.64.0    18 18144       000000010000000001
## 8  1.0.128.0    17  9737        00000001000000001
## 9  1.0.128.0    18  9737       000000010000000010
## 10 1.0.128.0    19  9737      0000000100000000100
## # ... with 647,547 more rows

You can save off that `data_frame` to an R data file to pull in later (but it’s pretty fast to regenerate).
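
Something like this (file name is arbitrary):

# cache the prefix table for later sessions
saveRDS(rip_df, "rip_df.rds")
# rip_df <- readRDS("rip_df.rds")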

Now, we create the trie, using the prefix we calculated and a value we’ll piece together for this example:

library(triebeard)

rip_trie <- trie(rip_df$prefix, sprintf("%s/%s|%s", rip_df$ip, rip_df$mask, rip_df$asn))

Yep, that’s it. If you ran this yourself, it should have taken less than 2s on most modern systems to create the nigh 700,000 element trie.
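
Before throwing a million addresses at it, a quick spot check on a single, familiar IP (the matched CIDR/ASN will depend on the day’s snapshot, so no output shown):

longest_match(rip_trie, ip_to_binary_string(ip_to_numeric("8.8.8.8")))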

Now, we’ll generate a million random IP addresses and look them up:

set.seed(1492)
data_frame(lkp=ip_random(1000000),
           lkp_bin=ip_to_binary_string(ip_to_numeric(lkp)),
           long=longest_match(rip_trie, lkp_bin)) -> lkp_df

lkp_df
## # A tibble: 1,000,000 x 3
##               lkp                          lkp_bin                long
##             <chr>                            <chr>               <chr>
## 1   35.251.195.57 00100011111110111100001100111001  35.248.0.0/13|4323
## 2     28.57.78.42 00011100001110010100111000101010                <NA>
## 3   24.60.146.202 00011000001111001001001011001010   24.60.0.0/14|7922
## 4    14.236.36.53 00001110111011000010010000110101                <NA>
## 5   7.146.253.182 00000111100100101111110110110110                <NA>
## 6     2.9.228.172 00000010000010011110010010101100     2.9.0.0/16|3215
## 7  108.111.124.79 01101100011011110111110001001111 108.111.0.0/16|3651
## 8    65.78.24.214 01000001010011100001100011010110   65.78.0.0/19|6079
## 9   50.48.151.239 00110010001100001001011111101111   50.48.0.0/13|5650
## 10  97.231.13.131 01100001111001110000110110000011   97.128.0.0/9|6167
## # ... with 999,990 more rows

On most modern systems, that should have taken less than 3s.

The `NA` values are not busted lookups. Many IP networks are assigned but not accessible (see [this](https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_address_blocks) for more info). You can validate this with `cymruservices::bulk_origin()` on your own, too.
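
In sketch form (assumes network access, since `bulk_origin()` queries Team Cymru’s service):

library(cymruservices)

# ask Team Cymru about a few of the addresses the trie couldn't place
bulk_origin(head(lkp_df$lkp[is.na(lkp_df$long)], 5))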

The trie structure for these CIDRs takes up approximately 9MB of RAM, a small price to pay for speedy lookups (and, memory really is not what the heart desires, anyway). Hopefully the `triebeard` package will help you speed up your own lookups and stay-tuned for a new version of `iptools` with some new and enhanced functions.

It’s no seekrit that I :heart: Hilbert curve heatmaps of IPv4 space. Real-world IPv4 maps (i.e. the ones that drop dots on the Earth) have little utility, but with Hilbert curves maps of IPv4 space many different topologies can be superimposed (from ASNs to—if need be—geographic locations). Plus, there’s more opportunity to find patterns by keeping CIDRs naturally close to each other.

The Measurement Factory created the [`ipv4-heatmap`](http://maps.measurement-factory.com/) command-line utility back in 2007 and there have been some tweaks and expansions to it by others over time. I wanted to use these IPv4 heatmaps in the [National Exposure](https://community.rapid7.com/community/infosec/blog/2016/06/07/rapid7-releases-new-research) report I worked on with @todb & @jhartftw at @Rapid7 but _cannot stand_ the built-in red-blue color scheme, especially when there’s [viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) available. So, I [forked the code](https://github.com/hrbrmstr/ipv4-heatmap) and added both viridis and [colorbrewer](http://colorbrewer2.org) palettes to it as command-line options.

Here are two examples (the results of the National Exposure study), one using viridis and one using the colorbrewer `rdbu` palette:

![](https://www.dropbox.com/s/we5f5u7ejj7cp8l/viridis.png?raw=1)

![](https://www.dropbox.com/s/uznijxxq99qoces/rdbu-inverted.png?raw=1)

You specify the palette with `-P palette` and can invert the order of any palette with `-i`; the chosen palette will also be used in any legend you add to the visualization.
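
For example — a hypothetical invocation, wrapped in `system()` to keep things in R; the `-P`/`-i` flags are from the fork, the file name is a placeholder, and input/output conventions follow the tool’s own docs:

# render a heatmap with an inverted viridis palette
system("ipv4-heatmap -P viridis -i < ip_addresses.txt")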

Since these 4096×4096 files are a bit big, you can hit up [this dropbox link](https://www.dropbox.com/sh/wqyly8ewxeko5jn/AAC5bHIpQTuxWGBPYzMqceLQa?dl=0) to see a “gallery” of the various forward and reverse palettes.

The palette selection code is a bit brute-force at the moment, mostly due to the fact that I’m planning on a C++ port of the code and eventual inclusion of the Hilbert heatmap functionality in the [`iptools`](http://github.com/hrbrmstr/iptools) package.