The insanely productive elf-lord, @quominus put together a small package ([`triebeard`](https://github.com/ironholds/triebeard)) that exposes an API for [radix/prefix tries](https://en.wikipedia.org/wiki/Trie) at both the R and Rcpp levels. I know he had some personal needs for this and we both kinda need these to augment some functions in our `iptools` package. Despite `triebeard` having both a vignette and function-level examples, I thought it might be good to show a real-world use of the package (at least in the cyber real world): fast determination of which [autonomous system](https://en.wikipedia.org/wiki/Autonomous_system_(Internet)) an IPv4 address is in (if it’s in one at all).
I’m not going to delve to deep into routing (you can find a good primer [here](http://www.kixtart.org/forums/ubbthreads.php?ubb=showflat&Number=81619&site_id=1#import) and one that puts routing in the context of radix tries [here](http://www.juniper.net/documentation/en_US/junos14.1/topics/usage-guidelines/policy-configuring-route-lists-for-use-in-routing-policy-match-conditions.html)) but there exists, essentially, abbreviated tables of which IP addresses belong to a particular network. These tables are in routers on your local networks and across the internet. Groups of these networks (on the internet) are composed into those autonomous systems I mentioned earlier and these tables are used to get the packets that make up the cat videos you watch routed to you as efficiently as possible.
When dealing with cybersecurity data science, it’s often useful to know which autonomous system an IP address belongs in. The world is indeed full of peril and in it there are many dark places. It’s a dangerous business, going out on the internet and we sometimes find it possible to identify unusually malicious autonomous systems by looking up suspicious IP addresses en masse. These mappings look something like this:
CIDR ASN
1.0.0.0/24 47872
1.0.4.0/24 56203
1.0.5.0/24 56203
1.0.6.0/24 56203
1.0.7.0/24 38803
1.0.48.0/20 49597
1.0.64.0/18 18144
Each CIDR has a start and end IP address which can ultimately be converted to integers. Now, one _could_ just sequentially compare start and end ranges to see which CIDR an IP address belongs in, but there are (as of the day of this post) `647,563` CIDRs to compare against, which—in the worst case—would mean having to traverse through the entire list to find the match (or discover there is no match). There are some trivial ways to slightly optimize this, but the search times could still be fairly long, especially when you’re trying to match a billion IPv4 addresses to ASNs.
By storing the CIDR mask (the number of bits of the leading IP address specified after the `/`) in binary form (strings of 1’s and 0’s) as keys for the trie, we get much faster lookups (only a few comparisons at worst-case vs 647,563).
I made an initial, naïve, mostly straight R, implementation as a precursor to a more low-level implementation in Rcpp in our `iptools` package and to illustrate this use of the `triebeard` package.
One thing we’ll need is a function to convert an IPv4 address (in long integer form) into a binary character string. We _could_ do this with base R, but it’ll be super-slow and it doesn’t take much effort to create it with an Rcpp inline function:
library(Rcpp)
library(inline)
ip_to_binary_string <- rcpp(signature(x="integer"), "
NumericVector xx(x);
std::vector<double> X(xx.begin(),xx.end());
std::vector<std::string> output(X.size());
for (unsigned int i=0; i<X.size(); i++){
if ((i % 10000) == 0) Rcpp::checkUserInterrupt();
output[i] = std::bitset<32>(X[i]).to_string();
}
return(Rcpp::wrap(output));
")
ip_to_binary_string(ip_to_numeric("192.168.1.1"))
## [1] "11000000101010000000000100000001"
We take a vector from R and use some C++ standard library functions to convert them to bits. I vectorized this in C++ for speed (which is just a fancy way to say I used a `for` loop). In this case, our short cut will not make for a long delay.
Now, we’ll need a CIDR file. There are [historical ones](http://data.4tu.nl/repository/uuid:d4d23b8e-2077-4592-8b47-cb476ad16e12) avaialble, and I use one that I generated the day of this post (and, referenced in the code block below). You can use [`pyasn`](https://github.com/hadiasghari/pyasn) to make new ones daily (relegating mindless, automated, menial data retrieval tasks to the python goblins, like one should).
library(iptools)
library(stringi)
library(dplyr)
library(purrr)
library(readr)
library(tidyr)
asn_dat_url <- "http://rud.is/dl/asn-20160712.1600.dat.gz"
asn_dat_fil <- basename(asn_dat_url)
if (!file.exists(asn_dat_fil)) download.file(asn_dat_url, asn_dat_fil)
rip <- read_tsv(asn_dat_fil, comment=";", col_names=c("cidr", "asn"))
rip %>%
separate(cidr, c("ip", "mask"), "/") %>%
mutate(prefix=stri_sub(ip_to_binary_string(ip_to_numeric(ip)), 1, mask)) -> rip_df
rip_df
## # A tibble: 647,557 x 4
## ip mask asn prefix
## <chr> <chr> <int> <chr>
## 1 1.0.0.0 24 47872 000000010000000000000000
## 2 1.0.4.0 24 56203 000000010000000000000100
## 3 1.0.5.0 24 56203 000000010000000000000101
## 4 1.0.6.0 24 56203 000000010000000000000110
## 5 1.0.7.0 24 38803 000000010000000000000111
## 6 1.0.48.0 20 49597 00000001000000000011
## 7 1.0.64.0 18 18144 000000010000000001
## 8 1.0.128.0 17 9737 00000001000000001
## 9 1.0.128.0 18 9737 000000010000000010
## 10 1.0.128.0 19 9737 0000000100000000100
## # ... with 647,547 more rows
You can save off that `data_frame` to an R data file to pull in later (but it’s pretty fast to regenerate).
Now, we create the trie, using the prefix we calculated and a value we’ll piece together for this example:
library(triebeard)
rip_trie <- trie(rip_df$prefix, sprintf("%s/%s|%s", rip_df$ip, rip_df$mask, rip_df$asn))
Yep, that’s it. If you ran this yourself, it should have taken less than 2s on most modern systems to create the nigh 700,000 element trie.
Now, we’ll generate a million random IP addresses and look them up:
set.seed(1492)
data_frame(lkp=ip_random(1000000),
lkp_bin=ip_to_binary_string(ip_to_numeric(lkp)),
long=longest_match(rip_trie, lkp_bin)) -> lkp_df
lkp_df
## # A tibble: 1,000,000 x 3
## lkp lkp_bin long
## <chr> <chr> <chr>
## 1 35.251.195.57 00100011111110111100001100111001 35.248.0.0/13|4323
## 2 28.57.78.42 00011100001110010100111000101010 <NA>
## 3 24.60.146.202 00011000001111001001001011001010 24.60.0.0/14|7922
## 4 14.236.36.53 00001110111011000010010000110101 <NA>
## 5 7.146.253.182 00000111100100101111110110110110 <NA>
## 6 2.9.228.172 00000010000010011110010010101100 2.9.0.0/16|3215
## 7 108.111.124.79 01101100011011110111110001001111 108.111.0.0/16|3651
## 8 65.78.24.214 01000001010011100001100011010110 65.78.0.0/19|6079
## 9 50.48.151.239 00110010001100001001011111101111 50.48.0.0/13|5650
## 10 97.231.13.131 01100001111001110000110110000011 97.128.0.0/9|6167
## # ... with 999,990 more rows
On most modern systems, that should have taken less than 3s.
The `NA` values are not busted lookups. Many IP networks are assigned but not accessible (see [this](https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_address_blocks) for more info). You can validate this with `cymruservices::bulk_origin()` on your own, too).
The trie structure for these CIDRs takes up approximately 9MB of RAM, a small price to pay for speedy lookups (and, memory really is not what the heart desires, anyway). Hopefully the `triebeard` package will help you speed up your own lookups and stay-tuned for a new version of `iptools` with some new and enhanced functions.
2 Comments
Any news or views on extending triebeard to IPv6
Unfortunately, radix trie performance degrades linearly with trie depth (and R’s support for super large integers is super meh, which wld be needed to make this performant, though it cld be masked using
Xptrs
). However, I can def extend this to IPv6 for smaller trie lookups. Can you file an issue on the GitHub repo foriptools
since that’s where I’m implementing the official IPv4 CIDR lookup code.2 Trackbacks/Pingbacks
[…] The insanely productive elf-lord, @quominus put together a small package (triebeard) that exposes an API for radix/prefix tries at both the R and Rcpp levels. I know he had some personal needs for this and we both kinda need these to augment some functions in our iptools package. Despite triebeard having both a vignette and… Continue reading → […]
[…] article was first published on R – rud.is, and kindly contributed to […]