The crazy/kind folks over at PacketTotal were generous enough to slip me an API key, and long-time readers of the blog know what that means: a new package!
If you have a non-compliance-focused job in information security, chances are you have come across, or needed to generate, packet captures of network traffic to chase down a situation. PacketTotal seems to be aiming to aggregate and socialize the analysis of packet captures in much the same way VirusTotal does for files/binaries.
PCAPs are a bit trickier than what VirusTotal handles since they may contain sensitive organizational data (at the very least, private addressing schemes), but I suspect they’re working on some sanitization tools to make sharing them easier, and they’re also doing a decent job at ensuring they don’t log the IP address (or any other identifying data) of the uploader.
Their online exploratory interface is fairly robust but by providing an API they make it possible for one to go beyond such an interface and enhance a dynamic investigation on-the-fly while keeping a record of analysis flow and artifacts.
We won’t be doing that in this post since it is just an introductory “this is how the site/package works” post, but once they round out some corners we may delve into a full (faux) investigation and perhaps write our own investigations UX with Shiny.
Onwards!
I kept the dependencies pretty thin, so the extra library() calls I’m putting in here are mostly for analysis & visualization support. Let’s get them out of the way:
library(zip)
library(DT)
library(packettotal)
library(lubridate)
library(hrbrthemes)
library(tidyverse)
Now, let’s look for Emotet, which is a nasty piece of malware your organization has likely been hit with multiple times by now. To do that, we need to issue a query on the “deep search” endpoint:
Now, we get those results and take a look:
emo_res <- pt_get_search_results(es)
head(emo_res$results, 10)
##                                  id found_in match_score
## 1  5b4eb1fc54db6761bb42385d1ac52b8a    intel   173.38253
## 2  6fa8327a1ad7cc6870bca37400fb8832    intel   160.51707
## 3  d210f4dbea97949f694e849507951881    intel   112.51304
## 4  ef339f6d707ccb79a60a2127f5598833    intel   101.12901
## 5  02a2ba3eae826d586e15ba5bccaa1963    intel    96.59643
## 6  e24cca273dff3846348a5273ec8aaeb0    intel    86.88573
## 7  cfac071cc15a9ade09d801d2470560b1    intel    77.97037
## 8  2711f4d6f06ac45d9b0cba732ec3c3c5    intel    63.46781
## 9  7876e609c3f0050791edbb4d37e53995    intel    63.40784
## 10 dcd4b8746dfb272448a9b46712ecc8fb    intel    62.05376
Let’s get even more detail:
and, see what’s in the summary:
str(emo_det$analysis_summary, 1)
## List of 9
## $ top_talkers :List of 2
## $ connection_statistics:List of 9
## $ dns_statistics :List of 2
## $ file_statistics :List of 3
## $ signatures : chr [1:3] "ET POLICY HTTP traffic on port 443 (POST)" "ET POLICY Office Document Download Containing AutoOpen Macro" "ET POLICY PE EXE or DLL Windows file download HTTP"
## $ external_references :'data.frame': 15 obs. of 2 variables:
## $ malicious_traffic : logi FALSE
## $ accuracy : chr "perfect"
## $ http_statistics :List of 3
Who are the top talkers (the IP addresses with the most connections)?
str(emo_det$analysis_summary$top_talkers)
## List of 2
## $ source_ips :List of 2
## ..$ 10.6.19.101 : chr "98.6%"
## ..$ 209.64.82.90: chr "1.4%"
## $ destination_ips:List of 32
## ..$ 5.187.0.158 : chr "24.6%"
## ..$ 10.6.19.1 : chr "20.3%"
## ..$ 128.100.126.113: chr "5.1%"
## ..$ 121.50.43.110 : chr "4.3%"
## ..$ 216.105.170.139: chr "4.3%"
## ..$ 46.38.238.8 : chr "4.3%"
## ..$ 178.62.103.94 : chr "4.3%"
## ..$ 78.47.182.42 : chr "4.3%"
## ..$ 184.186.78.177 : chr "4.3%"
## ..$ 194.88.246.242 : chr "4.3%"
## ..$ 209.64.82.90 : chr "2.9%"
## ..$ 46.4.100.178 : chr "1.4%"
## ..$ 10.6.19.101 : chr "1.4%"
## ..$ 111.118.212.86 : chr "0.7%"
## ..$ 108.167.184.253: chr "0.7%"
## ..$ 173.70.47.89 : chr "0.7%"
## ..$ 47.188.131.94 : chr "0.7%"
## ..$ 76.72.225.30 : chr "0.7%"
## ..$ 80.153.201.243 : chr "0.7%"
## ..$ 216.21.168.27 : chr "0.7%"
## ..$ 164.160.161.118: chr "0.7%"
## ..$ 23.239.2.11 : chr "0.7%"
## ..$ 66.76.26.33 : chr "0.7%"
## ..$ 24.217.117.217 : chr "0.7%"
## ..$ 75.152.52.109 : chr "0.7%"
## ..$ 217.91.43.150 : chr "0.7%"
## ..$ 189.236.94.20 : chr "0.7%"
## ..$ 24.119.116.230 : chr "0.7%"
## ..$ 96.94.189.130 : chr "0.7%"
## ..$ 72.45.212.62 : chr "0.7%"
## ..$ 70.184.125.132 : chr "0.7%"
## ..$ 206.255.140.203: chr "0.7%"
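Since those shares come back as percent strings, it can help to turn them into numbers before sorting or plotting. Here is a minimal base R sketch, assuming a list shaped like the `top_talkers` output above. Note that `pct_to_num()` and `talkers_df()` are hypothetical helper names (they are not part of the packettotal package), and the stand-in list only copies a handful of values from that output:

```r
# Stand-in for emo_det$analysis_summary$top_talkers; values copied from the
# output above, truncated to three destinations for brevity
top_talkers <- list(
  source_ips = list(`10.6.19.101` = "98.6%", `209.64.82.90` = "1.4%"),
  destination_ips = list(
    `5.187.0.158` = "24.6%",
    `10.6.19.1` = "20.3%",
    `128.100.126.113` = "5.1%"
  )
)

# Hypothetical helper: "24.6%" -> 0.246
pct_to_num <- function(x) as.numeric(sub("%$", "", x)) / 100

# Hypothetical helper: named list of percent strings -> data frame
talkers_df <- function(x) {
  data.frame(
    ip = names(x),
    share = pct_to_num(unlist(x, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

src <- talkers_df(top_talkers$source_ips)
dst <- talkers_df(top_talkers$destination_ips)
dst[order(-dst$share), ]
```

In a live session the same helper should work on the real thing, e.g. `talkers_df(emo_det$analysis_summary$top_talkers$destination_ips)`, as long as the API keeps returning percent strings.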
Let’s use ipinfo.io to see some extra detail on that main one:
ip_5.187.0.158 <- ipinfo::query_ip("5.187.0.158")
str(ip_5.187.0.158)
## List of 8
## $ ip : chr "5.187.0.158"
## $ hostname: chr "kvmde60-6542-1.fornex.org"
## $ city : chr "Frankfurt am Main"
## $ region : chr "Hesse"
## $ country : chr "DE"
## $ loc : chr "50.1155,8.6842"
## $ postal : chr "60313"
## $ org : chr "AS44066 First Colo GmbH"
We can also look up various stats (the API is going to return these percentage strings as real numbers soon):
str(emo_det$analysis_summary$dns_statistics)
## List of 2
## $ queries :List of 4
## ..$ percalabia.com : chr "89.3%"
## ..$ www.grampotchayatportal.club: chr "3.6%"
## ..$ www.chamberstimber.com : chr "3.6%"
## ..$ urnachay.com : chr "3.6%"
## $ record_types:List of 1
## ..$ A: chr "100.0%"
str(emo_det$analysis_summary$file_statistics)
## List of 3
## $ mime_types :List of 5
## ..$ application/pkix-cert: chr "54.0%"
## ..$ null : chr "40.0%"
## ..$ application/msword : chr "2.0%"
## ..$ application/x-dosexec: chr "2.0%"
## ..$ text/plain : chr "2.0%"
## $ sources :List of 2
## ..$ SSL : chr "54.0%"
## ..$ HTTP: chr "46.0%"
## $ executables:List of 3
## ..$ operating_systems :List of 1
## .. ..$ Windows 2000: chr "100.0%"
## ..$ compile_cpu_architectures:List of 1
## .. ..$ I386: chr "100.0%"
## ..$ assembly_sections :List of 1
## .. ..$ .text,.rdata,.data,.pdata: chr "100.0%"
So, we get FQDNs, files, DNS queries and more. We can also just get every bit of data PacketTotal could squeeze out of the PCAP by downloading an “analysis” archive:
We’ll unpack it and take a look:
unzip(dl, exdir = "~/Data/5b4eb1fc54db6761bb42385d1ac52b8a")
list.files("~/Data/5b4eb1fc54db6761bb42385d1ac52b8a")
## [1] "5b4eb1fc54db6761bb42385d1ac52b8a.pcap"
## [2] "artifacts"
## [3] "conn.csv"
## [4] "dns.csv"
## [5] "files.csv"
## [6] "http.csv"
## [7] "intel.csv"
## [8] "notice.csv"
## [9] "pe.csv"
## [10] "signature_alerts.csv"
## [11] "ssl.csv"
## [12] "weird.csv"
## [13] "x509.csv"
We won’t explore all of these in this post, but conn.csv holds the Zeek connection logs (Zeek was formerly, ugh, ‘Bro’, which was short for ‘Big Brother’ b/c it was snooping on your packets, but still…). That’s something I’m super familiar with given that we generate tens of thousands of them every day at $WORK in our massive honeypot network, so let’s poke at it:
read_csv("~/Data/5b4eb1fc54db6761bb42385d1ac52b8a/conn.csv", na = c("null", "")) %>%
  janitor::clean_names() -> conns
glimpse(conns)
## Observations: 138
## Variables: 19
## $ timestamp <dttm> 2018-06-19 17:31:49, 2018-06-19 17:…
## $ connection_id <chr> "C1YfB548DCKokZA2Ff", "Ckx0wv1xHPTm5…
## $ sender_ip <chr> "10.6.19.101", "10.6.19.101", "10.6.…
## $ sender_port <dbl> 64311, 49182, 58722, 49183, 49184, 4…
## $ target_ip <chr> "10.6.19.1", "111.118.212.86", "10.6…
## $ target_port <dbl> 53, 80, 53, 80, 80, 443, 53, 443, 44…
## $ transport_protocol <chr> "udp", "tcp", "udp", "tcp", "tcp", "…
## $ service <chr> "dns", "http", "dns", "http", "http"…
## $ duration <dbl> 0.032419, 8.070083, 0.033392, 1.2266…
## $ payload_bytes_sent <dbl> 46, 287, 40, 78, 2028, 738, 30, 1098…
## $ total_bytes_sent <dbl> 74, 2819, 68, 2010, 43360, 1070, 58,…
## $ payload_bytes_received <dbl> 76, 110600, 70, 127435, 1414967, 472…
## $ total_bytes_received <dbl> 104, 113284, 98, 130959, 1456531, 75…
## $ missed_bytes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ packets_sent <dbl> 1, 63, 1, 48, 1033, 8, 1, 8, 80, 32,…
## $ packets_received <dbl> 1, 67, 1, 88, 1039, 7, 1, 10, 153, 5…
## $ originated_locally <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ tunnel_parent_connection_id <chr> "(empty)", "(empty)", "(empty)", "(e…
## $ history <chr> "Dd", "ShADadR", "Dd", "ShADadR", "S…
(They’re also fixing the un-friendly-for-data science column names.)
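The `history` column is the most cryptic of the bunch. In Zeek’s connection log, each letter records a kind of packet seen on the connection: upper case when it came from the originator, lower case when it came from the responder. Below is a rough decoder for the most common letters. `decode_history()` is a hypothetical helper (nothing the package provides), and the lookup table is deliberately incomplete, so anything it does not know about is labelled “other”:

```r
# Partial lookup of Zeek conn-log history letters (keyed in lower case)
zeek_flags <- c(
  s = "SYN",
  h = "SYN-ACK",
  a = "ACK",
  d = "data",
  f = "FIN",
  r = "RST"
)

# Hypothetical helper: turn one history string into readable steps
decode_history <- function(history) {
  chars <- strsplit(history, "")[[1]]
  # Upper case means the originator sent it, lower case the responder.
  # Non-letters (e.g. "^") have no case and fall into the originator branch.
  side <- ifelse(chars == toupper(chars), "originator", "responder")
  flag <- unname(zeek_flags[tolower(chars)])
  flag[is.na(flag)] <- "other"
  paste(side, flag)
}

decode_history("ShADadR")
```

Read that way, `"Dd"` on the DNS rows is just one payload packet in each direction, while `"ShADadR"` is a handshake, some data both ways, and a reset from the originator.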
Lots of info about the connections, and we can make our own exploratory interface for them pretty easily:
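One low-effort way to do that is an interactive table via DT, which was loaded earlier. The following is only a sketch: `conns_demo` is a two-row stand-in built from the first values in the `glimpse()` output (so it runs without the downloaded archive), and the `filter` and `pageLength` settings are just one reasonable choice:

```r
# Stand-in for `conns`: the first two rows shown in the glimpse() output
conns_demo <- data.frame(
  sender_ip = c("10.6.19.101", "10.6.19.101"),
  target_ip = c("10.6.19.1", "111.118.212.86"),
  target_port = c(53, 80),
  transport_protocol = c("udp", "tcp"),
  service = c("dns", "http"),
  stringsAsFactors = FALSE
)

# Build the widget only if DT is installed; print `conn_table` to render it
if (requireNamespace("DT", quietly = TRUE)) {
  conn_table <- DT::datatable(
    conns_demo,
    filter = "top",                  # per-column search boxes
    rownames = FALSE,
    options = list(pageLength = 25)  # rows shown per page
  )
}
```

Swapping `conns_demo` for the real `conns` gives a searchable, sortable view of all 138 connections.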
But, we can also attack it with the tidyverse:
count(conns, target_port, service, sort=TRUE)
## # A tibble: 18 x 3
##    target_port service     n
##          <dbl> <chr>   <int>
##  1         443 ssl        34
##  2          53 dns        28
##  3        8080 <NA>       24
##  4          80 http       11
##  5        8080 http        8
##  6         443 <NA>        7
##  7          80 <NA>        6
##  8        4143 <NA>        6
##  9         465 <NA>        4
## 10         443 http        2
## 11          22 <NA>        1
## 12          53 <NA>        1
## 13         465 http        1
## 14         990 <NA>        1
## 15         995 http        1
## 16        7080 <NA>        1
## 17       49254 <NA>        1
## 18       49255 <NA>        1
count(conns, sender_ip, sort=TRUE)
## # A tibble: 2 x 2
##   sender_ip        n
##   <chr>        <int>
## 1 10.6.19.101    136
## 2 209.64.82.90     2
count(conns, target_ip, sort=TRUE)
## # A tibble: 32 x 2
##    target_ip           n
##    <chr>           <int>
##  1 5.187.0.158        34
##  2 10.6.19.1          28
##  3 128.100.126.113     7
##  4 121.50.43.110       6
##  5 178.62.103.94       6
##  6 184.186.78.177      6
##  7 194.88.246.242      6
##  8 216.105.170.139     6
##  9 46.38.238.8         6
## 10 78.47.182.42        6
## # … with 22 more rows
mutate(conns, sec = floor_date(timestamp, "minute")) %>%
  count(sec, transport_protocol) %>%
  ggplot(aes(sec, n)) +
  geom_line() +
  facet_wrap(~transport_protocol) +
  labs(title = "Total Connections-per-minute by Protocol") +
  theme_ft_rc(grid="XY")
select(conns, payload_bytes_sent, payload_bytes_received) %>%
  gather(measure, value) %>%
  mutate(value = as.numeric(value)) %>%
  ggplot(aes(value)) +
  ggalt::geom_bkde(fill = alpha(ft_cols$gray, 1/3)) +
  scale_x_log10(labels = scales::comma) +
  labs(title = "Payload metadata distributions", subtitle = "Note: Log10 Scale") +
  facet_wrap(~measure) +
  theme_ft_rc(grid="XY")
We can even see any threat intelligence they were able to enrich the data with:
read_csv("~/Data/5b4eb1fc54db6761bb42385d1ac52b8a/intel.csv", na = c("null", "")) %>%
  janitor::clean_names() %>%
  DT::datatable()
We can also look for similar PCAPs:
sim <- pt_similar("5b4eb1fc54db6761bb42385d1ac52b8a")
str(sim$similar$results, 1)
## 'data.frame': 177 obs. of 4 variables:
## $ id : chr "920aa84ebf5532bb1c0e17b776fcfa67" "8f9661560a603aafa8f952910b9ca498" "24d62ad190dd0d49ed4d4135652188a0" "7876e609c3f0050791edbb4d37e53995" ...
## $ match_score : int 25 25 20 20 20 20 20 20 20 20 ...
## $ common_terms: int 3 3 2 2 2 2 2 2 2 2 ...
## $ matches :List of 177
This is where the power of the API would really come in handy as we collect all this information and start to look for correlations, time series patterns (or anomalies) and possibly extract features to help build models to detect various types of malicious traffic.
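As a small taste of what “extract features” could mean here, the sketch below rolls connection records up to one row per destination IP. It is a base R illustration under loud assumptions: the column names match the cleaned `conns` data frame, but the rows in `conns_demo` are made-up toy values (not taken from the PCAP), and the chosen features are just plausible starting points:

```r
# Toy stand-in for `conns`; the numbers are invented for illustration only
conns_demo <- data.frame(
  target_ip = c("5.187.0.158", "5.187.0.158", "10.6.19.1", "10.6.19.1"),
  duration = c(1.5, 2.5, 0.03, 0.03),
  payload_bytes_sent = c(700, 1100, 46, 40),
  payload_bytes_received = c(4700, 900, 76, 70),
  stringsAsFactors = FALSE
)

# A column of ones lets sum() double as a connection counter
conns_demo$n_conns <- 1

# One row per destination: connection count, byte totals, total duration.
# Note: the formula interface silently drops rows containing NA values.
features <- aggregate(
  cbind(n_conns, payload_bytes_sent, payload_bytes_received, duration) ~ target_ip,
  data = conns_demo,
  FUN = sum
)

# Download-heavy destinations get a ratio well above 1
features$recv_to_sent <-
  features$payload_bytes_received / features$payload_bytes_sent

features
```

Run over many PCAPs, a table like this is the kind of thing you could feed into clustering or a classifier.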
Visit the package page for information on how to install it and you can find it on SourceHut, GitLab or (ugh) GitHub.
Keep watching their service/API, since it’s only going to get better, and definitely toss up suggestions for package features or jump on in and file some PRs at your social coding hub of choice.