Working with the PacketTotal API in R

The crazy/kind folks over at PacketTotal were generous enough to slip me an API key, and long-time readers of the blog know what that means: a new package!

Using the PacketTotal API

I kept the dependencies pretty thin so the extra library() calls I’m putting in here are mostly for analysis & visualization support. Let’s get them out of the way:

library(zip)
library(DT)
library(packettotal)
library(lubridate)
library(hrbrthemes)
library(tidyverse)

Now, let’s look for Emotet, which is a nasty piece of malware your organization has likely been hit with multiple times by now. To do that, we need to issue a query on the “deep search” endpoint:

es <- pt_deep_search("emotet")

Now, we get those results and take a look:

emo_res <- pt_get_search_results(es)

head(emo_res$results, 10)
##                                  id found_in match_score
## 1  5b4eb1fc54db6761bb42385d1ac52b8a    intel   173.38253
## 2  6fa8327a1ad7cc6870bca37400fb8832    intel   160.51707
## 3  d210f4dbea97949f694e849507951881    intel   112.51304
## 4  ef339f6d707ccb79a60a2127f5598833    intel   101.12901
## 5  02a2ba3eae826d586e15ba5bccaa1963    intel    96.59643
## 6  e24cca273dff3846348a5273ec8aaeb0    intel    86.88573
## 7  cfac071cc15a9ade09d801d2470560b1    intel    77.97037
## 8  2711f4d6f06ac45d9b0cba732ec3c3c5    intel    63.46781
## 9  7876e609c3f0050791edbb4d37e53995    intel    63.40784
## 10 dcd4b8746dfb272448a9b46712ecc8fb    intel    62.05376

Let’s get even more detail:

emo_det <- pt_detail("5b4eb1fc54db6761bb42385d1ac52b8a")

and, see what’s in the summary:

str(emo_det$analysis_summary, 1)
## List of 9
##  $ top_talkers          :List of 2
##  $ connection_statistics:List of 9
##  $ dns_statistics       :List of 2
##  $ file_statistics      :List of 3
##  $ signatures           : chr [1:3] "ET POLICY HTTP traffic on port 443 (POST)" "ET POLICY Office Document Download Containing AutoOpen Macro" "ET POLICY PE EXE or DLL Windows file download HTTP"
##  $ external_references  :'data.frame':   15 obs. of  2 variables:
##  $ malicious_traffic    : logi FALSE
##  $ accuracy             : chr "perfect"
##  $ http_statistics      :List of 3
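
The detail call above handles one PCAP id at a time. If you want details for several of the search hits, something like the following rough sketch could work. Note that `fetch_details()`, its `fetch` and `pause` arguments, and the one-second pause are my own assumptions for illustration and are not part of the package; I haven't checked what rate limits (if any) the API enforces.

```r
# Hypothetical helper (not part of the packettotal package): fetch details
# for several PCAP ids, pausing between calls and skipping ids that error out.
fetch_details <- function(ids, fetch = packettotal::pt_detail, pause = 1) {
  out <- vector("list", length(ids))
  names(out) <- ids
  for (i in seq_along(ids)) {
    res <- tryCatch(fetch(ids[[i]]), error = function(e) NULL)
    out[i] <- list(res) # `out[[i]] <- NULL` would drop the slot instead
    if (i < length(ids)) Sys.sleep(pause) # assumed politeness delay
  }
  Filter(Negate(is.null), out) # keep only the ids that returned something
}

# Stand-in fetcher so the sketch runs without an API key
fake_fetch <- function(id) {
  if (id == "bad") stop("not found")
  list(id = id)
}

res <- fetch_details(c("a", "bad", "b"), fetch = fake_fetch, pause = 0)
names(res)
```

With a real key configured, the call would presumably look like `fetch_details(head(emo_res$results$id, 5))`.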

Who are the top talkers (the IP addresses with the most connections)?

str(emo_det$analysis_summary$top_talkers)
## List of 2
##  $ source_ips     :List of 2
##   ..$ 10.6.19.101 : chr "98.6%"
##   ..$ 209.64.82.90: chr "1.4%"
##  $ destination_ips:List of 32
##   ..$ 5.187.0.158    : chr "24.6%"
##   ..$ 10.6.19.1      : chr "20.3%"
##   ..$ 128.100.126.113: chr "5.1%"
##   ..$ 121.50.43.110  : chr "4.3%"
##   ..$ 216.105.170.139: chr "4.3%"
##   ..$ 46.38.238.8    : chr "4.3%"
##   ..$ 178.62.103.94  : chr "4.3%"
##   ..$ 78.47.182.42   : chr "4.3%"
##   ..$ 184.186.78.177 : chr "4.3%"
##   ..$ 194.88.246.242 : chr "4.3%"
##   ..$ 209.64.82.90   : chr "2.9%"
##   ..$ 46.4.100.178   : chr "1.4%"
##   ..$ 10.6.19.101    : chr "1.4%"
##   ..$ 111.118.212.86 : chr "0.7%"
##   ..$ 108.167.184.253: chr "0.7%"
##   ..$ 173.70.47.89   : chr "0.7%"
##   ..$ 47.188.131.94  : chr "0.7%"
##   ..$ 76.72.225.30   : chr "0.7%"
##   ..$ 80.153.201.243 : chr "0.7%"
##   ..$ 216.21.168.27  : chr "0.7%"
##   ..$ 164.160.161.118: chr "0.7%"
##   ..$ 23.239.2.11    : chr "0.7%"
##   ..$ 66.76.26.33    : chr "0.7%"
##   ..$ 24.217.117.217 : chr "0.7%"
##   ..$ 75.152.52.109  : chr "0.7%"
##   ..$ 217.91.43.150  : chr "0.7%"
##   ..$ 189.236.94.20  : chr "0.7%"
##   ..$ 24.119.116.230 : chr "0.7%"
##   ..$ 96.94.189.130  : chr "0.7%"
##   ..$ 72.45.212.62   : chr "0.7%"
##   ..$ 70.184.125.132 : chr "0.7%"
##   ..$ 206.255.140.203: chr "0.7%"

Let’s use ipinfo.io to see some extra detail on that main one:

ip_5.187.0.158 <- ipinfo::query_ip("5.187.0.158")

str(ip_5.187.0.158)
## List of 8
##  $ ip      : chr "5.187.0.158"
##  $ hostname: chr "kvmde60-6542-1.fornex.org"
##  $ city    : chr "Frankfurt am Main"
##  $ region  : chr "Hesse"
##  $ country : chr "DE"
##  $ loc     : chr "50.1155,8.6842"
##  $ postal  : chr "60313"
##  $ org     : chr "AS44066 First Colo GmbH"

We can also look up various stats (the API currently returns these percentages as strings, but they should become real numbers soon):

str(emo_det$analysis_summary$dns_statistics)
## List of 2
##  $ queries     :List of 4
##   ..$ percalabia.com              : chr "89.3%"
##   ..$ www.grampotchayatportal.club: chr "3.6%"
##   ..$ www.chamberstimber.com      : chr "3.6%"
##   ..$ urnachay.com                : chr "3.6%"
##  $ record_types:List of 1
##   ..$ A: chr "100.0%"

str(emo_det$analysis_summary$file_statistics)
## List of 3
##  $ mime_types :List of 5
##   ..$ application/pkix-cert: chr "54.0%"
##   ..$ null                 : chr "40.0%"
##   ..$ application/msword   : chr "2.0%"
##   ..$ application/x-dosexec: chr "2.0%"
##   ..$ text/plain           : chr "2.0%"
##  $ sources    :List of 2
##   ..$ SSL : chr "54.0%"
##   ..$ HTTP: chr "46.0%"
##  $ executables:List of 3
##   ..$ operating_systems        :List of 1
##   .. ..$ Windows 2000: chr "100.0%"
##   ..$ compile_cpu_architectures:List of 1
##   .. ..$ I386: chr "100.0%"
##   ..$ assembly_sections        :List of 1
##   .. ..$ .text,.rdata,.data,.pdata: chr "100.0%"
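
Until the API returns real numbers, the percentage strings need a little cleanup before you can sort or plot them. Here is a minimal base R sketch; `pct_list_to_df()` is a name I made up for this post, and the input values are copied from the `dns_statistics` output above.

```r
# Hypothetical helper: turn a named list of "12.3%" strings into a
# data frame with a numeric percentage column, sorted high to low.
pct_list_to_df <- function(x, name_col = "name") {
  vals <- unlist(x, use.names = TRUE)
  out <- data.frame(
    name = names(vals),
    pct = as.numeric(sub("%", "", vals, fixed = TRUE)),
    stringsAsFactors = FALSE
  )
  names(out)[1] <- name_col
  out <- out[order(-out$pct), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Values copied from the dns_statistics output above
queries <- list(
  "percalabia.com" = "89.3%",
  "www.grampotchayatportal.club" = "3.6%",
  "www.chamberstimber.com" = "3.6%",
  "urnachay.com" = "3.6%"
)

dns_df <- pct_list_to_df(queries, name_col = "query")
dns_df
```

In a live session the same call should work directly on `emo_det$analysis_summary$dns_statistics$queries` or on the `top_talkers` lists, since they have the same shape.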

So, we get FQDNs, files, DNS queries and more. We can also just get every bit of data PacketTotal could squeeze out of the PCAP by downloading an “analysis” archive:

dl <- pt_download("5b4eb1fc54db6761bb42385d1ac52b8a", dl_dir = "~/Data")

We’ll unpack it and take a look:

unzip(dl, exdir = "~/Data/5b4eb1fc54db6761bb42385d1ac52b8a")

list.files("~/Data/5b4eb1fc54db6761bb42385d1ac52b8a")
##  [1] "5b4eb1fc54db6761bb42385d1ac52b8a.pcap"
##  [2] "artifacts"                            
##  [3] "conn.csv"                             
##  [4] "dns.csv"                              
##  [5] "files.csv"                            
##  [6] "http.csv"                             
##  [7] "intel.csv"                            
##  [8] "notice.csv"                           
##  [9] "pe.csv"                               
## [10] "signature_alerts.csv"                 
## [11] "ssl.csv"                              
## [12] "weird.csv"                            
## [13] "x509.csv"

We won’t explore all of these in this post but conn.csv is the Zeek (formerly, ugh, ‘Bro’ — which was short for ‘Big Brother’ b/c it was snooping on your packets, but still…) connection logs. That’s something I’m super familiar with given that we generate tens of thousands of them every day at $WORK in our massive honeypot network, so let’s poke at it:

read_csv("~/Data/5b4eb1fc54db6761bb42385d1ac52b8a/conn.csv", na = c("null", "")) %>% 
  janitor::clean_names() -> conns

glimpse(conns)
## Observations: 138
## Variables: 19
## $ timestamp                   <dttm> 2018-06-19 17:31:49, 2018-06-19 17:…
## $ connection_id               <chr> "C1YfB548DCKokZA2Ff", "Ckx0wv1xHPTm5…
## $ sender_ip                   <chr> "10.6.19.101", "10.6.19.101", "10.6.…
## $ sender_port                 <dbl> 64311, 49182, 58722, 49183, 49184, 4…
## $ target_ip                   <chr> "10.6.19.1", "111.118.212.86", "10.6…
## $ target_port                 <dbl> 53, 80, 53, 80, 80, 443, 53, 443, 44…
## $ transport_protocol          <chr> "udp", "tcp", "udp", "tcp", "tcp", "…
## $ service                     <chr> "dns", "http", "dns", "http", "http"…
## $ duration                    <dbl> 0.032419, 8.070083, 0.033392, 1.2266…
## $ payload_bytes_sent          <dbl> 46, 287, 40, 78, 2028, 738, 30, 1098…
## $ total_bytes_sent            <dbl> 74, 2819, 68, 2010, 43360, 1070, 58,…
## $ payload_bytes_received      <dbl> 76, 110600, 70, 127435, 1414967, 472…
## $ total_bytes_received        <dbl> 104, 113284, 98, 130959, 1456531, 75…
## $ missed_bytes                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ packets_sent                <dbl> 1, 63, 1, 48, 1033, 8, 1, 8, 80, 32,…
## $ packets_received            <dbl> 1, 67, 1, 88, 1039, 7, 1, 10, 153, 5…
## $ originated_locally          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ tunnel_parent_connection_id <chr> "(empty)", "(empty)", "(empty)", "(e…
## $ history                     <chr> "Dd", "ShADadR", "Dd", "ShADadR", "S…

(They’re also fixing the un-friendly-for-data science column names.)
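
The `history` column deserves a quick aside since it looks like line noise. In Zeek's connection log, each letter records something seen on the wire, with upper-case letters coming from the originator and lower-case letters from the responder. The sketch below decodes the most common letters only; `decode_history()` and the lookup vector are my own illustrative names, and the less common flags in the Zeek docs (e.g. `c`, `t`, `^`) are left out, so they come back as `NA`.

```r
# Partial lookup of Zeek conn.log history letters (common ones only)
zeek_history_flags <- c(
  s = "SYN without ACK",
  h = "SYN+ACK",
  a = "pure ACK",
  d = "packet with payload",
  f = "FIN",
  r = "RST"
)

# Hypothetical helper: split a history string into one row per letter
decode_history <- function(history) {
  chars <- strsplit(history, "")[[1]]
  data.frame(
    flag = chars,
    # upper case = seen from the originator, lower case = from the responder
    side = ifelse(chars == toupper(chars), "originator", "responder"),
    meaning = unname(zeek_history_flags[tolower(chars)]),
    stringsAsFactors = FALSE
  )
}

hist_df <- decode_history("ShADadR")
hist_df
```

So `ShADadR` reads as a normal TCP handshake, data flowing in both directions, and the originator ending things with a reset.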

Lots of info about the connections, and we can make our own exploratory interface for them pretty easily:

DT::datatable(conns)

But, we can also attack it with the tidyverse:

count(conns, target_port, service, sort=TRUE)
## # A tibble: 18 x 3
##    target_port service     n
##          <dbl> <chr>   <int>
##  1         443 ssl        34
##  2          53 dns        28
##  3        8080 <NA>       24
##  4          80 http       11
##  5        8080 http        8
##  6         443 <NA>        7
##  7          80 <NA>        6
##  8        4143 <NA>        6
##  9         465 <NA>        4
## 10         443 http        2
## 11          22 <NA>        1
## 12          53 <NA>        1
## 13         465 http        1
## 14         990 <NA>        1
## 15         995 http        1
## 16        7080 <NA>        1
## 17       49254 <NA>        1
## 18       49255 <NA>        1

count(conns, sender_ip, sort=TRUE)
## # A tibble: 2 x 2
##   sender_ip        n
##   <chr>        <int>
## 1 10.6.19.101    136
## 2 209.64.82.90     2

count(conns, target_ip, sort=TRUE)
## # A tibble: 32 x 2
##    target_ip           n
##    <chr>           <int>
##  1 5.187.0.158        34
##  2 10.6.19.1          28
##  3 128.100.126.113     7
##  4 121.50.43.110       6
##  5 178.62.103.94       6
##  6 184.186.78.177      6
##  7 194.88.246.242      6
##  8 216.105.170.139     6
##  9 46.38.238.8         6
## 10 78.47.182.42        6
## # … with 22 more rows

mutate(conns, sec = floor_date(timestamp, "minute")) %>% 
  count(sec, transport_protocol) %>% 
  ggplot(aes(sec, n)) + 
  geom_line() +
  facet_wrap(~transport_protocol) +
  labs(title = "Total Connections-per-minute by Protocol") +
  theme_ft_rc(grid="XY")


select(conns, payload_bytes_sent, payload_bytes_received) %>% 
  gather(measure, value) %>% 
  mutate(value = as.numeric(value)) %>% 
  ggplot(aes(value)) +
  ggalt::geom_bkde(fill = alpha(ft_cols$gray, 1/3)) +
  scale_x_log10(labels = scales::comma) +
  labs(title = "Payload metadata distributions", subtitle = "Note: Log10 Scale") +
  facet_wrap(~measure) +
  theme_ft_rc(grid="XY")

We can even see any threat intelligence they were able to enrich the data with:

read_csv("~/Data/5b4eb1fc54db6761bb42385d1ac52b8a/intel.csv", na = c("null", "")) %>% 
  janitor::clean_names() %>% 
  DT::datatable()

We can also look for similar PCAPs:

sim <- pt_similar("5b4eb1fc54db6761bb42385d1ac52b8a")

str(sim$similar$results, 1)
## 'data.frame':    177 obs. of  4 variables:
##  $ id          : chr  "920aa84ebf5532bb1c0e17b776fcfa67" "8f9661560a603aafa8f952910b9ca498" "24d62ad190dd0d49ed4d4135652188a0" "7876e609c3f0050791edbb4d37e53995" ...
##  $ match_score : int  25 25 20 20 20 20 20 20 20 20 ...
##  $ common_terms: int  3 3 2 2 2 2 2 2 2 2 ...
##  $ matches     :List of 177

This is where the power of the API would really come in handy as we collect all this information and start to look for correlations, time series patterns (or anomalies) and possibly extract features to help build models to detect various types of malicious traffic.
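
As one possible starting point for the feature extraction idea, here is a small sketch that rolls a connection log up into per-destination features. The function name `conn_features()` and the choice of features are assumptions on my part, not anything PacketTotal or the package prescribes, and the `toy` data frame is made up (documentation-range IPs) purely so the example runs without a download. It assumes a reasonably recent dplyr (1.0 or later) for the `.groups` argument.

```r
library(dplyr)

# Hypothetical helper: per-destination summary features from a
# conn.csv-style data frame (column names as cleaned by janitor above)
conn_features <- function(conns) {
  conns %>%
    group_by(target_ip) %>%
    summarise(
      n_conns = n(),
      n_ports = n_distinct(target_port),
      bytes_out = sum(total_bytes_sent, na.rm = TRUE),
      bytes_in = sum(total_bytes_received, na.rm = TRUE),
      median_duration = median(duration, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    # guard against division by zero for destinations we sent nothing to
    mutate(in_out_ratio = bytes_in / pmax(bytes_out, 1)) %>%
    arrange(desc(n_conns))
}

# Made-up rows for illustration only; not taken from the PCAP above
toy <- tibble(
  target_ip = c("192.0.2.10", "192.0.2.10", "198.51.100.7"),
  target_port = c(443, 8080, 53),
  duration = c(1.5, 0.5, 0.03),
  total_bytes_sent = c(1000, 500, 74),
  total_bytes_received = c(4000, 2000, 104)
)

feats <- conn_features(toy)
feats
```

Running `conn_features(conns)` on the real log should give one row per destination, which could then be stacked across many PCAPs before any modeling.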
