Category Archives: R

#30DayMapChallenge 2021 Day 2: Lines

The 30-Day Map Challenge is on again, and I’m hoping to be able to scrounge some time to get an entry for each day. Day 2 is lines (Day 1 was posted on Twitter only) and — while I’m hoping to focus on saving U.S. democracy for the majority of the entries, today’s is a short one that shows all the walks/hikes/boats we took during our Iceland vacation.

I use HealthFit to export data (automagically) from all my Apple Watch, Garmin, and Peloton activities, and have it auto-sync to iCloud, which means I have instant access to all the Garmin FIT files it creates in ~/Library/Mobile Documents/iCloud~com~altifondo~HealthFit/.

We can use {FITFileR} to read in those files, turn the points into lines, and plot them all on a {leaflet} map, so I can get an interactive view into the paths.

If I had more time, I’d add activity names and other clickable statistics in popups, which is pretty straightforward in {leaflet}.

Comments in code hopefully explain the workflow:

library(sf)
library(FITfileR) # remotes::install_github("grimbough/FITfileR")
library(leaflet)
library(tidyverse)

# get a listing of all the files for when we were in Iceland
list.files(
  path = "~/Library/Mobile Documents/iCloud~com~altifondo~HealthFit/Documents", 
  pattern = "(2021-07-3[01]|2021-08-0[1-9])",
  full.names = TRUE
) %>% 
  map(
    ~readFitFile(.x) %>%          # read the file in
      records() %>%               # turn into something we can then bind into a data frame
      bind_rows() %>%             # bind ^^ into a data frame!
      mutate(file = basename(.x)) # add the file info in case we want to eventually make a popup of information
  ) -> iceland_fit_files


# look at the data

iceland_fit_files[[1]]
## # A tibble: 2,210 × 12
##    timestamp           position_lat position_long gps_accuracy altitude distance speed temperature heart_rate cadence
##    <dttm>                     <dbl>         <dbl>        <dbl>    <dbl>    <dbl> <dbl>       <dbl>      <dbl>   <dbl>
##  1 2021-07-30 06:28:26         64.2         -21.9            2     21       318. 0.221          13        116      NA
##  2 2021-07-30 06:28:27         64.2         -21.9            2     21       318. 0.521          13        116      NA
##  3 2021-07-30 06:28:28         64.2         -21.9            2     21       319. 0.792          13        116      NA
##  4 2021-07-30 06:28:47         64.2         -21.9            2     19.4     338. 0.217          13        118      NA
##  5 2021-07-30 06:28:48         64.2         -21.9            2     19.2     338. 0.147          13        118      NA
##  6 2021-07-30 06:28:49         64.2         -21.9            2     19       338. 0.109          13        117      NA
##  7 2021-07-30 06:28:50         64.2         -21.9            2     19       338. 0.076          13        117      NA
##  8 2021-07-30 06:28:51         64.2         -21.9            2     18.8     338. 0.036          13        116      NA
##  9 2021-07-30 06:28:52         64.2         -21.9            2     18.6     338. 0.153          13        115      NA
## 10 2021-07-30 06:28:53         64.2         -21.9            2     18.4     339. 0.393          13        115      NA
## # … with 2,200 more rows, and 2 more variables: fractional_cadence <dbl>, file <chr>

iceland_fit_files[map_lgl(iceland_fit_files, has_name, "position_long")] %>% # only want activities with geo data
  map(
    ~.x %>% 
      filter(!is.na(position_long), !is.na(position_lat)) %>% # {sf} hates NAs
      st_as_sf(
        coords = c("position_long", "position_lat"), # turn the data frame into an {sf} object
        crs = 4326
      )
  ) %>% 
  bind_rows() %>% # this makes one big data frame
  group_by(file) %>%  # which we can turn into individual geometries
  summarise(          # with the epic summarise() function
    m = max(distance)
  ) %>% 
  st_cast("LINESTRING") -> paths # and then turn the points into linestrings

# let's take a look

paths
## Simple feature collection with 33 features and 2 fields
## Geometry type: LINESTRING
## Dimension:     XY
## Bounding box:  xmin: -23.92788 ymin: 63.40121 xmax: -16.18086 ymax: 65.08101
## Geodetic CRS:  WGS 84
## # A tibble: 33 × 3
##    file                                        m                                                                    geometry
##    <chr>                                   <dbl>                                                            <LINESTRING [°]>
##  1 2021-07-30-062401-Walking-Chetzmoka.fit 2807. (-21.94386 64.15495, -21.94382 64.15497, -21.94384 64.15496, -21.94388 64.…
##  2 2021-07-30-101146-Walking-Chetzmoka.fit 1191. (-21.93504 64.14761, -21.93502 64.1476, -21.93503 64.14761, -21.93501 64.1…
##  3 2021-07-30-105706-Walking-Chetzmoka.fit 1554. (-21.91706 64.12906, -21.91704 64.12905, -21.91702 64.12905, -21.917 64.12…
##  4 2021-07-30-143247-Walking-Chetzmoka.fit 1620. (-21.94977 64.15683, -21.94978 64.15685, -21.94979 64.15686, -21.94995 64.…
##  5 2021-07-31-122122-Walking-Chetzmoka.fit  702. (-22.26431 64.7628, -22.26463 64.76297, -22.26527 64.76332, -22.26496 64.7…
##  6 2021-07-31-131832-Walking-Chetzmoka.fit 2084. (-22.84353 64.90553, -22.84351 64.90553, -22.84349 64.90553, -22.84347 64.…
##  7 2021-07-31-182725-Walking-Chetzmoka.fit  578. (-22.72256 65.0809, -22.72255 65.0809, -22.72254 65.08089, -22.72252 65.08…
##  8 2021-08-01-105420-Walking-Chetzmoka.fit 1210. (-23.63732 64.79827, -23.6373 64.79829, -23.6373 64.79829, -23.6373 64.798…
##  9 2021-08-01-142847-Walking-Chetzmoka.fit 2385. (-23.80382 64.73048, -23.8038 64.73047, -23.8038 64.73047, -23.8038 64.730…
## 10 2021-08-01-165642-Walking-Chetzmoka.fit  423. (-23.92745 64.85198, -23.92749 64.85197, -23.92747 64.85197, -23.92746 64.…
## # … with 23 more rows

# make room locally

dir.create("~/projects/2021-iceland", showWarnings = FALSE)

# save the widget out

paths %>% 
  filter(!grepl("2021-08-10", file)) %>% 
  leaflet() %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addPolygons(color = "green") %>% 
  htmlwidgets::saveWidget(
    file = "~/projects/2021-iceland/index.html",
    selfcontained = FALSE
  )

# rsync it out to my website
#

This is the (static) overview:

And, this is a zoom into our boating tour of a glacier lagoon:

Hit up the widget to see where we did our Iceland activities this summer!

RSwitch 2.1.2b Bug-fix Release

RStudio is continuing to refine their 👍🏽 new product version string scheme, with the most recent change using something like the following for daily builds:

RStudio-pro-2021.11.0-daily+23.pro1.dmg

The full URL for these Amazon S3-hosted builds is something like:

https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-pro-2021.11.0-daily%2B23.pro1.dmg

with the + getting URL encoded to %2B.

Unfortunately, AWS S3’s fairly unintelligent web server really wanted that + encoded to %2B for it to properly map to the retrieved-by-explicit-path-key bucket-backed object store. Since it was not, the download component of RSwitch v2.x failed with a somewhat inaccurate error message.

It was a minor fix that I finally had some spare cycles to make today.

You can download the RSwitch 2.1.2b universal binary here.

Some Covid Donuts To End The Week

Vox grabbed some data from the Kaiser Family Foundation and did a story a few days ago on it, then posted a different visualization of it that attracted some attention:

I’m a pretty ardent donut detractor, but I have to also admit that they work pretty well for this use case, and we can reproduce the graphic in R quite nicely. The following code chunk:

scrapes the datawrapper chart Vox embedded in their original article
extracts the data from it using V8
reformats it and uses some basic ggplot2 idioms for making the facets

library(V8) # we need to execute javascript to get the data
library(ggtext) # pretty ggtext
library(stringi) # some basic string ops
library(rvest) # scraping!
library(hrbragg) # remotes::install_github("hrbrmstr/hrbragg")
library(tidyverse) # duh

ctx <- v8() # init the V8 engine

pg <- read_html("https://datawrapper.dwcdn.net/jlEL9/6/") # get the data — this is embedded in the Vox article

# clean up the javascript so we can evaluate it
html_nodes(pg, xpath = ".//script[contains(., 'DW_SVEL')]") %>% 
  html_text() %>% 
  stri_replace_first_fixed("window.", "") %>% 
  stri_replace_all_regex("window.__.*", "", multiline = TRUE) %>% 
  ctx$eval()

# get the dat from the V8 engine and reshape it a bit, then join it with state abbreviations
ctx$get("__DW_SVELTE_PROPS__")$data %>% 
  read_tsv(
    col_names = c(
      "state", 
      "Reported cases among fully vaccinated", 
      "Reported cases among not fully vaccinated"
    )
  ) %>% 
  gather(measure, value, -state) %>% 
  left_join(
    tibble(
      state = state.name,
      abbr = state.abb
    ) %>% 
      add_row(
        state = "District of Columbia",
        abbr = "DC"
      ) 
  ) -> voxxed

# basic ggplot idiom for faceted donuts

ggplot() +
  geom_col(
    data = voxxed,
    aes(3, value, fill = measure), # play with "3" here and below to change width of the donut
    color = NA,
    position = position_stack()
  ) +
  ggtext::geom_richtext(
    data = voxxed %>% filter(measure == "Reported cases among not fully vaccinated"),
    # change 0.2 and 0 to see what they do
    aes(0.2, 0, label = sprintf("%s<br/><b><span style='color:#8a264a'>%s%%</span></b>", abbr, value)),
    size = 5, label.size = 0
  ) +
  scale_x_continuous(
    limits = c(0.2, 3 + 0.5) # this 3 links to the 3 above. tweak 0.2 and 0.5 to see what it does to the shape
  ) +
  scale_fill_manual(
    name = NULL,
    values = c(
      "Reported cases among fully vaccinated" = "#d59e67",
      "Reported cases among not fully vaccinated" = "#8a264a"
    )
  ) +
  coord_polar(theta = "y") +
  facet_wrap(~state, ncol = 6) +
  labs(
    x = NULL, y = NULL,
    title = "Breakthrough cases are not driving the US Covid-19 surge",
    caption = "Source: Kaiser Family Foundation\nNote: Case data in recent months, as of July\nOriginal chart by Vox <https://www.vox.com/22602039/breakthrough-cases-covid-19-delta-variant-masks-vaccines>"
  ) +
  theme_inter(grid="") +
  theme(
    axis.text.x = elb(),
    axis.text.x.top = elb(),
    axis.text.x.bottom = elb(),
    axis.text.y = elb(),
    axis.text.y.left = elb(),
    axis.text.y.right = elb()
  ) +
  theme(
    panel.spacing.x = unit(0, "npc"),
    panel.spacing.y = unit(0, "npc")
  ) + 
  theme(
    strip.text = elb(),
    strip.text.x = elb(),
    strip.text.y = elb()
  ) +
  theme(
    legend.position = "top",
    legend.text = elt(size = 12),
    legend.justification = "left"
  ) +
  theme(
    plot.caption = elt(hjust = 0)
  )

Hasty Bug-fix Release of RSwitch v2

Version 2 of RSwitch — the macOS menubar utility that enables fast and seamless switching between R versions on macOS and also provides notifications for new versions of RStudio Dailies has had a hasty update to deal with an error condition if/when the RStudio Dailies pages contain no links (they’re switching versioning schemes and I happened to catch this right at that time). The original RSwitch v1.x should be fine as long as the non-RStudio Pro dailies page isn’t busted.

Version 2.1.1b🔗 also adds a downloader for the arm64 version of the R-devel tarball:

The preferences also has a new button that lets you clear the cached previously seen latest daily download versions if you’re not getting notified (there could be a hiccup depending on when the last check was and the state of the daily download pages).

FIN

As noted, this is a hasty fix (right before heading out on holiday), so if there are issues it’ll be a bit before they get fixed.

Acoustic: Solving a CyberDefenders PCAP SIP/RTP Challenge with R, Zeek, tshark (& friends)

2021-07-25 – 09:40
Posted in Cybersecurity, Data Analysis, data driven security, data wrangling, Information Security, pcap, R
Comments (3)

Hot on the heels of the previous CyberDefenders Challenge Solution comes this noisy installment which solves their Acoustic challenge.

You can find the source Rmd on GitHub, but I’m also testing the limits of WP’s markdown rendering and putting it in-stream as well.

No longer book expository this time since much of the setup/explanatory bits from it apply here as well).

Acoustic

Convert the PCAP
Examine and Process log.txt
Process Zeek Logs
Process Packet Summary
What is the transport protocol being used?
The attacker used a bunch of scanning tools that belong to the same suite. Provide the name of the suite.
“What is the User-Agent of the victim system?”
Which tool was only used against the following extensions: 100, 101, 102, 103, and 111?
Which extension on the honeypot does NOT require authentication?
How many extensions were scanned in total?
There is a trace for a real SIP client. What is the corresponding user-agent? (two words, once space in between)
Multiple real-world phone numbers were dialed. Provide the first 11 digits of the number dialed from extension 101?
What are the default credentials used in the attempted basic authentication? (format is username:password)
Which codec does the RTP stream use? (3 words, 2 spaces in between)
How long is the sampling time (in milliseconds)?
What was the password for the account with username 555?
Which RTP packet header field can be used to reorder out of sync
RTP packets in the correct sequence?
The trace includes a secret hidden message. Can you hear
it?

This challenge takes us “into the world of voice communications on the internet. VoIP is becoming the de-facto standard for voice communication. As this technology becomes more common, malicious parties have more opportunities and stronger motives to control these systems to conduct nefarious activities. This challenge was designed to examine and explore some of the attributes of the SIP and RTP protocols.”

We have two files to work with:

log.txt which was generated from an unadvertised, passive honeypot located on the internet such that any traffic destined to it must be nefarious. Unknown parties scanned the honeypot with a range of tools, and this activity is represented in the log file.
- The IP address of the honeypot has been changed to “honey.pot.IP.removed”. In terms of geolocation, pick your favorite city.
- The MD5 hash in the authorization digest is replaced with “MD5_hash_removedXXXXXXXXXXXXXXXX”
- Some octets of external IP addresses have been replaced with an “X”
- Several trailing digits of phone numbers have been replaced with an “X”
- Assume the timestamps in the log files are UTC.
Voip-trace.pcap was created by honeynet members for this forensic challenge to allow participants to employ network analysis skills in the VOIP context.

There are 14 questions to answer.

If you are not familiar with SIP and/or RTP you should do a bit of research first. A good place to start is RTC 3261 (for SIP) and RFC 3550 (for RTC). Some questions may be able to be answered just by knowing the details of these protocols.

Convert the PCAP

library(stringi)
library(tidyverse)

We’ll pre-generate Zeek logs. The -C tells Zeek to not bother with checksums, -r tells it to read from a file and the LogAscii::use_json=T means we want JSON output vs the default delimited files. JSON gives us data types (the headers in the delimited files do as well, but we’d have to write something to read those types then deal with it vs get this for free out of the box with JSON).

system("ZEEK_LOG_SUFFIX=json /opt/zeek/bin/zeek -C -r src/Voip-trace.pcap LogAscii::use_json=T HTTP::default_capture_password=T")

We process the PCAP twice with tshark. Once to get the handy (and small) packet summary table, then dump the whole thing to JSON. We may need to run tshark again down the road a bit.

system("tshark -T tabs -r src/Voip-trace.pcap > voip-packets.tsv")
system("tshark -T json -r src/Voip-trace.pcap > voip-trace")

Examine and Process `log.txt`

We aren’t told what format log.txt is in, so let’s take a look:

cd_sip_log <- stri_read_lines("src/log.txt")

cat(head(cd_sip_log, 25), sep="\n")
## Source: 210.184.X.Y:1083
## Datetime: 2010-05-02 01:43:05.606584
## 
## Message:
## 
## OPTIONS sip:100@honey.pot.IP.removed SIP/2.0
## Via: SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;rport
## Content-Length: 0
## From: "sipvicious"<sip:100@1.1.1.1>; tag=X_removed
## Accept: application/sdp
## User-Agent: friendly-scanner
## To: "sipvicious"<sip:100@1.1.1.1>
## Contact: sip:100@127.0.0.1:5061
## CSeq: 1 OPTIONS
## Call-ID: 845752980453913316694142
## Max-Forwards: 70
## 
## 
## 
## 
## -------------------------
## Source: 210.184.X.Y:4956
## Datetime: 2010-05-02 01:43:12.488811
## 
## Message:

These look a bit like HTTP server responses, but we know we’re working in SIP land and if you perused the RFC you’d have noticed that SIP is an HTTP-like ASCII protocol. While some HTTP response parsers might work on these records, it’s pretty straightforward to whip up a bespoke pseudo-parser.

Let’s see how many records there are by counting the number of “Message:” lines (we’re doing this, primarily, to see if we should use the {furrr} package to speed up processing):

cd_sip_log[stri_detect_fixed(cd_sip_log, "Message:")] %>%
  table()
## .
## Message: 
##     4266

There are many, so we’ll avoid parallel processing the data and just use a single thread.

One way to tackle the parsing is to look for the stop and start of each record, extract fields (these have similar formats to HTTP headers), and perhaps have to extract content as well. We know this because there are “Content-Length:” fields. According to the RFC they are supposed to exist for every message. Let’s first see if any “Content-Length:” header records are greater than 0. We’ll do this with a little help from the ripgrep utility as it provides a way to see context before and/or after matched patterns:

cat(system('rg --after-context=10 "^Content-Length: [^0]" src/log.txt', intern=TRUE), sep="\n")
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000
## --
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000
## --
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000
## --
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000

So,we do need to account for content. It’s still pretty straightforward (explanatory comments inline):

starts <- which(stri_detect_regex(cd_sip_log, "^Source:"))
stops <- which(stri_detect_regex(cd_sip_log, "^----------"))

map2_dfr(starts, stops, ~{

  raw_rec <- stri_trim_both(cd_sip_log[.x:.y]) # target the record from the log
  raw_rec <- raw_rec[raw_rec != "-------------------------"] # remove separator

  msg_idx <- which(stri_detect_regex(raw_rec, "^Message:")) # find where "Message:" line is
  source_idx <- which(stri_detect_regex(raw_rec, "^Source: ")) # find where "Source:" line is
  datetime_idx <- which(stri_detect_regex(raw_rec, "^Datetime: ")) # find where "Datetime:" line is
  contents_idx <- which(stri_detect_regex(raw_rec[(msg_idx+2):length(raw_rec)], "^$"))[1] + 2 # get position of the "data"

  source <- stri_match_first_regex(raw_rec[source_idx], "^Source: (.*)$")[,2] # extract source
  datetime <- stri_match_first_regex(raw_rec[datetime_idx], "^Datetime: (.*)$")[,2] # extract datetime
  request <- raw_rec[msg_idx+2] # extract request line

  # build a matrix out of the remaining headers. header key will be in column 2, value will be in column 3
  tmp <- stri_match_first_regex(raw_rec[(msg_idx+3):contents_idx], "^([^:]+):[[:space:]]+(.*)$")
  tmp[,2] <- stri_trans_tolower(tmp[,2]) # lowercase the header key
  tmp[,2] <- stri_replace_all_fixed(tmp[,2], "-", "_") # turn dashes to underscores so we can more easily use the keys as column names

  contents <- raw_rec[(contents_idx+1):length(raw_rec)]
  contents <- paste0(contents[contents != ""], collapse = "\n")

  as.list(tmp[,3]) %>% # turn the header values into a list
    set_names(tmp[,2]) %>% # make their names the tranformed keys
    append(c(
      source = source, # add source to the list (etc)
      datetime = datetime,
      request = request,
      contents = contents
    ))

}) -> sip_log_parsed

Let’s see what we have:

sip_log_parsed
## # A tibble: 4,266 x 18
##    via     content_length from    accept  user_agent to     contact cseq  source
##    <chr>   <chr>          <chr>   <chr>   <chr>      <chr>  <chr>   <chr> <chr> 
##  1 SIP/2.… 0              "\"sip… applic… friendly-… "\"si… sip:10… 1 OP… 210.1…
##  2 SIP/2.… 0              "\"342… applic… friendly-… "\"34… sip:34… 1 RE… 210.1…
##  3 SIP/2.… 0              "\"172… applic… friendly-… "\"17… sip:17… 1 RE… 210.1…
##  4 SIP/2.… 0              "\"adm… applic… friendly-… "\"ad… sip:ad… 1 RE… 210.1…
##  5 SIP/2.… 0              "\"inf… applic… friendly-… "\"in… sip:in… 1 RE… 210.1…
##  6 SIP/2.… 0              "\"tes… applic… friendly-… "\"te… sip:te… 1 RE… 210.1…
##  7 SIP/2.… 0              "\"pos… applic… friendly-… "\"po… sip:po… 1 RE… 210.1…
##  8 SIP/2.… 0              "\"sal… applic… friendly-… "\"sa… sip:sa… 1 RE… 210.1…
##  9 SIP/2.… 0              "\"ser… applic… friendly-… "\"se… sip:se… 1 RE… 210.1…
## 10 SIP/2.… 0              "\"sup… applic… friendly-… "\"su… sip:su… 1 RE… 210.1…
## # … with 4,256 more rows, and 9 more variables: datetime <chr>, request <chr>,
## #   contents <chr>, call_id <chr>, max_forwards <chr>, expires <chr>,
## #   allow <chr>, authorization <chr>, content_type <chr>

glimpse(sip_log_parsed)
## Rows: 4,266
## Columns: 18
## $ via            <chr> "SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;r…
## $ content_length <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ from           <chr> "\"sipvicious\"<sip:100@1.1.1.1>; tag=X_removed", "\"34…
## $ accept         <chr> "application/sdp", "application/sdp", "application/sdp"…
## $ user_agent     <chr> "friendly-scanner", "friendly-scanner", "friendly-scann…
## $ to             <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "\"3428948518\"<sip:…
## $ contact        <chr> "sip:100@127.0.0.1:5061", "sip:3428948518@honey.pot.IP.…
## $ cseq           <chr> "1 OPTIONS", "1 REGISTER", "1 REGISTER", "1 REGISTER", …
## $ source         <chr> "210.184.X.Y:1083", "210.184.X.Y:4956", "210.184.X.Y:51…
## $ datetime       <chr> "2010-05-02 01:43:05.606584", "2010-05-02 01:43:12.4888…
## $ request        <chr> "OPTIONS sip:100@honey.pot.IP.removed SIP/2.0", "REGIST…
## $ contents       <chr> "Call-ID: 845752980453913316694142\nMax-Forwards: 70", …
## $ call_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ max_forwards   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ expires        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ allow          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ authorization  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ content_type   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Looks 👍, but IRL there are edge-cases we’d have to deal with.

Process Zeek Logs

Because they’re JSON files, and the names are reasonable, we can do some magic incantations to read them all in and shove them into a list we’ll call zeek:

zeek <- list()

list.files(
  pattern = "json$",
  full.names = TRUE
) %>%
  walk(~{
    append(zeek, list(file(.x) %>% 
      jsonlite::stream_in(verbose = FALSE) %>%
      as_tibble()) %>% 
        set_names(tools::file_path_sans_ext(basename(.x)))
    ) ->> zeek
  })

str(zeek, 1)
## List of 7
##  $ conn         : tibble [97 × 18] (S3: tbl_df/tbl/data.frame)
##  $ dpd          : tibble [1 × 9] (S3: tbl_df/tbl/data.frame)
##  $ files        : tibble [38 × 16] (S3: tbl_df/tbl/data.frame)
##  $ http         : tibble [92 × 24] (S3: tbl_df/tbl/data.frame)
##  $ packet_filter: tibble [1 × 5] (S3: tbl_df/tbl/data.frame)
##  $ sip          : tibble [9 × 23] (S3: tbl_df/tbl/data.frame)
##  $ weird        : tibble [1 × 9] (S3: tbl_df/tbl/data.frame)

walk2(names(zeek), zeek, ~{
  cat("File:", .x, "\n")
  glimpse(.y)
  cat("\n\n")
})
## File: conn 
## Rows: 97
## Columns: 18
## $ ts            <dbl> 1272737631, 1272737581, 1272737669, 1272737669, 12727376…
## $ uid           <chr> "Cb0OAQ1eC0ZhQTEKNl", "C2s0IU2SZFGVlZyH43", "CcEeLRD3cca…
## $ id.orig_h     <chr> "172.25.105.43", "172.25.105.43", "172.25.105.43", "172.…
## $ id.orig_p     <int> 57086, 5060, 57087, 57088, 57089, 57090, 57091, 57093, 5…
## $ id.resp_h     <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "172.…
## $ id.resp_p     <int> 80, 5060, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80…
## $ proto         <chr> "tcp", "udp", "tcp", "tcp", "tcp", "tcp", "tcp", "tcp", …
## $ service       <chr> "http", "sip", "http", "http", "http", "http", "http", "…
## $ duration      <dbl> 0.0180180073, 0.0003528595, 0.0245900154, 0.0740420818, …
## $ orig_bytes    <int> 502, 428, 380, 385, 476, 519, 520, 553, 558, 566, 566, 5…
## $ resp_bytes    <int> 720, 518, 231, 12233, 720, 539, 17499, 144, 144, 144, 14…
## $ conn_state    <chr> "SF", "SF", "SF", "SF", "SF", "SF", "SF", "SF", "SF", "S…
## $ missed_bytes  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ history       <chr> "ShADadfF", "Dd", "ShADadfF", "ShADadfF", "ShADadfF", "S…
## $ orig_pkts     <int> 5, 1, 5, 12, 5, 6, 16, 6, 6, 5, 5, 5, 5, 5, 5, 5, 6, 5, …
## $ orig_ip_bytes <int> 770, 456, 648, 1017, 744, 839, 1360, 873, 878, 834, 834,…
## $ resp_pkts     <int> 5, 1, 5, 12, 5, 5, 16, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ resp_ip_bytes <int> 988, 546, 499, 12865, 988, 807, 18339, 412, 412, 412, 41…
## 
## 
## File: dpd 
## Rows: 1
## Columns: 9
## $ ts             <dbl> 1272737798
## $ uid            <chr> "CADvMziC96POynR2e"
## $ id.orig_h      <chr> "172.25.105.3"
## $ id.orig_p      <int> 43204
## $ id.resp_h      <chr> "172.25.105.40"
## $ id.resp_p      <int> 5060
## $ proto          <chr> "udp"
## $ analyzer       <chr> "SIP"
## $ failure_reason <chr> "Binpac exception: binpac exception: string mismatch at…
## 
## 
## File: files 
## Rows: 38
## Columns: 16
## $ ts             <dbl> 1272737631, 1272737669, 1272737676, 1272737688, 1272737…
## $ fuid           <chr> "FRnb7P5EDeZE4Y3z4", "FOT2gC2yLxjfMCuE5f", "FmUCuA3dzcS…
## $ tx_hosts       <list> "172.25.105.40", "172.25.105.40", "172.25.105.40", "17…
## $ rx_hosts       <list> "172.25.105.43", "172.25.105.43", "172.25.105.43", "17…
## $ conn_uids      <list> "Cb0OAQ1eC0ZhQTEKNl", "CFfYtA0DqqrJk4gI5", "CHN4qA4UUH…
## $ source         <chr> "HTTP", "HTTP", "HTTP", "HTTP", "HTTP", "HTTP", "HTTP",…
## $ depth          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ analyzers      <list> [], [], [], [], [], [], [], [], [], [], [], [], [], []…
## $ mime_type      <chr> "text/html", "text/html", "text/html", "text/html", "te…
## $ duration       <dbl> 0.000000e+00, 8.920908e-03, 0.000000e+00, 0.000000e+00,…
## $ is_orig        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, …
## $ seen_bytes     <int> 479, 11819, 479, 313, 17076, 55, 50, 30037, 31608, 1803…
## $ total_bytes    <int> 479, NA, 479, 313, NA, 55, 50, NA, NA, NA, 58, 313, 50,…
## $ missing_bytes  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ overflow_bytes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ timedout       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## 
## 
## File: http 
## Rows: 92
## Columns: 24
## $ ts                <dbl> 1272737631, 1272737669, 1272737669, 1272737676, 1272…
## $ uid               <chr> "Cb0OAQ1eC0ZhQTEKNl", "CcEeLRD3cca3j4QGh", "CFfYtA0D…
## $ id.orig_h         <chr> "172.25.105.43", "172.25.105.43", "172.25.105.43", "…
## $ id.orig_p         <int> 57086, 57087, 57088, 57089, 57090, 57091, 57093, 570…
## $ id.resp_h         <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "…
## $ id.resp_p         <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ trans_depth       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ method            <chr> "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GE…
## $ host              <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "…
## $ uri               <chr> "/maint", "/", "/user/", "/maint", "/maint", "/maint…
## $ referrer          <chr> "http://172.25.105.40/user/", NA, NA, "http://172.25…
## $ version           <chr> "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.…
## $ user_agent        <chr> "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9)…
## $ request_body_len  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ response_body_len <int> 479, 0, 11819, 479, 313, 17076, 0, 0, 0, 0, 0, 0, 0,…
## $ status_code       <int> 401, 302, 200, 401, 301, 200, 304, 304, 304, 304, 30…
## $ status_msg        <chr> "Authorization Required", "Found", "OK", "Authorizat…
## $ tags              <list> [], [], [], [], [], [], [], [], [], [], [], [], [],…
## $ resp_fuids        <list> "FRnb7P5EDeZE4Y3z4", <NULL>, "FOT2gC2yLxjfMCuE5f", …
## $ resp_mime_types   <list> "text/html", <NULL>, "text/html", "text/html", "tex…
## $ username          <chr> NA, NA, NA, NA, "maint", "maint", "maint", "maint", …
## $ password          <chr> NA, NA, NA, NA, "password", "password", "password", …
## $ orig_fuids        <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NU…
## $ orig_mime_types   <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NU…
## 
## 
## File: packet_filter 
## Rows: 1
## Columns: 5
## $ ts      <dbl> 1627151196
## $ node    <chr> "zeek"
## $ filter  <chr> "ip or not ip"
## $ init    <lgl> TRUE
## $ success <lgl> TRUE
## 
## 
## File: sip 
## Rows: 9
## Columns: 23
## $ ts                <dbl> 1272737581, 1272737768, 1272737768, 1272737768, 1272…
## $ uid               <chr> "C2s0IU2SZFGVlZyH43", "CADvMziC96POynR2e", "CADvMziC…
## $ id.orig_h         <chr> "172.25.105.43", "172.25.105.3", "172.25.105.3", "17…
## $ id.orig_p         <int> 5060, 43204, 43204, 43204, 43204, 43204, 43204, 4320…
## $ id.resp_h         <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "…
## $ id.resp_p         <int> 5060, 5060, 5060, 5060, 5060, 5060, 5060, 5060, 5060
## $ trans_depth       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ method            <chr> "OPTIONS", "REGISTER", "REGISTER", "SUBSCRIBE", "SUB…
## $ uri               <chr> "sip:100@172.25.105.40", "sip:172.25.105.40", "sip:1…
## $ request_from      <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "<sip:555@172.25.…
## $ request_to        <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "<sip:555@172.25.…
## $ response_from     <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "<sip:555@172.25.…
## $ response_to       <chr> "\"sipvicious\"<sip:100@1.1.1.1>;tag=as18cdb0c9", "<…
## $ call_id           <chr> "61127078793469957194131", "MzEwMmYyYWRiYTUxYTBhODY3…
## $ seq               <chr> "1 OPTIONS", "1 REGISTER", "2 REGISTER", "1 SUBSCRIB…
## $ request_path      <list> "SIP/2.0/UDP 127.0.1.1:5060", "SIP/2.0/UDP 172.25.10…
## $ response_path     <list> "SIP/2.0/UDP 127.0.1.1:5060", "SIP/2.0/UDP 172.25.10…
## $ user_agent        <chr> "UNfriendly-scanner - for demo purposes", "X-Lite B…
## $ status_code       <int> 200, 401, 200, 401, 404, 401, 100, 200, NA
## $ status_msg        <chr> "OK", "Unauthorized", "OK", "Unauthorized", "Not fo…
## $ request_body_len  <int> 0, 0, 0, 0, 0, 264, 264, 264, 0
## $ response_body_len <int> 0, 0, 0, 0, 0, 0, 0, 302, NA
## $ content_type      <chr> NA, NA, NA, NA, NA, NA, NA, "application/sdp", NA
## 
## 
## File: weird 
## Rows: 1
## Columns: 9
## $ ts        <dbl> 1272737805
## $ id.orig_h <chr> "172.25.105.3"
## $ id.orig_p <int> 0
## $ id.resp_h <chr> "172.25.105.40"
## $ id.resp_p <int> 0
## $ name      <chr> "truncated_IPv6"
## $ notice    <lgl> FALSE
## $ peer      <chr> "zeek"
## $ source    <chr> "IP"

Process Packet Summary

We won’t process the big JSON file tshark generated for us util we really have to, but we can read in the packet summary table now:

packet_cols <- c("packet_num", "ts", "src", "discard", "dst", "proto", "length", "info")

read_tsv(
  file = "voip-packets.tsv",
  col_names = packet_cols,
  col_types = "ddccccdc"
) %>%
  select(-discard) -> packets

packets
## # A tibble: 4,447 x 7
##    packet_num       ts src      dst     proto length info                       
##         <dbl>    <dbl> <chr>    <chr>   <chr>  <dbl> <chr>                      
##  1          1  0       172.25.… 172.25… SIP      470 Request: OPTIONS sip:100@1…
##  2          2  3.53e-4 172.25.… 172.25… SIP      560 Status: 200 OK |           
##  3          3  5.03e+1 172.25.… 172.25… TCP       74 57086 → 80 [SYN] Seq=0 Win…
##  4          4  5.03e+1 172.25.… 172.25… TCP       74 80 → 57086 [SYN, ACK] Seq=…
##  5          5  5.03e+1 172.25.… 172.25… TCP       66 57086 → 80 [ACK] Seq=1 Ack…
##  6          6  5.03e+1 172.25.… 172.25… HTTP     568 GET /maint HTTP/1.1        
##  7          7  5.03e+1 172.25.… 172.25… TCP       66 80 → 57086 [ACK] Seq=1 Ack…
##  8          8  5.03e+1 172.25.… 172.25… HTTP     786 HTTP/1.1 401 Authorization…
##  9          9  5.03e+1 172.25.… 172.25… TCP       66 80 → 57086 [FIN, ACK] Seq=…
## 10         10  5.03e+1 172.25.… 172.25… TCP       66 57086 → 80 [ACK] Seq=503 A…
## # … with 4,437 more rows

glimpse(packets)
## Rows: 4,447
## Columns: 7
## $ packet_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ ts         <dbl> 0.000000, 0.000353, 50.317176, 50.317365, 50.320071, 50.329…
## $ src        <chr> "172.25.105.43", "172.25.105.40", "172.25.105.43", "172.25.…
## $ dst        <chr> "172.25.105.40", "172.25.105.43", "172.25.105.40", "172.25.…
## $ proto      <chr> "SIP", "SIP", "TCP", "TCP", "TCP", "HTTP", "TCP", "HTTP", "…
## $ length     <dbl> 470, 560, 74, 74, 66, 568, 66, 786, 66, 66, 66, 66, 74, 74,…
## $ info       <chr> "Request: OPTIONS sip:100@172.25.105.40 |", "Status: 200 OK…

What is the transport protocol being used?

SIP can use TCP or UDP and which transport it uses will be specified in the Via: header. Let’s take a look:

head(sip_log_parsed$via)
## [1] "SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;rport"
## [2] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-1189344537;rport"
## [3] "SIP/2.0/UDP 127.0.0.1:5066;branch=z9hG4bK-2119091576;rport"
## [4] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-3226446220;rport"
## [5] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-1330901245;rport"
## [6] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-945386205;rport"

Are they all UDP? We can find out by performing some light processing
on the via column:

sip_log_parsed %>% 
  select(via) %>% 
  mutate(
    transport = stri_match_first_regex(via, "^([^[:space:]]+)")[,2]
  ) %>% 
  count(transport, sort=TRUE)
## # A tibble: 1 x 2
##   transport       n
##   <chr>       <int>
## 1 SIP/2.0/UDP  4266

Looks like they’re all UDP. Question 1: ✅

The attacker used a bunch of scanning tools that belong to the same suite. Provide the name of the suite.

Don’t you, now, wish you had listen to your parents when they were telling you about the facts of SIP life when you were a wee pup?

We’ll stick with the SIP log to answer this one and peek back at the RFC to see that there’s a “User-Agent:” field which contains information about the client originating the request. Most scanners written by defenders identify themselves in User-Agent fields when those fields are available in a protocol exchange, and a large percentage of naive malicious folks are too daft to change this value (or leave it default to make you think they’re not behaving badly).

If you are a regular visitor to SIP land, you likely know the common SIP scanning tools. These are a few:

Nmap’s SIP library
Mr.SIP, a “SIP-Based Audit and Attack Tool”
SIPVicious, a “set of security tools that can be used to audit SIP based VoIP systems”
Sippts, a “set of tools to audit SIP based VoIP Systems”

(There are many more.)

Let’s see what user-agent was used in this log extract:

count(sip_log_parsed, user_agent, sort=TRUE)
## # A tibble: 3 x 2
##   user_agent           n
##   <chr>            <int>
## 1 friendly-scanner  4248
## 2 Zoiper rev.6751     14
## 3 <NA>                 4

The overwhelming majority are friendly-scanner. Let’s look at a few of those log entries:

sip_log_parsed %>% 
  filter(
    user_agent == "friendly-scanner"
  ) %>% 
  glimpse()
## Rows: 4,248
## Columns: 18
## $ via            <chr> "SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;r…
## $ content_length <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ from           <chr> "\"sipvicious\"<sip:100@1.1.1.1>; tag=X_removed", "\"34…
## $ accept         <chr> "application/sdp", "application/sdp", "application/sdp"…
## $ user_agent     <chr> "friendly-scanner", "friendly-scanner", "friendly-scann…
## $ to             <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "\"3428948518\"<sip:…
## $ contact        <chr> "sip:100@127.0.0.1:5061", "sip:3428948518@honey.pot.IP.…
## $ cseq           <chr> "1 OPTIONS", "1 REGISTER", "1 REGISTER", "1 REGISTER", …
## $ source         <chr> "210.184.X.Y:1083", "210.184.X.Y:4956", "210.184.X.Y:51…
## $ datetime       <chr> "2010-05-02 01:43:05.606584", "2010-05-02 01:43:12.4888…
## $ request        <chr> "OPTIONS sip:100@honey.pot.IP.removed SIP/2.0", "REGIST…
## $ contents       <chr> "Call-ID: 845752980453913316694142\nMax-Forwards: 70", …
## $ call_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ max_forwards   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ expires        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ allow          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ authorization  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ content_type   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Those from and to fields have an interesting name in them: “sipviscious”. You’ve seen that before, right at the beginning of this section.

Let’s do a quick check over at the SIPvicious repo just to make sure.

count(sip_log_parsed, user_agent)
## # A tibble: 3 x 2
##   user_agent           n
##   <chr>            <int>
## 1 friendly-scanner  4248
## 2 Zoiper rev.6751     14
## 3 <NA>                 4

“What is the User-Agent of the victim system?”

We only have partial data in the text log so we’ll have to look elsewhere (the PCAP) for this information. The “victim” is whatever was the target of a this SIP-based attack and we can look for SIP messages, user agents, and associated IPs in the PCAP thanks to tshark’s rich SIP filter library:

system("tshark -Q -T fields -e ip.src -e ip.dst -e sip.User-Agent -r src/Voip-trace.pcap 'sip.User-Agent'")

That first exchange is all we really need. We see our rude poker talking to 172.25.105.40 and it responding right after.

Which tool was only used against the following extensions: 100, 101, 102, 103, and 111?

The question is a tad vague and is assuming — since we now know the SIPvicious suite was used — that we also know to provide the name of the Python script in SIPvicious that was used. There are five tools:

svmap: this is a sip scanner. When launched against ranges of ip address space, it will identify any SIP servers which it finds on the way. Also has the option to scan hosts on ranges of ports. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVMap-Usage
svwar: identifies working extension lines on a PBX. A working extension is one that can be registered. Also tells you if the extension line requires authentication or not. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVWar-Usage
svcrack: a password cracker making use of digest authentication. It is able to crack passwords on both registrar servers and proxy servers. Current cracking modes are either numeric ranges or words from dictionary files. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVCrack-Usage
svreport: able to manage sessions created by the rest of the tools and export to pdf, xml, csv and plain text. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVReport-Usage
svcrash: responds to svwar and svcrack SIP messages with a message that causes old versions to crash. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVCrash-FAQ

The svcrash tool is something defenders can use to help curtail scanner activity. We can cross that off the list. The svreport tool is for working with data generated by svmap, svwar and/or svcrack. One more crossed off. We also know that the attacker scanned the SIP network looking for nodes, which means svmap and svwar are likely not exclusive tool to the target extensions. (We technically have enough information right now to answer the question especially if you look carefully at the answer box on the site but that’s cheating).

The SIP request line and header field like “To:” destination information in the form of a SIP URI. Since we only care about the extension component of the URI for this question, we can use a regular expression to isolate them.

Back to the SIP log to see if we can find the identified extensions. We’ll also process the “From:” header just in case we need it.

sip_log_parsed %>% 
  mutate_at(
    vars(request, from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  select(request, from, to)
## # A tibble: 4,266 x 3
##    request    from       to        
##    <chr>      <chr>      <chr>     
##  1 100        100        100       
##  2 3428948518 3428948518 3428948518
##  3 1729240413 1729240413 1729240413
##  4 admin      admin      admin     
##  5 info       info       info      
##  6 test       test       test      
##  7 postmaster postmaster postmaster
##  8 sales      sales      sales     
##  9 service    service    service   
## 10 support    support    support   
## # … with 4,256 more rows

That worked! We can now see what friendly-scanner attempted to authenticate only to our targets:

sip_log_parsed %>%
  mutate_at(
    vars(request, from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  filter(
    user_agent == "friendly-scanner",
    stri_detect_fixed(contents, "Authorization")
  ) %>% 
  distinct(to)
## # A tibble: 4 x 1
##   to   
##   <chr>
## 1 102  
## 2 103  
## 3 101  
## 4 111

While we’re missing 100 that’s likely due to it not requiring authentication (svcrack will REGISTER first to determine if a target requires authentication and not send cracking requests if it doesn’t).

Which extension on the honeypot does NOT require authentication?

We know this due to what we found in the previous question. Extension 100 does not require authentication.

How many extensions were scanned in total?

We just need to count the distinct to’s where the user agent is the scanner:

sip_log_parsed %>% 
  mutate_at(
    vars(request, from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  filter(
    user_agent == "friendly-scanner"
  ) %>% 
  distinct(to)
## # A tibble: 2,652 x 1
##    to        
##    <chr>     
##  1 100       
##  2 3428948518
##  3 1729240413
##  4 admin     
##  5 info      
##  6 test      
##  7 postmaster
##  8 sales     
##  9 service   
## 10 support   
## # … with 2,642 more rows

There is a trace for a real SIP client. What is the corresponding user-agent? (two words, once space in between)

We only need to look for user agent’s that aren’t our scanner:

sip_log_parsed %>% 
  filter(
    user_agent != "friendly-scanner"
  ) %>% 
  count(user_agent)
## # A tibble: 1 x 2
##   user_agent          n
##   <chr>           <int>
## 1 Zoiper rev.6751    14

Multiple real-world phone numbers were dialed. Provide the first 11 digits of the number dialed from extension 101?

Calls are “INVITE” requests

sip_log_parsed %>% 
  mutate_at(
    vars(from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  filter(
    from == 101,
    stri_detect_regex(cseq, "INVITE")
  ) %>% 
  select(to) 
## # A tibble: 3 x 1
##   to              
##   <chr>           
## 1 900114382089XXXX
## 2 00112322228XXXX 
## 3 00112524021XXXX

The challenge answer box provides hint to what number they want. I’m not sure but I suspect it may be randomized, so you’ll have to match the pattern they expect with the correct digits above.

What are the default credentials used in the attempted basic authentication? (format is username:password)

This question wants us to look at the HTTP requests that require authentication. We can get he credentials info from the zeek$http log:

zeek$http %>% 
  distinct(username, password)
## # A tibble: 2 x 2
##   username password
##   <chr>    <chr>   
## 1 <NA>     <NA>    
## 2 maint    password

Which codec does the RTP stream use? (3 words, 2 spaces in between)

“Codec” refers to the algorithm used to encode/decode an audio or video stream. The RTP RFC uses the term “payload type” to refer to this during exchanges and even has a link to RFC 3551 which provides further information on these encodings.

The summary packet table that tshark generates helpfully provides summary info for RTP packets and part of that info is PT=… which indicates the payload type.

packets %>% 
  filter(proto == "RTP") %>% 
  select(info)
## # A tibble: 2,988 x 1
##    info                                                       
##    <chr>                                                      
##  1 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6402, Time=126160
##  2 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6403, Time=126320
##  3 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6404, Time=126480
##  4 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6405, Time=126640
##  5 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6406, Time=126800
##  6 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6407, Time=126960
##  7 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6408, Time=127120
##  8 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6409, Time=127280
##  9 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6410, Time=127440
## 10 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6411, Time=127600
## # … with 2,978 more rows

How long is the sampling time (in milliseconds)?

1 Hz = 1,000 ms
1 ms = 1,000 Hz

(1/8000) * 1000

What was the password for the account with username 555?

We don’t really need to use external programs for this but it will sure go quite a bit faster if we do. While the original reference page for sipdump and sipcrack is defunct, you can visit that link to go to the Wayback machine’s capture of it. It will help if you have a linux system handy (so Docker to the rescue for macOS and Windows folks) since the following answer details are running on Ubunbu.

This question is taking advantage of the fact that the default authentication method for SIP is extremely weak. The process uses an MD5 challenge/response, and if an attacker can capture call traffic it is possible to brute force the password offline (which is what we’ll use sipcrack for).

You can install them via sudo apt install sipcrack.

We’ll first generate a dump of the authentication attempts with sipdump:

system("sipdump -p src/Voip-trace.pcap sip.dump", intern=TRUE)
##  [1] ""                                                               
##  [2] "SIPdump 0.2 "                                                   
##  [3] "---------------------------------------"                        
##  [4] ""                                                               
##  [5] "* Using pcap file 'src/Voip-trace.pcap' for sniffing"           
##  [6] "* Starting to sniff with packet filter 'tcp or udp'"            
##  [7] ""                                                               
##  [8] "* Dumped login from 172.25.105.40 -> 172.25.105.3 (User: '555')"
##  [9] "* Dumped login from 172.25.105.40 -> 172.25.105.3 (User: '555')"
## [10] "* Dumped login from 172.25.105.40 -> 172.25.105.3 (User: '555')"
## [11] ""                                                               
## [12] "* Exiting, sniffed 3 logins"

cat(readLines("sip.dump"), sep="\n")
## 172.25.105.3"172.25.105.40"555"asterisk"REGISTER"sip:172.25.105.40"4787f7ce""""MD5"1ac95ce17e1f0230751cf1fd3d278320
## 172.25.105.3"172.25.105.40"555"asterisk"INVITE"sip:1000@172.25.105.40"70fbfdae""""MD5"aa533f6efa2b2abac675c1ee6cbde327
## 172.25.105.3"172.25.105.40"555"asterisk"BYE"sip:1000@172.25.105.40"70fbfdae""""MD5"0b306e9db1f819dd824acf3227b60e07

It saves the IPs, caller, authentication realm, method, nonce and hash which will all be fed into the sipcrack.

We know from the placeholder answer text that the “password” is 4 characters, and this is the land of telephony, so we can make an assumption that it is really 4 digits. sipcrack needs a file of passwords to try, so We’ll let R make a randomized file of 4 digit pins for us:

cat(sprintf("%04d", sample(0:9999)), file = "4-digits", sep="\n")

We only have authenticaton packets for 555 so we can automate what would normally be an interactive process:

cat(system('echo "1" | sipcrack -w 4-digits sip.dump', intern=TRUE), sep="\n")
## 
## SIPcrack 0.2 
## ----------------------------------------
## 
## * Found Accounts:
## 
## Num  Server      Client      User    Hash|Password
## 
## 1    172.25.105.3    172.25.105.40   555 1ac95ce17e1f0230751cf1fd3d278320
## 2    172.25.105.3    172.25.105.40   555 aa533f6efa2b2abac675c1ee6cbde327
## 3    172.25.105.3    172.25.105.40   555 0b306e9db1f819dd824acf3227b60e07
## 
## * Select which entry to crack (1 - 3): 
## * Generating static MD5 hash... c3e0f1664fde9fbc75a7cbd341877875
## * Loaded wordlist: '4-digits'
## * Starting bruteforce against user '555' (MD5: '1ac95ce17e1f0230751cf1fd3d278320')
## * Tried 8904 passwords in 0 seconds
## 
## * Found password: '1234'
## * Updating dump file 'sip.dump'... done

Which RTP packet header field can be used to reorder out of sync RTP packets in the correct sequence?

Just reading involved here: 5.1 RTP Fixed Header Fields.

The trace includes a secret hidden message. Can you hear it?

We could command line this one but honestly Wireshark has a pretty keen audio player. Fire it up, open up the PCAP, go to the “Telephony” menu, pick SIP and play the streams.

Packet Maze: Solving a CyberDefenders PCAP Puzzle with R, Zeek, and tshark

2021-07-20 – 15:18
Posted in Cybersecurity, Data Analysis, data driven security, data science, data wrangling, DNS, Information Security, pcap, R
Comments (7)

It was a rainy weekend in southern Maine and I really didn’t feel like doing chores, so I was skimming through RSS feeds and noticed a link to a PacketMaze challenge in the latest This Week In 4n6.

Since it’s also been a while since I’ve done any serious content delivery (on the personal side, anyway), I thought it’d be fun to solve the challenge with some tools I like — namely Zeek, tshark, and R (links to those in the e-book I’m linking to below), craft some real expository around each solution, and bundle it all up into an e-book and lighter-weight GitHub repo.

There are 11 “quests” in the challenge, requiring sifting through a packet capture (PCAP) and looking for various odds and ends (some are very windy maze passages). The challenge ranges from extracting images and image metadata from FTP sessions to pulling out precise elements in TLS sessions, to dealing with IPv6.

This is far from an expert challenge, and anyone can likely work through it with a little bit of elbow grease.

As it says on the tin, not all data is ‘big’ nor do all data-driven cybersecurity projects require advanced modeling capabilities. Sometimes you just need to dissect some network packet capture (PCAP) data and don’t want to click through a GUI to get the job done. This short book works through the questions in CyberDefenders Lab #68 to show how you can get the Zeek open source network security tool, tshark command-line PCAP analysis Swiss army knife, and R (via RStudio) working together.

FIN

If you find the resource helpful or have other feedback, drop a note on Twitter (@hrbrmstr), in a comment here, or as a GitHub issue.

New TabularData Available in Swift on macOS Monterery (et al)

WWDC 2021 is on this week and many new fun things are being introduced, including some data science-friendly additions to the frameworks that come with Xcode 13 and available on macOS 12+ (and its *OS cousins).

Specifically, Apple has made tabular data a first-class citizen with the new TabularData app service.

A future post will have some more expository, but here’s a sample of core operations including:

reading in tabular data from CSV or JSON
examining the structure
working with columns and/or rows
grouping and filtering operations
transforming and removing columns

I’ve tagged this with rstats as there are R equivalents included for each operation so R folks can translate any Swift code they see in the future.

import TabularData

// define some basic formatting options for data frame output
let dOpts = FormattingOptions(maximumLineWidth: 80, maximumCellWidth: 10, maximumRowCount: 20, includesColumnTypes: true)

// read in a CSV file
// R: xdf <- read.csv("mtcars.csv")
var xdf = try! DataFrame.init(contentsOfCSVFile: URL(fileURLWithPath: "mtcars.csv"))

// take a look at it
// R: print(xdf) # no more print() in further R equivalents; just assume interactive or wrap with print
print(xdf.description(options: dOpts))

┏━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃    ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃    ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0  │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.62     │      ┆
│ 1  │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.875    │      ┆
│ 2  │ 22.8     │ 4     │ 108.0    │ 93    │ 3.85     │ 2.32     │      ┆
│ 3  │ 21.4     │ 6     │ 258.0    │ 110   │ 3.08     │ 3.215    │      ┆
│ 4  │ 18.7     │ 8     │ 360.0    │ 175   │ 3.15     │ 3.44     │      ┆
│ 5  │ 18.1     │ 6     │ 225.0    │ 105   │ 2.76     │ 3.46     │      ┆
│ 6  │ 14.3     │ 8     │ 360.0    │ 245   │ 3.21     │ 3.57     │      ┆
│ 7  │ 24.4     │ 4     │ 146.7    │ 62    │ 3.69     │ 3.19     │      ┆
│ 8  │ 22.8     │ 4     │ 140.8    │ 95    │ 3.92     │ 3.15     │      ┆
│ 9  │ 19.2     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 10 │ 17.8     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 11 │ 16.4     │ 8     │ 275.8    │ 180   │ 3.07     │ 4.07     │      ┆
│ 12 │ 17.3     │ 8     │ 275.8    │ 180   │ 3.07     │ 3.73     │      ┆
│ 13 │ 15.2     │ 8     │ 275.8    │ 180   │ 3.07     │ 3.78     │      ┆
│ 14 │ 10.4     │ 8     │ 472.0    │ 205   │ 2.93     │ 5.25     │      ┆
│ 15 │ 10.4     │ 8     │ 460.0    │ 215   │ 3.0      │ 5.424    │      ┆
│ 16 │ 14.7     │ 8     │ 440.0    │ 230   │ 3.23     │ 5.345    │      ┆
│ 17 │ 32.4     │ 4     │ 78.7     │ 66    │ 4.08     │ 2.2      │      ┆
│ 18 │ 30.4     │ 4     │ 75.7     │ 52    │ 4.93     │ 1.615    │      ┆
│ 19 │ 33.9     │ 4     │ 71.1     │ 65    │ 4.22     │ 1.835    │      ┆
┢╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 12 more                                                               ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛

// dimensions
// R: dim(xdf)
print(xdf.shape)

(rows: 32, columns: 11)

// head
// R: head(xdf)
print(xdf.prefix(5).description(options: dOpts))

┏━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃   ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃   ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.62     │      ┆
│ 1 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.875    │      ┆
│ 2 │ 22.8     │ 4     │ 108.0    │ 93    │ 3.85     │ 2.32     │      ┆
│ 3 │ 21.4     │ 6     │ 258.0    │ 110   │ 3.08     │ 3.215    │      ┆
│ 4 │ 18.7     │ 8     │ 360.0    │ 175   │ 3.15     │ 3.44     │      ┆
└───┴──────────┴───────┴──────────┴───────┴──────────┴──────────┴╌╌╌╌╌╌┘

// tail
// R: tail(xdf)
print(xdf.suffix(5).description(options: dOpts))

┏━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃    ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃    ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 27 │ 30.4     │ 4     │ 95.1     │ 113   │ 3.77     │ 1.513    │      ┆
│ 28 │ 15.8     │ 8     │ 351.0    │ 264   │ 4.22     │ 3.17     │      ┆
│ 29 │ 19.7     │ 6     │ 145.0    │ 175   │ 3.62     │ 2.77     │      ┆
│ 30 │ 15.0     │ 8     │ 301.0    │ 335   │ 3.54     │ 3.57     │      ┆
│ 31 │ 21.4     │ 4     │ 121.0    │ 109   │ 4.11     │ 2.78     │      ┆
└────┴──────────┴───────┴──────────┴───────┴──────────┴──────────┴╌╌╌╌╌╌┘

// column summaries
// summary(xdf)
print(xdf.summaryOfAllColumns().description(options: dOpts))

┏━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳╍╍╍╍╍╍┓
┃   ┃ count(mpg) ┃ uniqueCou… ┃ top(mpg) ┃ topFreque… ┃ count(cyl) ┃ 39   ┇
┃   ┃ <Int>      ┃ <Int>      ┃ <Double> ┃ <Int>      ┃ <Int>      ┃ more ┇
┡━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0 │ 32         │ 25         │ 21.4     │ 2          │ 32         │      ┆
└───┴────────────┴────────────┴──────────┴────────────┴────────────┴╌╌╌╌╌╌┘

// sort it
// R: library(tidyverse) # assume this going forward for R examples
// R: arrange(xdf, cyl)
xdf.sort(on: "cyl")

print(xdf.description(options: dOpts))

┏━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃    ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃    ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0  │ 22.8     │ 4     │ 108.0    │ 93    │ 3.85     │ 2.32     │      ┆
│ 1  │ 24.4     │ 4     │ 146.7    │ 62    │ 3.69     │ 3.19     │      ┆
│ 2  │ 22.8     │ 4     │ 140.8    │ 95    │ 3.92     │ 3.15     │      ┆
│ 3  │ 32.4     │ 4     │ 78.7     │ 66    │ 4.08     │ 2.2      │      ┆
│ 4  │ 30.4     │ 4     │ 75.7     │ 52    │ 4.93     │ 1.615    │      ┆
│ 5  │ 33.9     │ 4     │ 71.1     │ 65    │ 4.22     │ 1.835    │      ┆
│ 6  │ 21.5     │ 4     │ 120.1    │ 97    │ 3.7      │ 2.465    │      ┆
│ 7  │ 27.3     │ 4     │ 79.0     │ 66    │ 4.08     │ 1.935    │      ┆
│ 8  │ 26.0     │ 4     │ 120.3    │ 91    │ 4.43     │ 2.14     │      ┆
│ 9  │ 30.4     │ 4     │ 95.1     │ 113   │ 3.77     │ 1.513    │      ┆
│ 10 │ 21.4     │ 4     │ 121.0    │ 109   │ 4.11     │ 2.78     │      ┆
│ 11 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.62     │      ┆
│ 12 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.875    │      ┆
│ 13 │ 21.4     │ 6     │ 258.0    │ 110   │ 3.08     │ 3.215    │      ┆
│ 14 │ 18.1     │ 6     │ 225.0    │ 105   │ 2.76     │ 3.46     │      ┆
│ 15 │ 19.2     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 16 │ 17.8     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 17 │ 19.7     │ 6     │ 145.0    │ 175   │ 3.62     │ 2.77     │      ┆
│ 18 │ 18.7     │ 8     │ 360.0    │ 175   │ 3.15     │ 3.44     │      ┆
│ 19 │ 14.3     │ 8     │ 360.0    │ 245   │ 3.21     │ 3.57     │      ┆
┢╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 12 more                                                               ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛

// read in a JSON File
// R: xdf2 <- jsonlite::fromJSON("mtcars.json")
var xdf2 = try! DataFrame.init(contentsOfJSONFile: URL(fileURLWithPath: "mtcars.json"))

// bind the rows together
// R: xdf <- bind_rows(xdf, xdf2)
xdf.append(xdf2)

// get the new summary
// R: summary(xdf)
print(xdf.summaryOfAllColumns().description(options: dOpts))

┏━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳╍╍╍╍╍╍┓
┃   ┃ count(mpg) ┃ uniqueCou… ┃ top(mpg) ┃ topFreque… ┃ count(cyl) ┃ 39   ┇
┃   ┃ <Int>      ┃ <Int>      ┃ <Double> ┃ <Int>      ┃ <Int>      ┃ more ┇
┡━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0 │ 64         │ 25         │ 21.4     │ 4          │ 64         │      ┆
└───┴────────────┴────────────┴──────────┴────────────┴────────────┴╌╌╌╌╌╌┘

// basic filtering
// R: xdf.filter(cyl == 6)
print( xdf.filter(on: "cyl", Int.self) { (val) in val == 6 } )

┏━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃    ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃    ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 11 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.62     │      ┆
│ 12 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.875    │      ┆
│ 13 │ 21.4     │ 6     │ 258.0    │ 110   │ 3.08     │ 3.215    │      ┆
│ 14 │ 18.1     │ 6     │ 225.0    │ 105   │ 2.76     │ 3.46     │      ┆
│ 15 │ 19.2     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 16 │ 17.8     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 17 │ 19.7     │ 6     │ 145.0    │ 175   │ 3.62     │ 2.77     │      ┆
│ 32 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.62     │      ┆
│ 33 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.875    │      ┆
│ 35 │ 21.4     │ 6     │ 258.0    │ 110   │ 3.08     │ 3.215    │      ┆
┢╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 4 more                                                                ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛

// group by a column
// R: group_by(xdf, cyl)
print(xdf.grouped(by: "cyl"))

4
┏━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃   ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃   ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0 │ 22.8     │ 4     │ 108.0    │ 93    │ 3.85     │ 2.32     │      ┆
│ 1 │ 24.4     │ 4     │ 146.7    │ 62    │ 3.69     │ 3.19     │      ┆
│ 2 │ 22.8     │ 4     │ 140.8    │ 95    │ 3.92     │ 3.15     │      ┆
│ 3 │ 32.4     │ 4     │ 78.7     │ 66    │ 4.08     │ 2.2      │      ┆
│ 4 │ 30.4     │ 4     │ 75.7     │ 52    │ 4.93     │ 1.615    │      ┆
│ 5 │ 33.9     │ 4     │ 71.1     │ 65    │ 4.22     │ 1.835    │      ┆
│ 6 │ 21.5     │ 4     │ 120.1    │ 97    │ 3.7      │ 2.465    │      ┆
│ 7 │ 27.3     │ 4     │ 79.0     │ 66    │ 4.08     │ 1.935    │      ┆
│ 8 │ 26.0     │ 4     │ 120.3    │ 91    │ 4.43     │ 2.14     │      ┆
│ 9 │ 30.4     │ 4     │ 95.1     │ 113   │ 3.77     │ 1.513    │      ┆
┢╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 12 more                                                              ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛

6
┏━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃    ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃    ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 11 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.62     │      ┆
│ 12 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.875    │      ┆
│ 13 │ 21.4     │ 6     │ 258.0    │ 110   │ 3.08     │ 3.215    │      ┆
│ 14 │ 18.1     │ 6     │ 225.0    │ 105   │ 2.76     │ 3.46     │      ┆
│ 15 │ 19.2     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 16 │ 17.8     │ 6     │ 167.6    │ 123   │ 3.92     │ 3.44     │      ┆
│ 17 │ 19.7     │ 6     │ 145.0    │ 175   │ 3.62     │ 2.77     │      ┆
│ 32 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.62     │      ┆
│ 33 │ 21.0     │ 6     │ 160.0    │ 110   │ 3.9      │ 2.875    │      ┆
│ 35 │ 21.4     │ 6     │ 258.0    │ 110   │ 3.08     │ 3.215    │      ┆
┢╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 4 more                                                                ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛

8
┏━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃    ┃ mpg      ┃ cyl   ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ 5    ┇
┃    ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 18 │ 18.7     │ 8     │ 360.0    │ 175   │ 3.15     │ 3.44     │      ┆
│ 19 │ 14.3     │ 8     │ 360.0    │ 245   │ 3.21     │ 3.57     │      ┆
│ 20 │ 16.4     │ 8     │ 275.8    │ 180   │ 3.07     │ 4.07     │      ┆
│ 21 │ 17.3     │ 8     │ 275.8    │ 180   │ 3.07     │ 3.73     │      ┆
│ 22 │ 15.2     │ 8     │ 275.8    │ 180   │ 3.07     │ 3.78     │      ┆
│ 23 │ 10.4     │ 8     │ 472.0    │ 205   │ 2.93     │ 5.25     │      ┆
│ 24 │ 10.4     │ 8     │ 460.0    │ 215   │ 3.0      │ 5.424    │      ┆
│ 25 │ 14.7     │ 8     │ 440.0    │ 230   │ 3.23     │ 5.345    │      ┆
│ 26 │ 15.5     │ 8     │ 318.0    │ 150   │ 2.76     │ 3.52     │      ┆
│ 27 │ 15.2     │ 8     │ 304.0    │ 150   │ 3.15     │ 3.435    │      ┆
┢╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 18 more                                                               ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛

// number of groups
// R: group_by(xdf, cyl) %>% group_keys() %>% nrow()
print(xdf.grouped(by: "cyl").count)

3

// group, manipulate (in this case, filter), and re-combine
// R: group_by(xdf) %>% filter(mpg < 20) %>% ungroup()
print(
  xdf.grouped(by: "cyl").mapGroups { (val) in
    val.filter(on: "mpg", Double.self) { (val) in val! < 20 }.base
  }.ungrouped()
)

┏━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃   ┃ mpg      ┃ disp     ┃ hp    ┃ drat     ┃ wt       ┃ qsec     ┃ 5    ┇
┃   ┃ <Double> ┃ <Double> ┃ <Int> ┃ <Double> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0 │ 22.8     │ 108.0    │ 93    │ 3.85     │ 2.32     │ 18.61    │      ┆
│ 1 │ 24.4     │ 146.7    │ 62    │ 3.69     │ 3.19     │ 20.0     │      ┆
│ 2 │ 22.8     │ 140.8    │ 95    │ 3.92     │ 3.15     │ 22.9     │      ┆
│ 3 │ 32.4     │ 78.7     │ 66    │ 4.08     │ 2.2      │ 19.47    │      ┆
│ 4 │ 30.4     │ 75.7     │ 52    │ 4.93     │ 1.615    │ 18.52    │      ┆
│ 5 │ 33.9     │ 71.1     │ 65    │ 4.22     │ 1.835    │ 19.9     │      ┆
│ 6 │ 21.5     │ 120.1    │ 97    │ 3.7      │ 2.465    │ 20.01    │      ┆
│ 7 │ 27.3     │ 79.0     │ 66    │ 4.08     │ 1.935    │ 18.9     │      ┆
│ 8 │ 26.0     │ 120.3    │ 91    │ 4.43     │ 2.14     │ 16.7     │      ┆
│ 9 │ 30.4     │ 95.1     │ 113   │ 3.77     │ 1.513    │ 16.9     │      ┆
┢╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 182 more                                                                ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛

// look at one column
// R: xdf$cyl
print( xdf["cyl"] )

┏━━━━━━━┓
┃ cyl   ┃
┃ <Int> ┃
┡━━━━━━━┩
│ 4     │
│ 4     │
│ 4     │
│ 4     │
│ 4     │
│ 4     │
│ 4     │
│ 4     │
│ 4     │
│ 4     │
┢╍╍╍╍╍╍╍┪
┇ 54 m… ┇
┗╍╍╍╍╍╍╍┛

// combine two columns and look at it
// R: mutate(xdf, cyl_mpg = sprintf("%s:%s", cyl, mpg) %>% select(-cyl, -mpg)
// R: unite(xdf, cyl_mpg, cyl, mpg, sep = ":") # alternate way
xdf.combineColumns("cyl", "mpg", into: "cyl_mpg") { (val1: Int?, val2: Double?) -> String in
  String(val1 ?? 0) + ":" + String(val2 ?? 0.0)
}

print(xdf["cyl_mpg"])

┏━━━━━━━━━━┓
┃ cyl_mpg  ┃
┃ <String> ┃
┡━━━━━━━━━━┩
│ 4:22.8   │
│ 4:24.4   │
│ 4:22.8   │
│ 4:32.4   │
│ 4:30.4   │
│ 4:33.9   │
│ 4:21.5   │
│ 4:27.3   │
│ 4:26.0   │
│ 4:30.4   │
┢╍╍╍╍╍╍╍╍╍╍┪
┇ 54 more  ┇
┗╍╍╍╍╍╍╍╍╍╍┛

// look at the colnames (^^ removes "cyl" and "mpg"
// R: colnames(xdf)
print(xdf.columns.map{ col in col.name })

["cyl_mpg", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"]

// turn an Int into a Double
// R: xdf$hp <- as.double(xdf$hp) # or use dplyr::mutate()
xdf.transformColumn("hp") { (val1: Int?) -> Double? in
  Double(val1 ?? 0)
}

print(xdf["hp"])

┏━━━━━━━━━━┓
┃ hp       ┃
┃ <Double> ┃
┡━━━━━━━━━━┩
│ 93.0     │
│ 62.0     │
│ 95.0     │
│ 66.0     │
│ 52.0     │
│ 65.0     │
│ 97.0     │
│ 66.0     │
│ 91.0     │
│ 113.0    │
┢╍╍╍╍╍╍╍╍╍╍┪
┇ 54 more  ┇
┗╍╍╍╍╍╍╍╍╍╍┛

// look at the coltypes
// R: sapply(mtcars, typeof)
print(xdf.columns.map{ col in col.wrappedElementType })

[Swift.String, Swift.Double, Swift.Double, Swift.Double, Swift.Double, Swift.Double, Swift.Int, Swift.Int, Swift.Int, Swift.Int]

// distinct horsepower
// R: distinct(xdf, hp)
print(xdf["hp"].distinct())

┏━━━━━━━━━━┓
┃ hp       ┃
┃ <Double> ┃
┡━━━━━━━━━━┩
│ 93.0     │
│ 62.0     │
│ 95.0     │
│ 66.0     │
│ 52.0     │
│ 65.0     │
│ 97.0     │
│ 91.0     │
│ 113.0    │
│ 109.0    │
┢╍╍╍╍╍╍╍╍╍╍┪
┇ 12 more  ┇
┗╍╍╍╍╍╍╍╍╍╍┛

// row slices
// R: xdf[10,]
print(xdf.rows[10])

┏━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃    ┃ cyl_mpg  ┃ disp     ┃ hp       ┃ drat     ┃ wt       ┃ qsec     ┃ 4    ┇
┃    ┃ <String> ┃ <Double> ┃ <Double> ┃ <Double> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 10 │ 4:21.4   │ 121.0    │ 109.0    │ 4.11     │ 2.78     │ 18.6     │      ┆
└────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴╌╌╌╌╌╌┘

// R: xdf[3:10,]
print(xdf.rows[3...10])

Rows(base: 
┏━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳╍╍╍╍╍╍┓
┃   ┃ cyl_mpg  ┃ disp     ┃ hp       ┃ drat     ┃ wt       ┃ qsec     ┃ 4    ┇
┃   ┃ <String> ┃ <Double> ┃ <Double> ┃ <Double> ┃ <Double> ┃ <Double> ┃ more ┇
┡━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇╍╍╍╍╍╍┩
│ 0 │ 4:22.8   │ 108.0    │ 93.0     │ 3.85     │ 2.32     │ 18.61    │      ┆
│ 1 │ 4:24.4   │ 146.7    │ 62.0     │ 3.69     │ 3.19     │ 20.0     │      ┆
│ 2 │ 4:22.8   │ 140.8    │ 95.0     │ 3.92     │ 3.15     │ 22.9     │      ┆
│ 3 │ 4:32.4   │ 78.7     │ 66.0     │ 4.08     │ 2.2      │ 19.47    │      ┆
│ 4 │ 4:30.4   │ 75.7     │ 52.0     │ 4.93     │ 1.615    │ 18.52    │      ┆
│ 5 │ 4:33.9   │ 71.1     │ 65.0     │ 4.22     │ 1.835    │ 19.9     │      ┆
│ 6 │ 4:21.5   │ 120.1    │ 97.0     │ 3.7      │ 2.465    │ 20.01    │      ┆
│ 7 │ 4:27.3   │ 79.0     │ 66.0     │ 4.08     │ 1.935    │ 18.9     │      ┆
│ 8 │ 4:26.0   │ 120.3    │ 91.0     │ 4.43     │ 2.14     │ 16.7     │      ┆
│ 9 │ 4:30.4   │ 95.1     │ 113.0    │ 3.77     │ 1.513    │ 16.9     │      ┆
┢╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍╍╍╍╍┷╍╍╍╍╍╍┪
┇ 54 more                                                                    ┇
┗╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍╍┛
, subranges: _RangeSet(3..<11))

A {swiftr} Brief Interlude While Awaiting {cdcfluview} CRAN Checks

My {cdcfluview} package started tossing erros on CRAN just over a week ago when the CDC added an extra parameter to one of the hidden API endpoints that the package wraps. After a fairly hectic set of days since said NOTE came, I had time this morning to poke at a fix. There are alot of tests, so after successful debugging session I was awaiting CRAN checks on various remotes as well as README builds and figured I’d keep up some practice with another, nascent, package of mine, {swiftr}, which makes it dead simple to build R functions from Swift code, in similar fashion to what Rcpp::cppFunction() does for C/C++ code.

macOS comes with a full set of machine learning/AI libraries/frameworks that definitely have “batteries included” (i.e. you can almost just make one function call to get 90-95% what you want without even training new models). One of which is text extraction from Apple’s computer Vision framework. I thought it’d be a fun and quick “wait mode” distraction to wrap the VNRecognizeTextRequest() function and use it from R.

To show how capable the default model is, I pulled a semi-complex random image from DDG’s image search:

Yellow street signs against clear blue sky pointing different directions. Each plate on the street sign has a specific term like unsure, muddled, coonfused and so on. Dilemma and confusion concept. horizontal composition with copy space. Clipping path is included.

Let’s build the function (you need to be on macOS for this; exposition inine):

library(swiftr) # github.com/hrbrmstr/swiftr

swift_function(
  code = '
import Foundation
import CoreImage
import Cocoa
import Vision

@_cdecl ("detect_text")
public func detect_text(path: SEXP) -> SEXP {

   // turn R string into Swift String so we can use it
   let fileName = String(cString: R_CHAR(STRING_ELT(path, 0)))

   var res: String = ""
   var out: SEXP = R_NilValue

  // get image into the right format
  if let ciImage = CIImage(contentsOf: URL(fileURLWithPath:fileName)) {

    let context = CIContext(options: nil)
    if let img = context.createCGImage(ciImage, from: ciImage.extent) {

      // setup comptuer vision request
      let requestHandler = VNImageRequestHandler(cgImage: img)

      // start recognition
      let request = VNRecognizeTextRequest()
      do {
        try requestHandler.perform([request])

        // if we have results
        if let observations = request.results as? [VNRecognizedTextObservation] {

          // paste them together
          let recognizedStrings = observations.compactMap { observation in
            observation.topCandidates(1).first?.string
          }
          res = recognizedStrings.joined(separator: "\\n")
        }
      } catch {
        debugPrint("\\(error)")
      }
    }
  }

  res.withCString { cstr in out = Rf_mkString(cstr) }

  return(out)
}
')

The detect_text() is now available in R, so let’s see how it performs on that image of signs:

detect_text(path.expand("~/Data/signs.jpeg")) %>% 
  stringi::stri_split_lines() %>% 
  unlist()
##  [1] "BEWILDERED" "UNCLEAR"    "nAZEU"      "UNCERTAIN"  "VISA"       "INSURE"    
##  [7] "ATED"       "MUDDLED"    "LOsT"       "DISTRACTED" "PERPLEXED"  "CONFUSED"  
## [13] "PUZZLED"

It works super-fast and gets far more correct than I would have expected.

Toy examples aside, it also works pretty well (as one would expect) on “real” text images, such as this example from the Tesseract test suite:

detect_text(path.expand("~/Data/tesseract/news.3B/0/8200_006.3B.tif")) %>% 
  stringi::stri_split_lines() %>% 
  unlist()
##  [1] "Tobacco chiefs still refuse to see the truth abou"                           
##  [2] "even of America's least conscionable"                                        
##  [3] "The tobacco industry would like to promote"                                  
##  [4] "men sat together in Washington last"                                         
##  [5] "under the conditions they are used.'"                                        
##  [6] "week to do what they do best: blow"                                          
##  [7] "the specter of prohibition."                                                 
##  [8] "panel\" of toxicologists as \"not hazardous"                                 
##  [9] "smoke at the truth about cigarettes."                                        
## [10] "'If cigarettes are too dangerous to be sold,"                                
## [11] "then ban them. Some smokers will obey the"                                   
## [12] "People not paid by the tobacco companies"                                    
## [13] "aren't so sure. The list includes several"                                   
## [14] "The CEOs of the nation's largest tobacco"                                    
## [15] "firms told congressional panel that nicotine"                                
## [16] "law, but many will not. People will be selling"                              
## [17] "iS not addictive, that they are unconvinced"                                 
## [18] "cigarettes out of the trunks of cars, cigarettes"                            
## [19] "substances the government does not allow in"                                 
## [20] "foods or classifies as potentially toxic. They"                              
## [21] "that smoking causes lung cancer or any other"                                
## [22] "made by who knows who, made of who knows include ammonia, a pesticide called"
## [23] "illness, and that smoking is no more harmful"                                
## [24] "what,\" said James Johnston of R.J. Reynolds."                               
## [25] "than drinking coffee or eating Twinkies."                                    
## [26] "It's a ruse. He knows cigarettes are not"                                    
## [27] "methoprene, and ethyl furoate, which has"                                    
## [28] "They said these things with straight taces."                                 
## [29] "going to be banned, at leasi not in his lifetime."                           
## [30] "caused liver damage in rats."                                                
## [31] "The list \"begs a number of important"                                       
## [32] "They said them in the face of massive"                                       
## [33] "STEVE WILSON"                                                                
## [34] "What he really fears are new taxes, stronger"                                
## [35] "questions about the safety of these additives,\""                            
## [36] "scientific evidence that smoking is responsible"                             
## [37] "anti-smoking campaigns, further smoking"                                     
## [38] "said a joint statement from the American"                                    
## [39] "for more than 400,000 deaths every year."                                    
## [40] "restrictions, limits on secondhand smoke and"                                
## [41] "Rep. Henry Waxman, D-Calif., put that"                                       
## [42] "Republic Columnist"                                                          
## [43] "Lung, Cancer and Heart associations. The"                                    
## [44] "limits on tar and nicotine."                                                 
## [45] "statement added that substances safe to eat"                                 
## [46] "frightful statistic another way:"                                            
## [47] "Collectively, these steps can accelerate the"                                
## [48] "\"Imagine our nation's outrage if two fully"                                 
## [49] "He and the others played dumb for the"                                       
## [50] "current 5 percent annual decline in cigarette"                               
## [51] "aren't necessarily safe to inhale."                                          
## [52] "The 50-page list can be obtained free by"                                    
## [53] "loaded jumbo jets crashed each day, killing all"                             
## [54] "entire six hours, but really didn't matter."                                 
## [55] "use and turn the tobacco business from highly"                               
## [56] "calling 1-800-852-8749."                                                     
## [57] "aboard. That's the same number of Americans"                                 
## [58] "The game i nearly over, and the tobacco"                                     
## [59] "profitable to depressed."                                                    
## [60] "Johnson's comment about cigarettes \"made"                                   
## [61] "Here are just the 44 ingredients that start"                                 
## [62] "that cigarettes kill every 24 hours.'"                                       
## [63] "executives know it."                                                         
## [64] "with the letter \"A\":"                                                      
## [65] "The CEOs were not impressed."                                                
## [66] "The hearing marked a turning point in the"                                   
## [67] "of who knows what\" was comical."                                            
## [68] "Acetanisole, acetic acid, acetoin,"                                          
## [69] "\"We have looked at the data."                                               
## [70] "It does"                                                                     
## [71] "nation's growing aversion to cigarettes. No"                                 
## [72] "The day before the hearing, the tobacco"                                     
## [73] "acetophenone,6-acetoxydihydrotheaspirane,"                                   
## [74] "not convince me that smoking causes death,\""                                
## [75] "2-acetyl-3-ethylpyrazine, 2-acetyl-5-"                                       
## [76] "said Andrew Tisch of the Lorillard Tobacco"                                  
## [77] "longer hamstrung by tobacco-state seniority"                                 
## [78] "companies released a long-secret list of 599"                                
## [79] "Co."                                                                         
## [80] "and the deep-pocketed tobacco lobby,"                                        
## [81] "methylfuran, acetylpyrazine, 2-acetylpyridine,"                              
## [82] "Congress is taking aim at cigarette makers."                                 
## [83] "additives used in cigarettes. The companies"                                 
## [84] "said all are certified by an \"independent"                                  
## [85] "3-acetylpyridine, 2-acetylthiazole, aconitic"

(You can compare that on your own with the Tesseract results.)

FIN

{cdcfluview} checks are done, and the fixed functions are back on CRAN! Just in time to close out this post.

If you’re on macOS, definitely check out the various ML/AI frameworks Apple has to offer via Swift and have some fun playing with integrating them into R (or build some small, command line utilities if you want to keep Swift and R apart).