Category Archives: data driven security

Your Logs Contain a Plea for Assistance In a Packet Bottle and You Likely Never Noticed It

2026-07-09 – 08:34
Posted in Cybersecurity, data driven security, Information Security, pcap
Comments (2)

HMEFB: Anatomy of a Belarusian Worm

I found this story because of a re-post on Mastodon by Fritz Adalis regarding a suspicious User-Agent by Jason Callahan of SANS’s ISC. He found something odd in his DShield honeypot logs: a URI path that read /?_HELP_ME_ESCAPE_FROM_BELARUS_PLEASE_. In the User-Agent header, an email address: HelpMeEscapeFromBelarus@proton.me.

The SANS diary — isc.sans.edu/diary/33130 — laid out what Callahan found: around a dozen HTTP requests over a two-month period, from IPs scattered globally with no discernible pattern — pointing to a self-propagating bot rather than a single attacker. A Reddit thread on r/selfhosted described the same requests hitting a Traefik reverse proxy. Someone in that thread emailed the address in the User-Agent and got a reply pointing to a page on a free web hosting service.

I and Glenn Thorpe run a small fleet of honeypots in our scant spare time. After reading the SANS diary I pulled a PCAP of similar sessions our of our collector and started picking it apart. This contents of this post is what I found.

Following the link

The page that Reddit user got pointed to (it’s also in one of the user agent strings) is hmefb.fwh.is/?i=1, titled “HMEFB // PROJECT_ROOT”. It’s a manifesto from someone named Alex, a 27-year-old engineer living in Belarus. Full text is at that URL (NOTE: some “security” products flag it as unsafe). Here’s the skinny on it.

Alex describes a self-propagating worm that scans random IP addresses for open HTTP and SSH ports. On HTTP:

It scans random IP addresses for open HTTP ports (TCP 80, 8000, 8080, etc.) and SSH ports (TCP 22, 2222). If it finds an open HTTP port, it simply sends a request to the server using a random method (GET, CONNECT, or HEAD).

On SSH:

If it finds an open SSH port, it begins a password brute-force attack, but only using default combinations like admin:admin, root:root, or support:support. No exploits, no other malicious actions.

He’s explicit about the lifecycle:

The bot is also fully autonomous – it doesn’t connect to a command-and-control server and runs entirely on its own. It only reports discovered IP and login:password pairs back to a loader. Additionally, the bot has a built-in timer: six months after it starts, it self-terminates. If your device has become part of this network of spreader bots, simply reboot it (in theory, at least). The bot doesn’t establish persistence on the system and usually runs from /tmp.

Alex isn’t crafting a CVE writeup. He appears to be trying to leave Belarus. He frames the worm as a project, attempts transparency about its harm, and explicitly invites contact:

I am not interested in funding or sponsorship in any form. Please view this as a highly specific performance piece – one without parallels, as far as I’ve been able to find.

He also notes he’ll be cut off from the internet starting May 19, 2026. Keep that date in mind.

The SANS diary takes a skeptical stance on all of this, and rightly so:

Sob stories and appeals to sympathy are also a known social-engineering lever, and a URI designed to make analysts pause and read a web page rather than immediately blocklist an IP is an effective way to buy a scanner some goodwill.

Callahan’s conclusion: “treat it as an untrusted, credential-guessing scanner.” Fair. But the PCAP tells a more interesting story than either the manifesto or the SANS diary alone.

The capture

Over roughly forty days — May 10 through June 19, 2026 — our tiny fleet received the kind of probe Alex describes. The capture file I extracted has 2,167 packets, 318 KB, all TCP/IP, all HTTP, no TLS.

A quick protocol hierarchy check via tshark (the Wireshark command-line packet analyzer):

$ tshark -r belarus-sessions.PCAP -q -z io,phs
===================================================================
Protocol Hierarchy Statistics
Filter:

frame                                    frames:2167 bytes:284118
  eth                                    frames:2167 bytes:284118
    ip                                   frames:2167 bytes:284118
      tcp                                frames:2167 bytes:284118
        http                             frames: 406 bytes:147498
          data-text-lines                frames: 162 bytes: 84270
===================================================================

I focused on the HTTP requests since they’re easier to find, though I did confirm there were SSH shenanigans going on in the same time frame. The HTTP number around 205, using two distinct User-Agent strings; the remaining packets are TCP handshakes, ACKs, and response payloads.

Forty days, small packet count. That’s consistent with the worm’s described behavior: one request per open port, then move on. Not a high-volume scanner.

Two user-agent variants, two distinct campaigns

The HTTP requests split into two distinct populations based on User-Agent. The original variant – 159 requests:

Wget/1.21.4 (HelpMeEscapeFromBelarus@proton.me)

The extended variant – 46 requests:

Wget/1.21.4 ( https://hmefb.fwh.is | HelpMeEscapeFromBelarus@proton.me)

The two variants targeted different sets of ports and came from different kinds of infrastructure. That’s where it gets interesting.

Different ports, different responses

The 159 requests with the original User-Agent targeted ports 80, 81, 8000, and 8080 – exactly the ports Alex names in his description. The 46 requests with the extended User-Agent targeted a different set entirely: 82, 8002, 8088, 8090, 9000, and 9090.

The extended User-Agent variant and non-standard port probing overlapped exactly. Both came exclusively from TOR exit nodes. Both stopped May 17, 2026.

The TOR exit node revelation

Of the 44 unique source IPs in the PCAP, 31 are listed on the TOR exit node list. TOR – The Onion Router – anonymizes traffic by bouncing it through multiple volunteer-operated relays before it exits to the public internet; the destination sees the exit node’s IP rather than the original requester’s. Seventy percent of the source IPs in this capture are exit nodes.

Cross-referenced against geolocation and ASN data, the TOR exit ranges break down as:

TOR exit operator	ASN	Country	Source IPs in PCAP
Church of Cyberology	AS215125	NL	19
Stiftung Erneuerbare Freiheit	AS60729	DE	6
QuxLabs AB (R0cket Cloud)	AS214503	SW	4
The Infrastructure Group B.V.	AS60404	NL	1
AON	AS214094	IT	1

All 31 TOR exit IPs used the extended hmefb.fwh.is User-Agent variant. None of the non-TOR sources did. That’s a deliberate OPSEC choice. Whoever ran those TOR-routed probes wanted them identifiable as a separate campaign – wanted the URL traced back to HMEFB.

The compromised hosts

The remaining 13 source IPs do not appear to be TOR exit nodes. They’re a mix residential ISPs and VPS providers across several continents:

ISP	Country	Packets
R L A World Net LTDA	BR	24
INPL’s	IN	22
BSNL Internet	IN	12
Bredband2 Stockholms Stadsnaet AB	SW	12
China Unicom CHINA169 Jiangsu Province	CN	12
Chinanet	CN	12
Cox Communications Inc.	US	12
Interlink Comunicatii SRL	MD	12
TELUS Communications Inc	CA	12
TOT Public Company Limited	TH	12
Charter Communications	US	10
CNC Group CHINA169 Shan1xi Province	CN	6
StarHub Cable Vision Ltd	SG	1

(Counts are packets per ISP, reflecting scan volume from that operator’s range.)

These are the worm’s actual victims – exactly the kinds of hosts a low-and-slow self-propagating scanner lands on: routers, IoT devices, VPS instances, home networks running default SSH credentials. The IPs come from cable and fiber residential ranges or small-business ISP blocks (not Hetzner, DigitalOcean, AWS, etc.). The worm is living on consumer gear.

Alex confirms this reading:

That line in your logs is the work of a bot. It’s harmless by design but operates like a worm.

The timeline

The PCAP’s burst of TOR-routed probing lines up precisely with Alex’s note on the HMEFB page:

Also, starting from the 19th, I will be cut off from the outside world and likely unable to follow how the situation unfolds or respond to messages.

The TOR-routed probes stopped May 17, 2026 – two days before the date he named. Whatever was happening through TOR, it stopped on schedule.

The SANS ISC diary notes the bot was first reported to ISC in May 2026, with reports peaking shortly after the first sighting before a sharp drop — consistent with what the PCAP shows.

What the worm is, and what it isn’t

Callahan’s SANS diary concludes: “treat it as an untrusted, credential-guessing scanner.” That’s the right defensive posture. The HMEFB worm isn’t malware in the conventional sense – it’s a network scanner that propagates itself, runs from /tmp, persists for six months at most, and only attempts default SSH credentials. It doesn’t exploit anything. The only data it reports back to a loader is IP plus credential pairs – not exfiltrated files, not session tokens.

But “not malware in the conventional sense” doesn’t mean harmless. Self-propagating scanners consume bandwidth, hammer SSH daemons with failed login attempts, and embed themselves on devices that already have weak security postures. If you find it on a host you operate: kill the process, rotate any default SSH credentials, and audit whether port 22 actually needs to be publicly exposed. Rebooting (in theory) clears it – the manifesto says so, and the PCAP and SSH session behavior appears consistent with that claim. But, you never can tell with bees.

What the PCAP adds to the picture is the two-campaign structure. The worm spreading on compromised consumer gear across a dozen countries is one thing. A separate, deliberate set of probes routed through TOR exit nodes, using an extended User-Agent that points back to a named URL, hitting non-standard ports on honeypots — that’s a different kind of operation. Someone running probes through TOR and signing them with a link to their manifesto wanted those probes found and read.

The SANS diary understandably focuses on the defensive posture. The PCAP shows you can do both: block the scanner, clean the compromised hosts, and still recognize that someone out there is using a worm as a job application.

The hmefb.fwh.is page closes with:

Thank you for reading this rambling monologue. I hope I haven’t caused you any inconvenience.

[ Send A Message ]

[un]prompted Spring 2026: Threat Hunting In The Matrix

2026-04-06 – 11:18
Posted in AI, Cybersecurity, Data Analysis, data driven security, data science
Leave a Comment

At our previous employer, the global deception and detection infrastructure generates tons of events that eventually make their way into an ever-growing data lake with (as of February 2026) 22 TB of PCAPs and 32 TB of session protocol data. When trying to find novel and truly dangerous attacker behavior, the bottleneck isn’t the data — it’s the analyst trying to hold it all in their head while toggling between Arkime, Censys, VirusTotal, and five other tabs.

Glenn Thorpe and I built Orbie to attack that problem. It’s a prompt-engineered analytical system running in Claude Code that coordinates 16 data source integrations, 8 investigation skills, and 2 background enrichment agents across structured, reproducible workflows — with one rule we never bent on: never assume, always query, show your work.

The full architecture, the failure modes, and where it’s going are in the talk we gave at the February 2026 installment of [un]prompted, above, and you can get some more info and freebies at https://github.com/GreyNoise-Intelligence/2026-labs-unprompted.

There’s going to be another [un]prompted likely later this year and I highly recommend attending and — if you have some of your own accomplishments to share — presenting. It was an incredible experience.

Acoustic: Solving a CyberDefenders PCAP SIP/RTP Challenge with R, Zeek, tshark (& friends)

2021-07-25 – 09:40
Posted in Cybersecurity, Data Analysis, data driven security, data wrangling, Information Security, pcap, R
Comments (3)

Hot on the heels of the previous CyberDefenders Challenge Solution comes this noisy installment which solves their Acoustic challenge.

You can find the source Rmd on GitHub, but I’m also testing the limits of WP’s markdown rendering and putting it in-stream as well.

No longer book expository this time since much of the setup/explanatory bits from it apply here as well).

Acoustic

Convert the PCAP
Examine and Process log.txt
Process Zeek Logs
Process Packet Summary
What is the transport protocol being used?
The attacker used a bunch of scanning tools that belong to the same suite. Provide the name of the suite.
“What is the User-Agent of the victim system?”
Which tool was only used against the following extensions: 100, 101, 102, 103, and 111?
Which extension on the honeypot does NOT require authentication?
How many extensions were scanned in total?
There is a trace for a real SIP client. What is the corresponding user-agent? (two words, once space in between)
Multiple real-world phone numbers were dialed. Provide the first 11 digits of the number dialed from extension 101?
What are the default credentials used in the attempted basic authentication? (format is username:password)
Which codec does the RTP stream use? (3 words, 2 spaces in between)
How long is the sampling time (in milliseconds)?
What was the password for the account with username 555?
Which RTP packet header field can be used to reorder out of sync
RTP packets in the correct sequence?
The trace includes a secret hidden message. Can you hear
it?

This challenge takes us “into the world of voice communications on the internet. VoIP is becoming the de-facto standard for voice communication. As this technology becomes more common, malicious parties have more opportunities and stronger motives to control these systems to conduct nefarious activities. This challenge was designed to examine and explore some of the attributes of the SIP and RTP protocols.”

We have two files to work with:

log.txt which was generated from an unadvertised, passive honeypot located on the internet such that any traffic destined to it must be nefarious. Unknown parties scanned the honeypot with a range of tools, and this activity is represented in the log file.
- The IP address of the honeypot has been changed to “honey.pot.IP.removed”. In terms of geolocation, pick your favorite city.
- The MD5 hash in the authorization digest is replaced with “MD5_hash_removedXXXXXXXXXXXXXXXX”
- Some octets of external IP addresses have been replaced with an “X”
- Several trailing digits of phone numbers have been replaced with an “X”
- Assume the timestamps in the log files are UTC.
Voip-trace.pcap was created by honeynet members for this forensic challenge to allow participants to employ network analysis skills in the VOIP context.

There are 14 questions to answer.

If you are not familiar with SIP and/or RTP you should do a bit of research first. A good place to start is RTC 3261 (for SIP) and RFC 3550 (for RTC). Some questions may be able to be answered just by knowing the details of these protocols.

Convert the PCAP

library(stringi)
library(tidyverse)

We’ll pre-generate Zeek logs. The -C tells Zeek to not bother with checksums, -r tells it to read from a file and the LogAscii::use_json=T means we want JSON output vs the default delimited files. JSON gives us data types (the headers in the delimited files do as well, but we’d have to write something to read those types then deal with it vs get this for free out of the box with JSON).

system("ZEEK_LOG_SUFFIX=json /opt/zeek/bin/zeek -C -r src/Voip-trace.pcap LogAscii::use_json=T HTTP::default_capture_password=T")

We process the PCAP twice with tshark. Once to get the handy (and small) packet summary table, then dump the whole thing to JSON. We may need to run tshark again down the road a bit.

system("tshark -T tabs -r src/Voip-trace.pcap > voip-packets.tsv")
system("tshark -T json -r src/Voip-trace.pcap > voip-trace")

Examine and Process `log.txt`

We aren’t told what format log.txt is in, so let’s take a look:

cd_sip_log <- stri_read_lines("src/log.txt")

cat(head(cd_sip_log, 25), sep="\n")
## Source: 210.184.X.Y:1083
## Datetime: 2010-05-02 01:43:05.606584
## 
## Message:
## 
## OPTIONS sip:100@honey.pot.IP.removed SIP/2.0
## Via: SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;rport
## Content-Length: 0
## From: "sipvicious"<sip:100@1.1.1.1>; tag=X_removed
## Accept: application/sdp
## User-Agent: friendly-scanner
## To: "sipvicious"<sip:100@1.1.1.1>
## Contact: sip:100@127.0.0.1:5061
## CSeq: 1 OPTIONS
## Call-ID: 845752980453913316694142
## Max-Forwards: 70
## 
## 
## 
## 
## -------------------------
## Source: 210.184.X.Y:4956
## Datetime: 2010-05-02 01:43:12.488811
## 
## Message:

These look a bit like HTTP server responses, but we know we’re working in SIP land and if you perused the RFC you’d have noticed that SIP is an HTTP-like ASCII protocol. While some HTTP response parsers might work on these records, it’s pretty straightforward to whip up a bespoke pseudo-parser.

Let’s see how many records there are by counting the number of “Message:” lines (we’re doing this, primarily, to see if we should use the {furrr} package to speed up processing):

cd_sip_log[stri_detect_fixed(cd_sip_log, "Message:")] %>%
  table()
## .
## Message: 
##     4266

There are many, so we’ll avoid parallel processing the data and just use a single thread.

One way to tackle the parsing is to look for the stop and start of each record, extract fields (these have similar formats to HTTP headers), and perhaps have to extract content as well. We know this because there are “Content-Length:” fields. According to the RFC they are supposed to exist for every message. Let’s first see if any “Content-Length:” header records are greater than 0. We’ll do this with a little help from the ripgrep utility as it provides a way to see context before and/or after matched patterns:

cat(system('rg --after-context=10 "^Content-Length: [^0]" src/log.txt', intern=TRUE), sep="\n")
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000
## --
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000
## --
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000
## --
## Content-Length: 330
## 
## v=0
## o=Zoiper_user 0 0 IN IP4 89.42.194.X
## s=Zoiper_session
## c=IN IP4 89.42.194.X
## t=0 0
## m=audio 52999 RTP/AVP 3 0 8 110 98 101
## a=rtpmap:3 GSM/8000
## a=rtpmap:0 PCMU/8000
## a=rtpmap:8 PCMA/8000

So,we do need to account for content. It’s still pretty straightforward (explanatory comments inline):

starts <- which(stri_detect_regex(cd_sip_log, "^Source:"))
stops <- which(stri_detect_regex(cd_sip_log, "^----------"))

map2_dfr(starts, stops, ~{

  raw_rec <- stri_trim_both(cd_sip_log[.x:.y]) # target the record from the log
  raw_rec <- raw_rec[raw_rec != "-------------------------"] # remove separator

  msg_idx <- which(stri_detect_regex(raw_rec, "^Message:")) # find where "Message:" line is
  source_idx <- which(stri_detect_regex(raw_rec, "^Source: ")) # find where "Source:" line is
  datetime_idx <- which(stri_detect_regex(raw_rec, "^Datetime: ")) # find where "Datetime:" line is
  contents_idx <- which(stri_detect_regex(raw_rec[(msg_idx+2):length(raw_rec)], "^$"))[1] + 2 # get position of the "data"

  source <- stri_match_first_regex(raw_rec[source_idx], "^Source: (.*)$")[,2] # extract source
  datetime <- stri_match_first_regex(raw_rec[datetime_idx], "^Datetime: (.*)$")[,2] # extract datetime
  request <- raw_rec[msg_idx+2] # extract request line

  # build a matrix out of the remaining headers. header key will be in column 2, value will be in column 3
  tmp <- stri_match_first_regex(raw_rec[(msg_idx+3):contents_idx], "^([^:]+):[[:space:]]+(.*)$")
  tmp[,2] <- stri_trans_tolower(tmp[,2]) # lowercase the header key
  tmp[,2] <- stri_replace_all_fixed(tmp[,2], "-", "_") # turn dashes to underscores so we can more easily use the keys as column names

  contents <- raw_rec[(contents_idx+1):length(raw_rec)]
  contents <- paste0(contents[contents != ""], collapse = "\n")

  as.list(tmp[,3]) %>% # turn the header values into a list
    set_names(tmp[,2]) %>% # make their names the tranformed keys
    append(c(
      source = source, # add source to the list (etc)
      datetime = datetime,
      request = request,
      contents = contents
    ))

}) -> sip_log_parsed

Let’s see what we have:

sip_log_parsed
## # A tibble: 4,266 x 18
##    via     content_length from    accept  user_agent to     contact cseq  source
##    <chr>   <chr>          <chr>   <chr>   <chr>      <chr>  <chr>   <chr> <chr> 
##  1 SIP/2.… 0              "\"sip… applic… friendly-… "\"si… sip:10… 1 OP… 210.1…
##  2 SIP/2.… 0              "\"342… applic… friendly-… "\"34… sip:34… 1 RE… 210.1…
##  3 SIP/2.… 0              "\"172… applic… friendly-… "\"17… sip:17… 1 RE… 210.1…
##  4 SIP/2.… 0              "\"adm… applic… friendly-… "\"ad… sip:ad… 1 RE… 210.1…
##  5 SIP/2.… 0              "\"inf… applic… friendly-… "\"in… sip:in… 1 RE… 210.1…
##  6 SIP/2.… 0              "\"tes… applic… friendly-… "\"te… sip:te… 1 RE… 210.1…
##  7 SIP/2.… 0              "\"pos… applic… friendly-… "\"po… sip:po… 1 RE… 210.1…
##  8 SIP/2.… 0              "\"sal… applic… friendly-… "\"sa… sip:sa… 1 RE… 210.1…
##  9 SIP/2.… 0              "\"ser… applic… friendly-… "\"se… sip:se… 1 RE… 210.1…
## 10 SIP/2.… 0              "\"sup… applic… friendly-… "\"su… sip:su… 1 RE… 210.1…
## # … with 4,256 more rows, and 9 more variables: datetime <chr>, request <chr>,
## #   contents <chr>, call_id <chr>, max_forwards <chr>, expires <chr>,
## #   allow <chr>, authorization <chr>, content_type <chr>

glimpse(sip_log_parsed)
## Rows: 4,266
## Columns: 18
## $ via            <chr> "SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;r…
## $ content_length <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ from           <chr> "\"sipvicious\"<sip:100@1.1.1.1>; tag=X_removed", "\"34…
## $ accept         <chr> "application/sdp", "application/sdp", "application/sdp"…
## $ user_agent     <chr> "friendly-scanner", "friendly-scanner", "friendly-scann…
## $ to             <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "\"3428948518\"<sip:…
## $ contact        <chr> "sip:100@127.0.0.1:5061", "sip:3428948518@honey.pot.IP.…
## $ cseq           <chr> "1 OPTIONS", "1 REGISTER", "1 REGISTER", "1 REGISTER", …
## $ source         <chr> "210.184.X.Y:1083", "210.184.X.Y:4956", "210.184.X.Y:51…
## $ datetime       <chr> "2010-05-02 01:43:05.606584", "2010-05-02 01:43:12.4888…
## $ request        <chr> "OPTIONS sip:100@honey.pot.IP.removed SIP/2.0", "REGIST…
## $ contents       <chr> "Call-ID: 845752980453913316694142\nMax-Forwards: 70", …
## $ call_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ max_forwards   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ expires        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ allow          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ authorization  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ content_type   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Looks ?, but IRL there are edge-cases we’d have to deal with.

Process Zeek Logs

Because they’re JSON files, and the names are reasonable, we can do some magic incantations to read them all in and shove them into a list we’ll call zeek:

zeek <- list()

list.files(
  pattern = "json$",
  full.names = TRUE
) %>%
  walk(~{
    append(zeek, list(file(.x) %>% 
      jsonlite::stream_in(verbose = FALSE) %>%
      as_tibble()) %>% 
        set_names(tools::file_path_sans_ext(basename(.x)))
    ) ->> zeek
  })

str(zeek, 1)
## List of 7
##  $ conn         : tibble [97 × 18] (S3: tbl_df/tbl/data.frame)
##  $ dpd          : tibble [1 × 9] (S3: tbl_df/tbl/data.frame)
##  $ files        : tibble [38 × 16] (S3: tbl_df/tbl/data.frame)
##  $ http         : tibble [92 × 24] (S3: tbl_df/tbl/data.frame)
##  $ packet_filter: tibble [1 × 5] (S3: tbl_df/tbl/data.frame)
##  $ sip          : tibble [9 × 23] (S3: tbl_df/tbl/data.frame)
##  $ weird        : tibble [1 × 9] (S3: tbl_df/tbl/data.frame)

walk2(names(zeek), zeek, ~{
  cat("File:", .x, "\n")
  glimpse(.y)
  cat("\n\n")
})
## File: conn 
## Rows: 97
## Columns: 18
## $ ts            <dbl> 1272737631, 1272737581, 1272737669, 1272737669, 12727376…
## $ uid           <chr> "Cb0OAQ1eC0ZhQTEKNl", "C2s0IU2SZFGVlZyH43", "CcEeLRD3cca…
## $ id.orig_h     <chr> "172.25.105.43", "172.25.105.43", "172.25.105.43", "172.…
## $ id.orig_p     <int> 57086, 5060, 57087, 57088, 57089, 57090, 57091, 57093, 5…
## $ id.resp_h     <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "172.…
## $ id.resp_p     <int> 80, 5060, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80…
## $ proto         <chr> "tcp", "udp", "tcp", "tcp", "tcp", "tcp", "tcp", "tcp", …
## $ service       <chr> "http", "sip", "http", "http", "http", "http", "http", "…
## $ duration      <dbl> 0.0180180073, 0.0003528595, 0.0245900154, 0.0740420818, …
## $ orig_bytes    <int> 502, 428, 380, 385, 476, 519, 520, 553, 558, 566, 566, 5…
## $ resp_bytes    <int> 720, 518, 231, 12233, 720, 539, 17499, 144, 144, 144, 14…
## $ conn_state    <chr> "SF", "SF", "SF", "SF", "SF", "SF", "SF", "SF", "SF", "S…
## $ missed_bytes  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ history       <chr> "ShADadfF", "Dd", "ShADadfF", "ShADadfF", "ShADadfF", "S…
## $ orig_pkts     <int> 5, 1, 5, 12, 5, 6, 16, 6, 6, 5, 5, 5, 5, 5, 5, 5, 6, 5, …
## $ orig_ip_bytes <int> 770, 456, 648, 1017, 744, 839, 1360, 873, 878, 834, 834,…
## $ resp_pkts     <int> 5, 1, 5, 12, 5, 5, 16, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ resp_ip_bytes <int> 988, 546, 499, 12865, 988, 807, 18339, 412, 412, 412, 41…
## 
## 
## File: dpd 
## Rows: 1
## Columns: 9
## $ ts             <dbl> 1272737798
## $ uid            <chr> "CADvMziC96POynR2e"
## $ id.orig_h      <chr> "172.25.105.3"
## $ id.orig_p      <int> 43204
## $ id.resp_h      <chr> "172.25.105.40"
## $ id.resp_p      <int> 5060
## $ proto          <chr> "udp"
## $ analyzer       <chr> "SIP"
## $ failure_reason <chr> "Binpac exception: binpac exception: string mismatch at…
## 
## 
## File: files 
## Rows: 38
## Columns: 16
## $ ts             <dbl> 1272737631, 1272737669, 1272737676, 1272737688, 1272737…
## $ fuid           <chr> "FRnb7P5EDeZE4Y3z4", "FOT2gC2yLxjfMCuE5f", "FmUCuA3dzcS…
## $ tx_hosts       <list> "172.25.105.40", "172.25.105.40", "172.25.105.40", "17…
## $ rx_hosts       <list> "172.25.105.43", "172.25.105.43", "172.25.105.43", "17…
## $ conn_uids      <list> "Cb0OAQ1eC0ZhQTEKNl", "CFfYtA0DqqrJk4gI5", "CHN4qA4UUH…
## $ source         <chr> "HTTP", "HTTP", "HTTP", "HTTP", "HTTP", "HTTP", "HTTP",…
## $ depth          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ analyzers      <list> [], [], [], [], [], [], [], [], [], [], [], [], [], []…
## $ mime_type      <chr> "text/html", "text/html", "text/html", "text/html", "te…
## $ duration       <dbl> 0.000000e+00, 8.920908e-03, 0.000000e+00, 0.000000e+00,…
## $ is_orig        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, …
## $ seen_bytes     <int> 479, 11819, 479, 313, 17076, 55, 50, 30037, 31608, 1803…
## $ total_bytes    <int> 479, NA, 479, 313, NA, 55, 50, NA, NA, NA, 58, 313, 50,…
## $ missing_bytes  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ overflow_bytes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ timedout       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## 
## 
## File: http 
## Rows: 92
## Columns: 24
## $ ts                <dbl> 1272737631, 1272737669, 1272737669, 1272737676, 1272…
## $ uid               <chr> "Cb0OAQ1eC0ZhQTEKNl", "CcEeLRD3cca3j4QGh", "CFfYtA0D…
## $ id.orig_h         <chr> "172.25.105.43", "172.25.105.43", "172.25.105.43", "…
## $ id.orig_p         <int> 57086, 57087, 57088, 57089, 57090, 57091, 57093, 570…
## $ id.resp_h         <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "…
## $ id.resp_p         <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
## $ trans_depth       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ method            <chr> "GET", "GET", "GET", "GET", "GET", "GET", "GET", "GE…
## $ host              <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "…
## $ uri               <chr> "/maint", "/", "/user/", "/maint", "/maint", "/maint…
## $ referrer          <chr> "http://172.25.105.40/user/", NA, NA, "http://172.25…
## $ version           <chr> "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.1", "1.…
## $ user_agent        <chr> "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9)…
## $ request_body_len  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ response_body_len <int> 479, 0, 11819, 479, 313, 17076, 0, 0, 0, 0, 0, 0, 0,…
## $ status_code       <int> 401, 302, 200, 401, 301, 200, 304, 304, 304, 304, 30…
## $ status_msg        <chr> "Authorization Required", "Found", "OK", "Authorizat…
## $ tags              <list> [], [], [], [], [], [], [], [], [], [], [], [], [],…
## $ resp_fuids        <list> "FRnb7P5EDeZE4Y3z4", <NULL>, "FOT2gC2yLxjfMCuE5f", …
## $ resp_mime_types   <list> "text/html", <NULL>, "text/html", "text/html", "tex…
## $ username          <chr> NA, NA, NA, NA, "maint", "maint", "maint", "maint", …
## $ password          <chr> NA, NA, NA, NA, "password", "password", "password", …
## $ orig_fuids        <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NU…
## $ orig_mime_types   <list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NU…
## 
## 
## File: packet_filter 
## Rows: 1
## Columns: 5
## $ ts      <dbl> 1627151196
## $ node    <chr> "zeek"
## $ filter  <chr> "ip or not ip"
## $ init    <lgl> TRUE
## $ success <lgl> TRUE
## 
## 
## File: sip 
## Rows: 9
## Columns: 23
## $ ts                <dbl> 1272737581, 1272737768, 1272737768, 1272737768, 1272…
## $ uid               <chr> "C2s0IU2SZFGVlZyH43", "CADvMziC96POynR2e", "CADvMziC…
## $ id.orig_h         <chr> "172.25.105.43", "172.25.105.3", "172.25.105.3", "17…
## $ id.orig_p         <int> 5060, 43204, 43204, 43204, 43204, 43204, 43204, 4320…
## $ id.resp_h         <chr> "172.25.105.40", "172.25.105.40", "172.25.105.40", "…
## $ id.resp_p         <int> 5060, 5060, 5060, 5060, 5060, 5060, 5060, 5060, 5060
## $ trans_depth       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ method            <chr> "OPTIONS", "REGISTER", "REGISTER", "SUBSCRIBE", "SUB…
## $ uri               <chr> "sip:100@172.25.105.40", "sip:172.25.105.40", "sip:1…
## $ request_from      <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "<sip:555@172.25.…
## $ request_to        <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "<sip:555@172.25.…
## $ response_from     <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "<sip:555@172.25.…
## $ response_to       <chr> "\"sipvicious\"<sip:100@1.1.1.1>;tag=as18cdb0c9", "<…
## $ call_id           <chr> "61127078793469957194131", "MzEwMmYyYWRiYTUxYTBhODY3…
## $ seq               <chr> "1 OPTIONS", "1 REGISTER", "2 REGISTER", "1 SUBSCRIB…
## $ request_path      <list> "SIP/2.0/UDP 127.0.1.1:5060", "SIP/2.0/UDP 172.25.10…
## $ response_path     <list> "SIP/2.0/UDP 127.0.1.1:5060", "SIP/2.0/UDP 172.25.10…
## $ user_agent        <chr> "UNfriendly-scanner - for demo purposes", "X-Lite B…
## $ status_code       <int> 200, 401, 200, 401, 404, 401, 100, 200, NA
## $ status_msg        <chr> "OK", "Unauthorized", "OK", "Unauthorized", "Not fo…
## $ request_body_len  <int> 0, 0, 0, 0, 0, 264, 264, 264, 0
## $ response_body_len <int> 0, 0, 0, 0, 0, 0, 0, 302, NA
## $ content_type      <chr> NA, NA, NA, NA, NA, NA, NA, "application/sdp", NA
## 
## 
## File: weird 
## Rows: 1
## Columns: 9
## $ ts        <dbl> 1272737805
## $ id.orig_h <chr> "172.25.105.3"
## $ id.orig_p <int> 0
## $ id.resp_h <chr> "172.25.105.40"
## $ id.resp_p <int> 0
## $ name      <chr> "truncated_IPv6"
## $ notice    <lgl> FALSE
## $ peer      <chr> "zeek"
## $ source    <chr> "IP"

Process Packet Summary

We won’t process the big JSON file tshark generated for us util we really have to, but we can read in the packet summary table now:

packet_cols <- c("packet_num", "ts", "src", "discard", "dst", "proto", "length", "info")

read_tsv(
  file = "voip-packets.tsv",
  col_names = packet_cols,
  col_types = "ddccccdc"
) %>%
  select(-discard) -> packets

packets
## # A tibble: 4,447 x 7
##    packet_num       ts src      dst     proto length info                       
##         <dbl>    <dbl> <chr>    <chr>   <chr>  <dbl> <chr>                      
##  1          1  0       172.25.… 172.25… SIP      470 Request: OPTIONS sip:100@1…
##  2          2  3.53e-4 172.25.… 172.25… SIP      560 Status: 200 OK |           
##  3          3  5.03e+1 172.25.… 172.25… TCP       74 57086 → 80 [SYN] Seq=0 Win…
##  4          4  5.03e+1 172.25.… 172.25… TCP       74 80 → 57086 [SYN, ACK] Seq=…
##  5          5  5.03e+1 172.25.… 172.25… TCP       66 57086 → 80 [ACK] Seq=1 Ack…
##  6          6  5.03e+1 172.25.… 172.25… HTTP     568 GET /maint HTTP/1.1        
##  7          7  5.03e+1 172.25.… 172.25… TCP       66 80 → 57086 [ACK] Seq=1 Ack…
##  8          8  5.03e+1 172.25.… 172.25… HTTP     786 HTTP/1.1 401 Authorization…
##  9          9  5.03e+1 172.25.… 172.25… TCP       66 80 → 57086 [FIN, ACK] Seq=…
## 10         10  5.03e+1 172.25.… 172.25… TCP       66 57086 → 80 [ACK] Seq=503 A…
## # … with 4,437 more rows

glimpse(packets)
## Rows: 4,447
## Columns: 7
## $ packet_num <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ ts         <dbl> 0.000000, 0.000353, 50.317176, 50.317365, 50.320071, 50.329…
## $ src        <chr> "172.25.105.43", "172.25.105.40", "172.25.105.43", "172.25.…
## $ dst        <chr> "172.25.105.40", "172.25.105.43", "172.25.105.40", "172.25.…
## $ proto      <chr> "SIP", "SIP", "TCP", "TCP", "TCP", "HTTP", "TCP", "HTTP", "…
## $ length     <dbl> 470, 560, 74, 74, 66, 568, 66, 786, 66, 66, 66, 66, 74, 74,…
## $ info       <chr> "Request: OPTIONS sip:100@172.25.105.40 |", "Status: 200 OK…

What is the transport protocol being used?

SIP can use TCP or UDP and which transport it uses will be specified in the Via: header. Let’s take a look:

head(sip_log_parsed$via)
## [1] "SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;rport"
## [2] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-1189344537;rport"
## [3] "SIP/2.0/UDP 127.0.0.1:5066;branch=z9hG4bK-2119091576;rport"
## [4] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-3226446220;rport"
## [5] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-1330901245;rport"
## [6] "SIP/2.0/UDP 127.0.0.1:5087;branch=z9hG4bK-945386205;rport"

Are they all UDP? We can find out by performing some light processing
on the via column:

sip_log_parsed %>% 
  select(via) %>% 
  mutate(
    transport = stri_match_first_regex(via, "^([^[:space:]]+)")[,2]
  ) %>% 
  count(transport, sort=TRUE)
## # A tibble: 1 x 2
##   transport       n
##   <chr>       <int>
## 1 SIP/2.0/UDP  4266

Looks like they’re all UDP. Question 1: ✅

The attacker used a bunch of scanning tools that belong to the same suite. Provide the name of the suite.

Don’t you, now, wish you had listen to your parents when they were telling you about the facts of SIP life when you were a wee pup?

We’ll stick with the SIP log to answer this one and peek back at the RFC to see that there’s a “User-Agent:” field which contains information about the client originating the request. Most scanners written by defenders identify themselves in User-Agent fields when those fields are available in a protocol exchange, and a large percentage of naive malicious folks are too daft to change this value (or leave it default to make you think they’re not behaving badly).

If you are a regular visitor to SIP land, you likely know the common SIP scanning tools. These are a few:

Nmap’s SIP library
Mr.SIP, a “SIP-Based Audit and Attack Tool”
SIPVicious, a “set of security tools that can be used to audit SIP based VoIP systems”
Sippts, a “set of tools to audit SIP based VoIP Systems”

(There are many more.)

Let’s see what user-agent was used in this log extract:

count(sip_log_parsed, user_agent, sort=TRUE)
## # A tibble: 3 x 2
##   user_agent           n
##   <chr>            <int>
## 1 friendly-scanner  4248
## 2 Zoiper rev.6751     14
## 3 <NA>                 4

The overwhelming majority are friendly-scanner. Let’s look at a few of those log entries:

sip_log_parsed %>% 
  filter(
    user_agent == "friendly-scanner"
  ) %>% 
  glimpse()
## Rows: 4,248
## Columns: 18
## $ via            <chr> "SIP/2.0/UDP 127.0.0.1:5061;branch=z9hG4bK-2159139916;r…
## $ content_length <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ from           <chr> "\"sipvicious\"<sip:100@1.1.1.1>; tag=X_removed", "\"34…
## $ accept         <chr> "application/sdp", "application/sdp", "application/sdp"…
## $ user_agent     <chr> "friendly-scanner", "friendly-scanner", "friendly-scann…
## $ to             <chr> "\"sipvicious\"<sip:100@1.1.1.1>", "\"3428948518\"<sip:…
## $ contact        <chr> "sip:100@127.0.0.1:5061", "sip:3428948518@honey.pot.IP.…
## $ cseq           <chr> "1 OPTIONS", "1 REGISTER", "1 REGISTER", "1 REGISTER", …
## $ source         <chr> "210.184.X.Y:1083", "210.184.X.Y:4956", "210.184.X.Y:51…
## $ datetime       <chr> "2010-05-02 01:43:05.606584", "2010-05-02 01:43:12.4888…
## $ request        <chr> "OPTIONS sip:100@honey.pot.IP.removed SIP/2.0", "REGIST…
## $ contents       <chr> "Call-ID: 845752980453913316694142\nMax-Forwards: 70", …
## $ call_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ max_forwards   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ expires        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ allow          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ authorization  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ content_type   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Those from and to fields have an interesting name in them: “sipviscious”. You’ve seen that before, right at the beginning of this section.

Let’s do a quick check over at the SIPvicious repo just to make sure.

count(sip_log_parsed, user_agent)
## # A tibble: 3 x 2
##   user_agent           n
##   <chr>            <int>
## 1 friendly-scanner  4248
## 2 Zoiper rev.6751     14
## 3 <NA>                 4

“What is the User-Agent of the victim system?”

We only have partial data in the text log so we’ll have to look elsewhere (the PCAP) for this information. The “victim” is whatever was the target of a this SIP-based attack and we can look for SIP messages, user agents, and associated IPs in the PCAP thanks to tshark’s rich SIP filter library:

system("tshark -Q -T fields -e ip.src -e ip.dst -e sip.User-Agent -r src/Voip-trace.pcap 'sip.User-Agent'")

That first exchange is all we really need. We see our rude poker talking to 172.25.105.40 and it responding right after.

Which tool was only used against the following extensions: 100, 101, 102, 103, and 111?

The question is a tad vague and is assuming — since we now know the SIPvicious suite was used — that we also know to provide the name of the Python script in SIPvicious that was used. There are five tools:

svmap: this is a sip scanner. When launched against ranges of ip address space, it will identify any SIP servers which it finds on the way. Also has the option to scan hosts on ranges of ports. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVMap-Usage
svwar: identifies working extension lines on a PBX. A working extension is one that can be registered. Also tells you if the extension line requires authentication or not. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVWar-Usage
svcrack: a password cracker making use of digest authentication. It is able to crack passwords on both registrar servers and proxy servers. Current cracking modes are either numeric ranges or words from dictionary files. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVCrack-Usage
svreport: able to manage sessions created by the rest of the tools and export to pdf, xml, csv and plain text. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVReport-Usage
svcrash: responds to svwar and svcrack SIP messages with a message that causes old versions to crash. Usage: https://github.com/EnableSecurity/sipvicious/wiki/SVCrash-FAQ

The svcrash tool is something defenders can use to help curtail scanner activity. We can cross that off the list. The svreport tool is for working with data generated by svmap, svwar and/or svcrack. One more crossed off. We also know that the attacker scanned the SIP network looking for nodes, which means svmap and svwar are likely not exclusive tool to the target extensions. (We technically have enough information right now to answer the question especially if you look carefully at the answer box on the site but that’s cheating).

The SIP request line and header field like “To:” destination information in the form of a SIP URI. Since we only care about the extension component of the URI for this question, we can use a regular expression to isolate them.

Back to the SIP log to see if we can find the identified extensions. We’ll also process the “From:” header just in case we need it.

sip_log_parsed %>% 
  mutate_at(
    vars(request, from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  select(request, from, to)
## # A tibble: 4,266 x 3
##    request    from       to        
##    <chr>      <chr>      <chr>     
##  1 100        100        100       
##  2 3428948518 3428948518 3428948518
##  3 1729240413 1729240413 1729240413
##  4 admin      admin      admin     
##  5 info       info       info      
##  6 test       test       test      
##  7 postmaster postmaster postmaster
##  8 sales      sales      sales     
##  9 service    service    service   
## 10 support    support    support   
## # … with 4,256 more rows

That worked! We can now see what friendly-scanner attempted to authenticate only to our targets:

sip_log_parsed %>%
  mutate_at(
    vars(request, from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  filter(
    user_agent == "friendly-scanner",
    stri_detect_fixed(contents, "Authorization")
  ) %>% 
  distinct(to)
## # A tibble: 4 x 1
##   to   
##   <chr>
## 1 102  
## 2 103  
## 3 101  
## 4 111

While we’re missing 100 that’s likely due to it not requiring authentication (svcrack will REGISTER first to determine if a target requires authentication and not send cracking requests if it doesn’t).

Which extension on the honeypot does NOT require authentication?

We know this due to what we found in the previous question. Extension 100 does not require authentication.

How many extensions were scanned in total?

We just need to count the distinct to’s where the user agent is the scanner:

sip_log_parsed %>% 
  mutate_at(
    vars(request, from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  filter(
    user_agent == "friendly-scanner"
  ) %>% 
  distinct(to)
## # A tibble: 2,652 x 1
##    to        
##    <chr>     
##  1 100       
##  2 3428948518
##  3 1729240413
##  4 admin     
##  5 info      
##  6 test      
##  7 postmaster
##  8 sales     
##  9 service   
## 10 support   
## # … with 2,642 more rows

There is a trace for a real SIP client. What is the corresponding user-agent? (two words, once space in between)

We only need to look for user agent’s that aren’t our scanner:

sip_log_parsed %>% 
  filter(
    user_agent != "friendly-scanner"
  ) %>% 
  count(user_agent)
## # A tibble: 1 x 2
##   user_agent          n
##   <chr>           <int>
## 1 Zoiper rev.6751    14

Multiple real-world phone numbers were dialed. Provide the first 11 digits of the number dialed from extension 101?

Calls are “INVITE” requests

sip_log_parsed %>% 
  mutate_at(
    vars(from, to),
    ~stri_match_first_regex(.x, "sip:([^@]+)@")[,2]
  ) %>% 
  filter(
    from == 101,
    stri_detect_regex(cseq, "INVITE")
  ) %>% 
  select(to) 
## # A tibble: 3 x 1
##   to              
##   <chr>           
## 1 900114382089XXXX
## 2 00112322228XXXX 
## 3 00112524021XXXX

The challenge answer box provides hint to what number they want. I’m not sure but I suspect it may be randomized, so you’ll have to match the pattern they expect with the correct digits above.

What are the default credentials used in the attempted basic authentication? (format is username:password)

This question wants us to look at the HTTP requests that require authentication. We can get he credentials info from the zeek$http log:

zeek$http %>% 
  distinct(username, password)
## # A tibble: 2 x 2
##   username password
##   <chr>    <chr>   
## 1 <NA>     <NA>    
## 2 maint    password

Which codec does the RTP stream use? (3 words, 2 spaces in between)

“Codec” refers to the algorithm used to encode/decode an audio or video stream. The RTP RFC uses the term “payload type” to refer to this during exchanges and even has a link to RFC 3551 which provides further information on these encodings.

The summary packet table that tshark generates helpfully provides summary info for RTP packets and part of that info is PT=… which indicates the payload type.

packets %>% 
  filter(proto == "RTP") %>% 
  select(info)
## # A tibble: 2,988 x 1
##    info                                                       
##    <chr>                                                      
##  1 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6402, Time=126160
##  2 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6403, Time=126320
##  3 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6404, Time=126480
##  4 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6405, Time=126640
##  5 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6406, Time=126800
##  6 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6407, Time=126960
##  7 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6408, Time=127120
##  8 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6409, Time=127280
##  9 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6410, Time=127440
## 10 PT=ITU-T G.711 PCMU, SSRC=0xA254E017, Seq=6411, Time=127600
## # … with 2,978 more rows

How long is the sampling time (in milliseconds)?

1 Hz = 1,000 ms
1 ms = 1,000 Hz

(1/8000) * 1000

What was the password for the account with username 555?

We don’t really need to use external programs for this but it will sure go quite a bit faster if we do. While the original reference page for sipdump and sipcrack is defunct, you can visit that link to go to the Wayback machine’s capture of it. It will help if you have a linux system handy (so Docker to the rescue for macOS and Windows folks) since the following answer details are running on Ubunbu.

This question is taking advantage of the fact that the default authentication method for SIP is extremely weak. The process uses an MD5 challenge/response, and if an attacker can capture call traffic it is possible to brute force the password offline (which is what we’ll use sipcrack for).

You can install them via sudo apt install sipcrack.

We’ll first generate a dump of the authentication attempts with sipdump:

system("sipdump -p src/Voip-trace.pcap sip.dump", intern=TRUE)
##  [1] ""                                                               
##  [2] "SIPdump 0.2 "                                                   
##  [3] "---------------------------------------"                        
##  [4] ""                                                               
##  [5] "* Using pcap file 'src/Voip-trace.pcap' for sniffing"           
##  [6] "* Starting to sniff with packet filter 'tcp or udp'"            
##  [7] ""                                                               
##  [8] "* Dumped login from 172.25.105.40 -> 172.25.105.3 (User: '555')"
##  [9] "* Dumped login from 172.25.105.40 -> 172.25.105.3 (User: '555')"
## [10] "* Dumped login from 172.25.105.40 -> 172.25.105.3 (User: '555')"
## [11] ""                                                               
## [12] "* Exiting, sniffed 3 logins"

cat(readLines("sip.dump"), sep="\n")
## 172.25.105.3"172.25.105.40"555"asterisk"REGISTER"sip:172.25.105.40"4787f7ce""""MD5"1ac95ce17e1f0230751cf1fd3d278320
## 172.25.105.3"172.25.105.40"555"asterisk"INVITE"sip:1000@172.25.105.40"70fbfdae""""MD5"aa533f6efa2b2abac675c1ee6cbde327
## 172.25.105.3"172.25.105.40"555"asterisk"BYE"sip:1000@172.25.105.40"70fbfdae""""MD5"0b306e9db1f819dd824acf3227b60e07

It saves the IPs, caller, authentication realm, method, nonce and hash which will all be fed into the sipcrack.

We know from the placeholder answer text that the “password” is 4 characters, and this is the land of telephony, so we can make an assumption that it is really 4 digits. sipcrack needs a file of passwords to try, so We’ll let R make a randomized file of 4 digit pins for us:

cat(sprintf("%04d", sample(0:9999)), file = "4-digits", sep="\n")

We only have authenticaton packets for 555 so we can automate what would normally be an interactive process:

cat(system('echo "1" | sipcrack -w 4-digits sip.dump', intern=TRUE), sep="\n")
## 
## SIPcrack 0.2 
## ----------------------------------------
## 
## * Found Accounts:
## 
## Num  Server      Client      User    Hash|Password
## 
## 1    172.25.105.3    172.25.105.40   555 1ac95ce17e1f0230751cf1fd3d278320
## 2    172.25.105.3    172.25.105.40   555 aa533f6efa2b2abac675c1ee6cbde327
## 3    172.25.105.3    172.25.105.40   555 0b306e9db1f819dd824acf3227b60e07
## 
## * Select which entry to crack (1 - 3): 
## * Generating static MD5 hash... c3e0f1664fde9fbc75a7cbd341877875
## * Loaded wordlist: '4-digits'
## * Starting bruteforce against user '555' (MD5: '1ac95ce17e1f0230751cf1fd3d278320')
## * Tried 8904 passwords in 0 seconds
## 
## * Found password: '1234'
## * Updating dump file 'sip.dump'... done

Which RTP packet header field can be used to reorder out of sync RTP packets in the correct sequence?

Just reading involved here: 5.1 RTP Fixed Header Fields.

The trace includes a secret hidden message. Can you hear it?

We could command line this one but honestly Wireshark has a pretty keen audio player. Fire it up, open up the PCAP, go to the “Telephony” menu, pick SIP and play the streams.

Packet Maze: Solving a CyberDefenders PCAP Puzzle with R, Zeek, and tshark

2021-07-20 – 15:18
Posted in Cybersecurity, Data Analysis, data driven security, data science, data wrangling, DNS, Information Security, pcap, R
Comments (7)

It was a rainy weekend in southern Maine and I really didn’t feel like doing chores, so I was skimming through RSS feeds and noticed a link to a PacketMaze challenge in the latest This Week In 4n6.

Since it’s also been a while since I’ve done any serious content delivery (on the personal side, anyway), I thought it’d be fun to solve the challenge with some tools I like — namely Zeek, tshark, and R (links to those in the e-book I’m linking to below), craft some real expository around each solution, and bundle it all up into an e-book and lighter-weight GitHub repo.

There are 11 “quests” in the challenge, requiring sifting through a packet capture (PCAP) and looking for various odds and ends (some are very windy maze passages). The challenge ranges from extracting images and image metadata from FTP sessions to pulling out precise elements in TLS sessions, to dealing with IPv6.

This is far from an expert challenge, and anyone can likely work through it with a little bit of elbow grease.

As it says on the tin, not all data is ‘big’ nor do all data-driven cybersecurity projects require advanced modeling capabilities. Sometimes you just need to dissect some network packet capture (PCAP) data and don’t want to click through a GUI to get the job done. This short book works through the questions in CyberDefenders Lab #68 to show how you can get the Zeek open source network security tool, tshark command-line PCAP analysis Swiss army knife, and R (via RStudio) working together.

FIN

If you find the resource helpful or have other feedback, drop a note on Twitter (@hrbrmstr), in a comment here, or as a GitHub issue.

CRAN Mirror “Security”

2019-03-03 – 15:22
Posted in Cybersecurity, data driven security, Information Security, R
Comments (5)

In the “Changes on CRAN” section of the latest version of the The R Journal (Vol. 10/2, December 2018) had this short blurb entitled “CRAN mirror security”:

Currently, there are 100 official CRAN mirrors, 68 of which provide both secure downloads via ‘https’ and use secure mirroring from the CRAN master (via rsync through ssh tunnels). Since the R 3.4.0 release, chooseCRANmirror() offers these mirrors in preference to the others which are not fully secured (yet).

I would have linked to the R Journal section quoted above but I can’t because I’m blocked from accessing all resources at the IP address serving cran.r-project.org from my business-class internet connection likely due to me having a personal CRAN mirror (that was following the rules, which I also cannot link to since I can’t get to the site).

That word — “security” — is one of the most misunderstood and misused terms in modern times in many contexts. The context for the use here is cybersecurity and since CRAN (and others in the R community) seem to equate transport-layer uber-obfuscation with actual security/safety I thought it would be useful for R users in general to get a more complete picture of these so-called “secure” hosts. I also did this since I had to figure out another way to continue to have a CRAN mirror and needed to validate which nodes both supported + allowed mirroring and were at least somewhat trustworthy.

Unless there is something truly egregious in a given section I’m just going to present data with some commentary (I’m unamused abt being blocked so some commentary has an unusually sharp edge) and refrain from stating “X is ?|?” since the goal is really to help you make the best decision of which mirror to use on your own.

The full Rproj supporting the snippets in this post (and including the data gathered by the post) can be found in my new R blog projects.

We’re going to need a few supporting packages so let’s get those out of the way:

library(xml2)
library(httr)
library(curl)
library(stringi)
library(urltools)
library(ipinfo) # install.packages("ipinfo", repos = "https://cinc.rud.is/")
library(openssl)
library(furrr)
library(vershist) # install.packages("vershist", repos = "https://cinc.rud.is/")
library(ggalt)
library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

What Is “Secure”?

As noted, CRAN folks seem to think encryption == security since the criteria for making that claim in the R Journal was transport-layer encryption for rsync (via ssh) mirroring from CRAN to a downstream mirror and a downstream mirror providing an https transport for shuffling package binaries and sources from said mirror to your local system(s). I find that equally as adorable as I do the rhetoric from the Let’s Encrypt cabal as this https gets you:

in theory protection from person-in-the-middle attacks that could otherwise fiddle with the package bits in transport
protection from your organization or ISP knowing what specific package you were grabbing; note that unless you’ve got a setup where your DNS requests are also encrypted the entity that controls your transport layer does indeed know exactly where you’re going.

and…that’s about it.

The soon-to-be-gone-and-formerly-green-in-most-browsers lock icon alone tells you nothing about the configuration of any site you’re connecting to and using rsync over ssh provides no assurance as to what else is on the CRAN mirror server(s), what else is using the mirror server(s), how many admins/users have shell access to those system(s) nor anything else about the cyber hygiene of those systems.

So, we’re going to look at (not necessarily in this order & non-exhaustively since this isn’t a penetration test and only lightweight introspection has been performed):

how many servers are involved in a given mirror URL
SSL certificate information including issuer, strength, and just how many other domains can use the cert
the actual server SSL transport configuration to see just how many CRAN mirrors have HIGH or CRITICAL SSL configuration issues
use (or lack thereof) HTTP “security” headers (I mean, the server is supposed to be “secure”, right?)
how much other “junk” is running on a given CRAN mirror (the more running services the greater the attack surface)

We’ll use R for most of this, too (I’m likely never going to rewrite longstanding SSL testers in/for R).

Let’s dig in.

Acquiring Most of the Metadata

It can take a little while to run some of the data gathering steps so the project repo includes the already-gathered data. But, we’ll show the work on the first bit of reconnaissance which involves:

Slurping the SSL certificate from the first server in each CRAN mirror entry (again, I can’t link to the mirror page because I literally can’t see CRAN or the main R site anymore)
Performing an HTTP HEAD request (to minimize server bandwidth & CPU usage) of the full CRAN mirror URL (we have to since load balancers or proxies could re-route us to a completely different server otherwise)
Getting an IP address for each CRAN mirror
Getting metadata about that IP address

This all done below:

if (!file.exists(here::here("data/mir-dat.rds"))) {
  mdoc <- xml2::read_xml(here::here("data/mirrors.html"), as_html = TRUE)

  xml_find_all(mdoc, ".//td/a[contains(@href, 'https')]") %>%
    xml_attr("href") %>%
    unique() -> ssl_mirrors

  plan(multiprocess)

  # safety first
  dl_cert <- possibly(openssl::download_ssl_cert, NULL)
  HEAD_ <- possibly(httr::HEAD, NULL)
  dig <- possibly(curl::nslookup, NULL)
  query_ip_ <- possibly(ipinfo::query_ip, NULL)

  ssl_mirrors %>%
    future_map(~{
      host <- domain(.x)
      ip <- dig(host, TRUE)
      ip_info <- if (length(ip)) query_ip_(ip) else NULL
      list(
        host = host,
        cert = dl_cert(host),
        head = HEAD_(.x),
        ip = ip,
        ip_info = ip_info
      )
    }) -> mir_dat

  saveRDS(mir_dat, here::here("data/mir-dat.rds"))
} else {
  mir_dat <- readRDS(here::here("data/mir-dat.rds"))
}

# take a look

str(mir_dat[1], 3)
## List of 1
##  $ :List of 5
##   ..$ host   : chr "cloud.r-project.org"
##   ..$ cert   :List of 4
##   .. ..$ :List of 8
##   .. ..$ :List of 8
##   .. ..$ :List of 8
##   .. ..$ :List of 8
##   ..$ head   :List of 10
##   .. ..$ url        : chr "https://cloud.r-project.org/"
##   .. ..$ status_code: int 200
##   .. ..$ headers    :List of 13
##   .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
##   .. ..$ all_headers:List of 1
##   .. ..$ cookies    :'data.frame':   0 obs. of  7 variables:
##   .. ..$ content    : raw(0) 
##   .. ..$ date       : POSIXct[1:1], format: "2018-11-29 09:41:27"
##   .. ..$ times      : Named num [1:6] 0 0.0507 0.0512 0.0666 0.0796 ...
##   .. .. ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
##   .. ..$ request    :List of 7
##   .. .. ..- attr(*, "class")= chr "request"
##   .. ..$ handle     :Class 'curl_handle' <externalptr> 
##   .. ..- attr(*, "class")= chr "response"
##   ..$ ip     : chr "52.85.89.62"
##   ..$ ip_info:List of 8
##   .. ..$ ip      : chr "52.85.89.62"
##   .. ..$ hostname: chr "server-52-85-89-62.jfk6.r.cloudfront.net"
##   .. ..$ city    : chr "Seattle"
##   .. ..$ region  : chr "Washington"
##   .. ..$ country : chr "US"
##   .. ..$ loc     : chr "47.6348,-122.3450"
##   .. ..$ postal  : chr "98109"
##   .. ..$ org     : chr "AS16509 Amazon.com, Inc."

Note that two sites failed to respond so they were excluded from all analyses.

A Gratuitous Map of “Secure” CRAN Servers

Since ipinfo.io‘s API returns lat/lng geolocation information why not start with a map (since that’s going to be the kindest section of this post):

maps::map("world", ".", exact = FALSE, plot = FALSE,  fill = TRUE) %>%
  fortify() %>%
  filter(region != "Antarctica") -> world

map_chr(mir_dat, ~.x$ip_info$loc) %>%
  stri_split_fixed(pattern = ",", n = 2, simplify = TRUE) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  as_tibble() %>%
  mutate_all(list(as.numeric)) -> wheres_cran

ggplot() +
  ggalt::geom_cartogram(
    data = world, map = world, aes(long, lat, map_id=region),
    color = ft_cols$gray, size = 0.125
  ) +
  geom_point(
    data = wheres_cran, aes(V2, V1), size = 2,
    color = ft_cols$slate, fill = alpha(ft_cols$yellow, 3/4), shape = 21
  ) +
  ggalt::coord_proj("+proj=wintri") +
  labs(
    x = NULL, y = NULL,
    title = "Geolocation of HTTPS-'enabled' CRAN Mirrors"
  ) +
  theme_ft_rc(grid="") +
  theme(axis.text = element_blank())

Shakesperian Security

What’s in a [Subject Alternative] name? That which we call a site secure. By using dozens of other names would smell as not really secure at all? —Hackmeyo & Pwndmeyet (II, ii, 1-2)

The average internet user likely has no idea that one SSL certificate can front a gazillion sites. I’m not just talking a wildcard cert (e.g. using *.rud.is for all rud.is subdomains which I try not to do for many reasons), I’m talking dozens of subject alternative names. Let’s examine some data since an example is better than blathering:

# extract some of the gathered metadata into a data frame
map_df(mir_dat, ~{
  tibble(
    host = .x$host,
    s_issuer = .x$cert[[1]]$issuer %||% NA_character_,
    i_issuer = .x$cert[[2]]$issuer %||% NA_character_,
    algo = .x$cert[[1]]$algorithm %||% NA_character_,
    names = .x$cert[[1]]$alt_names %||% NA_character_,
    nm_ct = length(.x$cert[[1]]$alt_names),
    key_size = .x$cert[[1]]$pubkey$size %||% NA_integer_
  )
}) -> certs

certs <- filter(certs, complete.cases(certs))

count(certs, host, sort=TRUE) %>%
  ggplot() +
  geom_quasirandom(
    aes("", n), size = 2,
    color = ft_cols$slate, fill = alpha(ft_cols$yellow, 3/4), shape = 21
  ) +
  scale_y_comma() +
  labs(
    x = NULL, y = "# Servers",
    title = "Distribution of the number of alt-names in CRAN mirror certificates"
  ) +
  theme_ft_rc(grid="Y")

Most only front a couple but there are some with a crazy amount of domains. We can look at a slice of cran.cnr.berkeley.edu:

filter(certs, host == "cran.cnr.berkeley.edu") %>%
  select(names) %>%
  head(20)

names
nature.berkeley.edu
ag-labor.cnr.berkeley.edu
agro-laboral.cnr.berkeley.edu
agroecology.berkeley.edu
anthoff.erg.berkeley.edu
are-dev.cnr.berkeley.edu
are-prod.cnr.berkeley.edu
are-qa.cnr.berkeley.edu
are.berkeley.edu
arebeta.berkeley.edu
areweb.berkeley.edu
atkins-dev.cnr.berkeley.edu
atkins-prod.cnr.berkeley.edu
atkins-qa.cnr.berkeley.edu
atkins.berkeley.edu
bakerlab-dev.cnr.berkeley.edu
bakerlab-prod.cnr.berkeley.edu
bakerlab-qa.cnr.berkeley.edu
bamg.cnr.berkeley.edu
beahrselp-dev.cnr.berkeley.edu

The project repo has some more examples and you can examine as many as you like.

For some CRAN mirrors the certificate is used all over the place at the hosting organization. That alone isn’t bad, but organizations are generally terrible at protecting the secrets associated with certificate generation (just look at how many Google/Apple app store apps are found monthly to be using absconded-with enterprise certs) and since each server with these uber-certs has copies of public & private bits users had better hope that mal-intentioned ne’er-do-wells do not get copies of them (making it easier to impersonate any one of those, especially if an attacker controls DNS).

This Berkeley uber-cert is also kinda cute since it mixes alt-names for dev, prod & qa systems across may different apps/projects (dev systems are notoriously maintained improperly in virtually every organization).

There are legitimate reasons and circumstances for wildcard certs and taking advantage of SANs. You can examine what other CRAN mirrors do and judge for yourself which ones are Doing It Kinda OK.

Size (and Algorithm) Matters

In some crazy twist of pleasant surprises most of the mirrors seem to do OK when it comes to the algorithm and key size used for the certificate(s):

distinct(certs, host, algo, key_size) %>%
  count(algo, key_size, sort=TRUE)

algo	key_size	n
sha256WithRSAEncryption	2048	59
sha256WithRSAEncryption	4096	13
ecdsa-with-SHA256	256	2
sha256WithRSAEncryption	256	1
sha256WithRSAEncryption	384	1
sha512WithRSAEncryption	2048	1
sha512WithRSAEncryption	4096	1

You can go to the mirror list and hit up SSL Labs Interactive Server Test (which has links to many ‘splainers) or use the ssllabs? R package to get the grade of each site. I dig into the state of config and transport issues below but will suggest that you stick with sites with ecdsa certs or sha256 and higher numbers if you want a general, quick bit of guidance.

Where Do They Get All These Wonderful Certs?

Certs come from somewhere. You can self-generate play ones, setup your own internal/legit certificate authority and augment trust chains, or go to a bona-fide certificate authority to get a certificate.

Your browsers and operating systems have a built-in set of certificate authorities they trust and you can use ssllabs::get_root_certs()? to see an up-to-date list of ones for Mozilla, Apple, Android, Java & Windows. In the age of Let’s Encrypt, certificates have almost no monetary value and virtually no integrity value so where they come from isn’t as important as it used to be, but it’s kinda fun to poke at it anyway:

distinct(certs, host, i_issuer) %>%
  count(i_issuer, sort = TRUE) %>%
  head(28)

i_issuer	n
CN=DST Root CA X3,O=Digital Signature Trust Co.	20
CN=COMODO RSA Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB	7
CN=DigiCert Assured ID Root CA,OU=www.digicert.com,O=DigiCert Inc,C=US	7
CN=DigiCert Global Root CA,OU=www.digicert.com,O=DigiCert Inc,C=US	6
CN=DigiCert High Assurance EV Root CA,OU=www.digicert.com,O=DigiCert Inc,C=US	6
CN=QuoVadis Root CA 2 G3,O=QuoVadis Limited,C=BM	5
CN=USERTrust RSA Certification Authority,O=The USERTRUST Network,L=Jersey City,ST=New Jersey,C=US	5
CN=GlobalSign Root CA,OU=Root CA,O=GlobalSign nv-sa,C=BE	4
CN=Trusted Root CA SHA256 G2,O=GlobalSign nv-sa,OU=Trusted Root,C=BE	3
CN=COMODO ECC Certification Authority,O=COMODO CA Limited,L=Salford,ST=Greater Manchester,C=GB	2
CN=DFN-Verein PCA Global – G01,OU=DFN-PKI,O=DFN-Verein,C=DE	2
OU=Security Communication RootCA2,O=SECOM Trust Systems CO.\,LTD.,C=JP	2
CN=AddTrust External CA Root,OU=AddTrust External TTP Network,O=AddTrust AB,C=SE	1
CN=Amazon Root CA 1,O=Amazon,C=US	1
CN=Baltimore CyberTrust Root,OU=CyberTrust,O=Baltimore,C=IE	1
CN=Certum Trusted Network CA,OU=Certum Certification Authority,O=Unizeto Technologies S.A.,C=PL	1
CN=DFN-Verein Certification Authority 2,OU=DFN-PKI,O=Verein zur Foerderung eines Deutschen Forschungsnetzes e. V.,C=DE	1
CN=Go Daddy Root Certificate Authority – G2,O=GoDaddy.com\, Inc.,L=Scottsdale,ST=Arizona,C=US	1
CN=InCommon RSA Server CA,OU=InCommon,O=Internet2,L=Ann Arbor,ST=MI,C=US	1
CN=QuoVadis Root CA 2,O=QuoVadis Limited,C=BM	1
CN=QuoVadis Root Certification Authority,OU=Root Certification Authority,O=QuoVadis Limited,C=BM	1

That first one is Let’s Encrypt, which is not unexpected since they’re free and super easy to setup/maintain (especially for phishing campaigns).

A “fun” exercise might be to Google/DDG around for historical compromises tied to these CAs (look in the subject ones too if you’re playing with the data at home) and see what, eh, issues they’ve had.

You might want to keep more of an eye on this whole “boring” CA bit, too, since some trust stores are noodling on the idea of trusting surveillance firms and you never know what Microsoft or Google is going to do to placate authoritarian regimes and allow into their trust stores.

At this point in the exercise you’ve got

how many domains a certificate fronts
certificate strength
certificate birthplace

to use when formulating your own decision on what CRAN mirror to use.

But, as noted, certificate breeding is not enough. Let’s dive into the next areas.

It’s In The Way That You Use It

You can’t just look at a cert to evaluate site security. Sure, you can spend 4 days and use the aforementioned ssllabs package to get the rating for each cert (well, if they’ve been cached then an API call won’t be an assessment so you can prime the cache with 4 other ppl in one day and then everyone else can use the cached values and not burn the rate limit) or go one-by-one in the SSL Labs test site, but we can also use a tool like testssl.sh? to gather technical data via interactive protocol examination.

I’m being a bit harsh in this post, so fair’s fair and here are the plaintext results from my own run of testssl.sh for rud.is along with ones from Qualys:

As you can see in the detail pages, I am having an issue with the provider of my .is domain (severe limitation on DNS record counts and types) so I fail CAA checks because I literally can’t add an entry for it nor can I use a different nameserver. Feel encouraged to pick nits about that tho as that should provide sufficient impetus to take two weeks of IRL time and some USD to actually get it transferred (yay. international. domain. providers.)

The project repo has all the results from a weekend run on the CRAN mirrors. No special options were chosen for the runs.

list.files(here::here("data/ssl"), "json$", full.names = TRUE) %>%
  map_df(jsonlite::fromJSON) %>%
  as_tibble() -> ssl_tests

# filter only fields we want to show and get them in order
sev <- c("OK", "LOW", "MEDIUM", "HIGH", "WARN", "CRITICAL")

group_by(ip) %>%
  count(severity) %>%
  ungroup() %>%
  complete(ip = unique(ip), severity = sev) %>%
  mutate(severity = factor(severity, levels = sev)) %>% # order left->right by severity
  arrange(ip) %>%
  mutate(ip = factor(ip, levels = rev(unique(ip)))) %>% # order alpha by mirror name so it's easier to ref
  ggplot(aes(severity, ip, fill=n)) +
  geom_tile(color = "#b2b2b2", size = 0.125) +
  scale_x_discrete(name = NULL, expand = c(0,0.1), position = "top") +
  scale_y_discrete(name = NULL, expand = c(0,0)) +
  viridis::scale_fill_viridis(
    name = "# Tests", option = "cividis", na.value = ft_cols$gray
  ) +
  labs(
    title = "CRAN Mirror SSL Test Summary Findings by Severity"
  ) +
  theme_ft_rc(grid="") +
  theme(axis.text.y = element_text(size = 8, family = "mono")) -> gg

# We're going to move the title vs have too wide of a plot

gb <- ggplot2::ggplotGrob(gg)
gb$layout$l[gb$layout$name %in% "title"] <- 2

grid::grid.newpage()
grid::grid.draw(gb)

Thankfully most SSL checks come back OK. Unfortunately, many do not:

filter(ssl_tests,severity == "HIGH") %>% 
  count(id, sort = TRUE)

id	n
BREACH	42
cipherlist_3DES_IDEA	37
cipher_order	34
RC4	16
cipher_negotiated	10
LOGJAM-common_primes	9
POODLE_SSL	6
SSLv3	6
cert_expiration_status	1
cert_notAfter	1
fallback_SCSV	1
LOGJAM	1
secure_client_renego	1

filter(ssl_tests,severity == "CRITICAL") %>% 
  count(id, sort = TRUE)

id	n
cipherlist_LOW	16
TLS1_1	5
CCS	2
cert_chain_of_trust	1
cipherlist_aNULL	1
cipherlist_EXPORT	1
DROWN	1
FREAK	1
ROBOT	1
SSLv2	1

Some CRAN mirror site admins aren’t keeping up with secure SSL configurations. If you’re not familiar with some of the acronyms here are a few (fairly layman-friendly) links:

You’d be hard-pressed to have me say that the presence of these is the end of the world (I mean, you’re trusting random servers to provide packages for you which may run in secure enclaves on production code, so how important can this really be?) but I also wouldn’t attach the word “secure” to any CRAN mirror with HIGH or CRITICAL SSL configuration weaknesses.

Getting Ahead[er] Of Myself

We did the httr::HEAD() request primarily to capture HTTP headers. And, we definitely got some!

map_df(mir_dat, ~{

  if (length(.x$head$headers) == 0) return(NULL)

  host <- .x$host

  flatten_df(.x$head$headers) %>%
    gather(name, value) %>%
    mutate(host = host)

}) -> hdrs

count(hdrs, name, sort=TRUE) %>%
  head(nrow(.))

name	n
content-type	79
date	79
server	79
last-modified	72
content-length	67
accept-ranges	65
etag	65
content-encoding	38
connection	28
vary	28
strict-transport-security	13
x-frame-options	8
x-content-type-options	7
cache-control	4
expires	3
x-xss-protection	3
cf-ray	2
expect-ct	2
set-cookie	2
via	2
ms-author-via	1
pragma	1
referrer-policy	1
upgrade	1
x-amz-cf-id	1
x-cache	1
x-permitted-cross-domain	1
x-powered-by	1
x-robots-tag	1
x-tuna-mirror-id	1
x-ua-compatible	1

There are a handful of “security” headers that kinda matter so we’ll see how many “secure” CRAN mirrors use “security” headers:

c(
  "content-security-policy", "x-frame-options", "x-xss-protection",
  "x-content-type-options", "strict-transport-security", "referrer-policy"
) -> secure_headers

count(hdrs, name, sort=TRUE) %>%
  filter(name %in% secure_headers)

name	n
strict-transport-security	13
x-frame-options	8
x-content-type-options	7
x-xss-protection	3
referrer-policy	1

I’m honestly shocked any were in use but only a handful or two are using even one “security” header. cran.csiro.au uses all five of the above so good on ya Commonwealth Scientific and Industrial Research Organisation!

I keep putting the word “security” in quotes as R does nothing with these headers when you do an install.packages(). As a whole they’re important but mostly when it comes to your safety when browsing those CRAN mirrors.

I would have liked to have seen at least one with some Content-Security-Policy header, but a girl can at least dream.

Version Aversion

There’s another HTTP response header we can look at, the Server one which is generally there to help attackers figure out whether they should target you further for HTTP server and application attacks. No, I mean it! Back in the day when geeks rules the internets — and it wasn’t just a platform for cat pictures and pwnd IP cameras — things like the Server header were cool because it might help us create server-specific interactions and build cool stuff. Yes, modern day REST APIs are likely better in the long run but the naiveté of the silver age of the internet was definitely something special (and also led to the chaos we have now). But, I digress.

In theory, no HTTP server in it’s rightly configured digital mind would tell you what it’s running down to the version level, but most do. (Again, feel free to pick nits that I let the world know I run nginx…or do I). Assuming the CRAN mirrors haven’t been configured to deceive attackers and report what folks told them to report we can survey what they run behind the browser window:

filter(hdrs, name == "server") %>%
  separate(
    value, c("kind", "version"), sep="/", fill="right", extra="merge"
  ) -> svr

count(svr, kind, sort=TRUE)

kind	n
Apache	57
nginx	15
cloudflare	2
CSIRO	1
Hiawatha v10.8.4	1
High Performance 8bit Web Server	1
none	1
openresty	1

I really hope Cloudflare is donating bandwidth vs charging these mirror sites. They’ve likely benefitted greatly from the diverse FOSS projects many of these sites serve. (I hadn’t said anything bad about Cloudflare yet so I had to get one in before the end).

Lots run Apache (makes sense since CRAN-proper does too, not that I can validate that from home since I’m IP blocked…bitter much, hrbrmstr?) Many run nginx. CSIRO likely names their server that on purpose and hasn’t actually written their own web server. Hiawatha is, indeed, a valid web server. While there are also “high performance 8bit web servers” out there I’m willing to bet that’s a joke header value along with “none”. Finally, “openresty” is also a valid web server (it’s nginx++).

We’ll pick on Apache and nginx and see how current patch levels are. Not all return a version number but a good chunk do:

apache_httpd_version_history() %>%
  arrange(rls_date) %>%
  mutate(
    vers = factor(as.character(vers), levels = as.character(vers))
  ) -> apa_all

filter(svr, kind == "Apache") %>%
  filter(!is.na(version)) %>%
  mutate(version = stri_replace_all_regex(version, " .*$", "")) %>%
  count(version) %>%
  separate(version, c("maj", "min", "pat"), sep="\\.", convert = TRUE, fill = "right") %>%
  mutate(pat = ifelse(is.na(pat), 1, pat)) %>%
  mutate(v = sprintf("%s.%s.%s", maj, min, pat)) %>%
  mutate(v = factor(v, levels = apa_all$vers)) %>%
  arrange(v) -> apa_vers

filter(apa_all, vers %in% apa_vers$v) %>%
  arrange(rls_date) %>%
  group_by(rls_year) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rls_date) -> apa_yrs

ggplot() +
  geom_blank(
    data = apa_vers, aes(v, n)
  ) +
  geom_segment(
    data = apa_yrs, aes(vers, 0, xend=vers, yend=Inf),
    linetype = "dotted", size = 0.25, color = "white"
  ) +
  geom_segment(
    data = apa_vers, aes(v, n, xend=v, yend=0),
    color = ft_cols$gray, size = 8
  ) +
  geom_label(
    data = apa_yrs, aes(vers, Inf, label = rls_year),
    family = font_rc, color = "white", fill = "#262a31", size = 4,
    vjust = 1, hjust = 0, nudge_x = 0.01, label.size = 0
  ) +
  scale_y_comma(limits = c(0, 15)) +
  labs(
    x = "Apache Version #", y = "# Servers",
    title = "CRAN Mirrors Apache Version History"
  ) +
  theme_ft_rc(grid="Y") +
  theme(axis.text.x = element_text(family = "mono", size = 8, color = "white"))

O_O

I’ll let you decide if a six-year-old version of Apache indicates how well a mirror site is run or not. Sure, mitigations could be in place but I see no statement of efficacy on any site so we’ll go with #lazyadmin.

But, it’s gotta be better with nginx, right? It’s all cool & modern!

nginx_version_history() %>%
  arrange(rls_date) %>%
  mutate(
    vers = factor(as.character(vers), levels = as.character(vers))
  ) -> ngx_all

filter(svr, kind == "nginx") %>%
  filter(!is.na(version)) %>%
  mutate(version = stri_replace_all_regex(version, " .*$", "")) %>%
  count(version) %>%
  separate(version, c("maj", "min", "pat"), sep="\\.", convert = TRUE, fill = "right") %>%
  mutate(v = sprintf("%s.%s.%s", maj, min, pat)) %>%
  mutate(v = factor(v, levels = ngx_all$vers)) %>%
  arrange(v) -> ngx_vers

filter(ngx_all, vers %in% ngx_vers$v) %>%
  arrange(rls_date) %>%
  group_by(rls_year) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(rls_date) -> ngx_yrs

ggplot() +
  geom_blank(
    data = ngx_vers, aes(v, n)
  ) +
  geom_segment(
    data = ngx_yrs, aes(vers, 0, xend=vers, yend=Inf),
    linetype = "dotted", size = 0.25, color = "white"
  ) +
  geom_segment(
    data = ngx_vers, aes(v, n, xend=v, yend=0),
    color = ft_cols$gray, size = 8
  ) +
  geom_label(
    data = ngx_yrs, aes(vers, Inf, label = rls_year),
    family = font_rc, color = "white", fill = "#262a31", size = 4,
    vjust = 1, hjust = 0, nudge_x = 0.01, label.size = 0
  ) +
  scale_y_comma(limits = c(0, 15)) +
  labs(
    x = "nginx Version #", y = "# Servers",
    title = "CRAN Mirrors nginx Version History"
  ) +
  theme_ft_rc(grid="Y") +
  theme(axis.text.x = element_text(family = "mono", color = "white"))

I will at close out this penultimate section with a “thank you!” to the admins at Georg-August-Universität Göttingen and Yamagata University for keeping up with web server patches.

You Made It This Far

If I had known you’d read to the nigh bitter end I would have made cookies. You’ll have to just accept the ones the blog gives your browser (those ones taste taste pretty bland tho).

The last lightweight element we’ll look at is “what else do these ‘secure’ CRAN mirrors run”?

To do this, we’ll turn to Rapid7 OpenData and look at what else is running on the IP addresses used by these CRAN mirrors. We already know some certs are promiscuous, so what about the servers themselves?

cran_mirror_other_things <- readRDS(here::here("data/cran-mirror-other-things.rds"))

# "top" 20
distinct(cran_mirror_other_things, ip, port) %>%
  count(ip, sort = TRUE) %>%
  head(20)

ip	n
104.25.94.23	8
143.107.10.17	7
104.27.133.206	5
137.208.57.37	5
192.75.96.254	5
208.81.1.244	5
119.40.117.175	4
130.225.254.116	4
133.24.248.17	4
14.49.99.238	4
148.205.148.16	4
190.64.49.124	4
194.214.26.146	4
200.236.31.1	4
201.159.221.67	4
202.90.159.172	4
217.31.202.63	4
222.66.109.32	4
45.63.11.93	4
62.44.96.11	4

Four isn’t bad since we kinda expect at least 80, 443 and 21 (FTP) to be running. We’ll take those away and look at the distribution:

distinct(cran_mirror_other_things, ip, port) %>%
  filter(!(port %in% c(21, 80, 443))) %>%
  count(ip) %>%
  count(n) %>%
  mutate(n = factor(n)) %>%
  ggplot() +
  geom_segment(
    aes(n, nn, xend = n, yend = 0), size = 10, color = ft_cols$gray
  ) +
  scale_y_comma() +
  labs(
    x = "Total number of running services", y = "# hosts",
    title = "How many other services do CRAN mirrors run?",
    subtitle = "NOTE: Not counting 80/443/21"
  ) +
  theme_ft_rc(grid="Y")

So, what are these other ports?

distinct(cran_mirror_other_things, ip, port) %>%
  count(port, sort=TRUE)

port	n
80	75
443	75
21	29
22	18
8080	6
25	5
53	2
2082	2
2086	2
8000	2
8008	2
8443	2
111	1
465	1
587	1
993	1
995	1
2083	1
2087	1

22 is SSH, 53 is DNS, 8000/8008/8080/8553 are web high ports usually associated with admin or API endpoints and generally a bad sign when exposed externally (especially on a “secure” mirror server). 25/465/587/993/995 all deal with mail sending and reading (not exactly a great service to have on a “secure” mirror server). I didn’t poke too hard but 208[2367] tend to be cPanel admin ports and those being internet-accessible is also not great.

Port 111 is sunrpc and is a really bad thing to expose to the internet or to run at all. But, the server is a “secure” CRAN mirror, so perhaps everything is fine.

FIN

While I hope this posts informs, I’ve worked in cybersecurity for ages and — as a result — don’t really expect anything to change. Tomorrow, I’ll still be blocked from the main CRAN & r-project.org site despite having better “security” than the vast majority of these “secure” CRAN mirrors (and was following the rules). Also CRAN mirror settings tend to be fairly invisible since most modern R users use the RStudio default (which is really not a bad choice from any “security” analysis angle), choose the first item in the mirror-chooser (Russian roulette!), or live with the setting in the site-wide Rprofile anyway (org-wide risk acceptance/”blame the admin”).

Since I only stated it way back up top (WordPress says this is ~3,900 words but much of that is [I think] code) you can get the full R project for this and examine the data yourself. There is a bit more data and code in the project since I also looked up the IP addresses in Rapid7’s FDNS OpenData study set to really see how many domains point to a particular CRAN mirror but really didn’t want to drag the post on any further.

Now, where did I put those Python 3 & Julia Jupyter notebooks…

2018 IEEE Security & Privacy (Filtered) Paper Dump

2018-04-03 – 10:20
Posted in AppSec, Cybersecurity, data driven security, data science
Leave a Comment

The 2018 IEEE Security & Privacy Conference is in May but they’ve posted their full proceedings and it’s better to grab them early than to wait for it to become part of a paid journal offering.

There are alot of papers. Not all match my interests but (fortunately?) many did and I’ve filtered down a list of the more interesting (to me) ones. It’s encouraging to see academic cybersecurity researchers branching out across a whole host of areas.

I can’t promise a the morning paper-esque daily treatment of these on the blog but I’ll likely exposit a few of them over the coming weeks. I’ve emoji’d a few that stood out. Order is the order I read them in (no other meaning to the order).

New viridis & colorbrewer palettes for ipv4-heatmap

2016-06-07 – 15:59
Posted in data driven security, data science, Data Visualization, DataVis, DataViz, Information Security
Tagged post
Comments (1)

It’s no seekrit that I :heart: Hilbert curve heatmaps of IPv4 space. Real-world IPv4 maps (i.e. the ones that drop dots on the Earth) have little utility, but with Hilbert curves maps of IPv4 space many different topologies can be superimposed (from ASNs to—if need be—geographic locations). Plus, there’s more opportunity to find patterns by keeping CIDRs naturally close to each other.

The Measurement Factory created the [`ipv4-heatmap`](http://maps.measurement-factory.com/) command-line utility back in 2007 and there have been some tweaks and expansions to it by others over time. I wanted to use these IPv4 heatmaps in the [National Exposure](https://community.rapid7.com/community/infosec/blog/2016/06/07/rapid7-releases-new-research) report I worked on with @todb & @jhartftw at @Rapid7 but _cannot stand_ the built-in red-blue color scheme, especially when there’s [viridis](https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html) available. So, I [forked the code](https://github.com/hrbrmstr/ipv4-heatmap) and added both viridis and [colorbrewer](colorbrewer2.org) palettes to it as command-line options.

Here are two examples (the results of the National Exposure study), one using viridis and one using the colorbrewer `rdbu` palette:

![](https://www.dropbox.com/s/we5f5u7ejj7cp8l/viridis.png?raw=1)

![](https://www.dropbox.com/s/uznijxxq99qoces/rdbu-inverted.png?raw=1)

You specify the palette with `-P palette` and can invert the order of any palette with `-i` and the chosen palette will also be used in any legend you add the visualization.

Since these 4096×4096 files are a bit big, you can hit up [this dropbox link](https://www.dropbox.com/sh/wqyly8ewxeko5jn/AAC5bHIpQTuxWGBPYzMqceLQa?dl=0) to see a “gallery” of the various forward and reverse palettes.

The palette selection code is a bit brute-force at the moment, mostly due to the fact that I’m planning on a C++ port of the code and eventual inclusion of the Hilbert heatmap functionality in the [`iptools`](http://github.com/hrbrmstr/iptools) package.

Visualizing Survey Data : Comparison Between Observations

2015-11-08 – 12:46
Posted in Cybersecurity, Data Analysis, data driven security, Data Visualization, DataVis, DataViz, ggplot, R, slopegraph
Tagged post
Comments (4)

Cybersecurity is a domain that really likes surveys, or at the very least it has many folks within it that like to conduct and report on surveys. One recent survey on threat intelligence is in it’s second year, so it sets about comparing answers across years. Rather than go into the many technical/statistical issues with this survey, I’d like to focus on alternate ways to visualize the comparison across years.

We’ll use the data that makes up this chart (Figure 3 from the report):

since it’s pretty representative of the remainder of the figures.

Let’s start by reproducing this figure with ggplot2:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(scales)
library(ggthemes)
library(extrafont)

loadfonts(quiet=TRUE)

read.csv("question.csv", stringsAsFactors=FALSE) %>%
  gather(year, value, -belief) %>%
  mutate(year=factor(sub("y", "", year)),
         belief=str_wrap(belief, 40)) -> question

beliefs <- unique(question$belief)
question$belief <- factor(beliefs, levels=rev(beliefs[c(1,2,4,5,3,7,6)]))

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity", position="dodge",
                    color="white", width=0.85)
gg <- gg + geom_text(aes(label=percent(value)), hjust=-0.15,
                     position=position_dodge(width=0.8), size=3)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.8))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Now, the survey does caveat the findings and talks about non-response bias, sampling-frame bias and self-reporting bias. However, nowhere does it talk about the margin of error or anything relating to uncertainty. Thankfully, both the 2014 and 2015 reports communicate population and sample sizes, so we can figure out the margin of error:

library(samplesize4surveys)

moe_2014 <- e4p(19915, 701, 0.5)
## With the parameters of this function: N = 19915 n =  701 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.709879 . 
## The margin of error is 3.635614 . 
## 

moe_2015 <- e4p(18705, 692, 0.5)
## With the parameters of this function: N = 18705 n =  692 P = 0.5 DEFF =  1 conf = 0.95 . 
## The estimated coefficient of variation is  3.730449 . 
## The margin of error is 3.655773 .

They are both roughly 3.65% so let's take a look at our dodged bar chart again with this new information:

mutate(question, ymin=value-0.0365, ymax=value+0.0365) -> question

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_bar(aes(fill=year), stat="identity",
                    position=position_dodge(0.85),
                    color="white", width=0.85)
gg <- gg + geom_linerange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.85),
                         size=1.5, color="#bdbdbd")
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.85))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Hrm. There seems to be a bit of overlap. Let's just focus on that:

gg <- ggplot(question, aes(belief, value, group=year))
gg <- gg + geom_pointrange(aes(ymin=ymin, ymax=ymax),
                         position=position_dodge(0.25),
                         size=1, color="#bdbdbd", fatten=1)
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,1))
gg <- gg + scale_fill_tableau(name="")
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

The report actually makes hard claims based on the year-over-year change in the answers to many of the questions (not just this chart). Most have these overlapping intervals. Now, I understand that when a paying customer says they want a report that they wouldn't really be satisfied with a one-pager saying "See last years's report", but not communicating the uncertainty in these results seems like a significant omission.

But, I digress. There are better (or at least alternate) ways than bars to show this comparison. One is a "dumbbell chart".

question %>%
  group_by(belief) %>%
  mutate(line_col=ifelse(diff(value)<0, "2015", "2014"),
         hjust=ifelse(diff(value)<0, -0.5, 1.5)) %>%
  ungroup() -> question

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

I've used line color to indicate whether the 2015 value increased or decreased from 2014.

But, we still have the issue of communicating the margin of error. One way I came up with (which is not perfect) is to superimpose the dot-plot on top of the entire margin of error interval. While it doesn't show the discrete start/end margin for each year it does help to show that making definitive statements on the value comparisons is not exactly a good idea:

group_by(question, belief) %>%
  summarize(xmin=min(ymin), xmax=max(ymax)) -> band

gg <- ggplot(question)
gg <- gg + geom_segment(data=band,
                        aes(x=xmin, xend=xmax, y=belief, yend=belief),
                        color="#bdbdbd", alpha=0.5, size=3)
gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col),
                     show.legend=FALSE)
gg <- gg + geom_point(aes(x=value, y=belief, color=year))
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=value, y=belief, label=percent(value),
                         hjust=hjust), size=2.5)
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8))
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

Finally, the year-to-year nature of the data was just begging for a slopegraph:

question %>% mutate(vjust=0.5) -> question
question[(question$belief=="Makes threat data more actionable") &
           (question$year=="2015"),]$vjust <- -1
question[(question$belief=="Reduces the cost of detecting and\npreventing cyber attacks") &
           (question$year=="2015"),]$vjust <- 1.5

question$year <- factor(question$year, levels=c("2013", "2014", "2015", "2016", "2017", "2018"))

gg <- ggplot(question)
gg <- gg + geom_path(aes(x=year, y=value, group=belief, color=line_col))
gg <- gg + geom_point(aes(x=year, y=value), shape=21, fill="black", color="white")
gg <- gg + geom_text(data=filter(question, year=="2015"),
                     aes(x=year, y=value,
                         label=sprintf("\u2000%s %s", percent(value),
                                       gsub("\n", " ", belief)),
                         vjust=vjust), hjust=0, size=3)
gg <- gg + geom_text(data=filter(question, year=="2014"),
                     aes(x=year, y=value, label=percent(value)),
                     hjust=1.3, size=3)
gg <- gg + scale_x_discrete(expand=c(0,0.1), drop=FALSE)
gg <- gg + scale_color_tableau(name="")
gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participating\n")
gg <- gg + theme_tufte(base_family="Arial Narrow")
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg <- gg + theme(legend.position="none")
gg <- gg + theme(plot.title=element_text(hjust=0.5))
gg <- gg + theme(plot.title=element_text(hjust=0))
gg

It doesn't help communicate uncertainty but it's a nice alternative to bars.

Hopefully this helps provide some alternatives to bars for these types of comparisons and also ways to communicate uncertainty without confusing the reader (communicating uncertainty to a broad audience is hard).

Perhaps those conducting surveys (or data analyses in general) could subscribe to a "data visualizers" paraphrase of a quote from Epidemics, Book I, of the Hippocratic school:

"Practice two things in your dealings with data: either help or do not harm the reader."

The full Rmd and data for this post is in this gist.