Awk Quoted CSV Example

Author

@hrbrmstr

Published

September 12, 2023

A demonstration that awk’s CSV mode (--csv) handles quoted fields.

Take a peek at the CSV file we’re using:

readLines("spike-01.csv", n=5) |> 
  writeLines()

"dt","hour","app_protocol","destination_port","unique","n_sensors","total"
"2022-07-02","2022-07-02 04:00:00.000","TLS","443","6","1","106"
"2022-07-04","2022-07-04 03:00:00.000","HTTPS","443","134","38","6281"
"2022-07-04","2022-07-04 02:00:00.000","HTTPS","80","7","10","326"
"2022-07-02","2022-07-02 17:00:00.000","HTTPS","443","290","36","4763"

So. Many. Quotes.

We’ll do some light processing of that with awk, using a custom {knitr} engine that’s included in this qmd file. Use the code tools in the top right corner to see the code chunks.

In the chunk below, the chunk options give each positional field a variable name (so $total resolves to field 7) and list the file(s) to operate on.

NR > 1 { 
  cum_total_by[$app_protocol ":" $destination_port] += $total 
} END { 
  for (proto_port in cum_total_by) 
    print proto_port " => " cum_total_by[proto_port] 
}

HTTPS:80 => 24206
TLS:443 => 5945
TLS:80 => 10
HTTPS:8090 => 12637
TLS:47001 => 1
HTTPS:443 => 155323
TLS:3389 => 115

---
title: "Awk Quoted CSV Example"
author: "@hrbrmstr"
date: "2023-09-12"
code-tools: true
format:
  html:
    self-contained: true
    embed-resources: true
    theme:
      light: flatly
      dark: darkly
engine: knitr
---

A demonstration that awk's CSV mode (`--csv`) handles quoted fields.

```{r setup, echo=FALSE}
# define custom knitr engine for our bespoke compiled awk with CSV support
knitr::knit_engines$set(
  cawk = function(opts) {

    code <- paste(opts$code, collapse = '\n')

    tf <- tempfile(fileext = ".awk")
    writeLines(code, tf)
    on.exit(unlink(tf))

    args <- c("-f", tf)

    # enable awk CSV parsing only when the chunk sets awk.csv=TRUE
    # (a plain !is.null() check would also fire for awk.csv=FALSE)
    if (isTRUE(opts[["awk.csv"]])) {
      args <- c(args, "--csv")
    }

    # enable awk variable assignment if requested
    awk_vars <- opts[grepl("^awk\\.var", names(opts))]

    if (length(awk_vars) > 0) {
      for (var_name in names(awk_vars)) {
        args <- c(
          args, "-v", 
          sprintf(
            "%s=%s", 
            sub("awk.var.", "", var_name, fixed=TRUE), 
            shQuote(awk_vars[[var_name]])
          )
        )
      }
    }

    # get data file(s) to process
    awk_files <- opts[grepl("^awk\\.file", names(opts))]

    if (length(awk_files) > 0) {
      for (fil in awk_files) {
        args <- c(args, shQuote(fil))
      }
    }

    out <- system2(  
      command = Sys.which("cawk"),
      args = args,
      stdout = TRUE
    )

    # so we get syntax highlighting
    opts$engine <- "awk"

    knitr::engine_output(
      options = opts, 
      code = code, 
      out = out,
      extra = NULL
    )

  }
)
```

Take a peek at the CSV file we're using:

```{r peek}
readLines("spike-01.csv", n=5) |> 
  writeLines()
```

So. Many. Quotes.
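
To see why the quotes matter, here is a minimal sketch using two rows copied from the file and a stock `awk` with plain comma splitting (no `--csv`, since not every installed awk has it). The variable names `rows`, `naive_sum`, and `stripped_sum` are only for illustration. With `-F,` the quotes stay inside each field, so the seventh field is the string `"106"` with its quotes, which awk treats as 0 in arithmetic.

```shell
# Two rows from spike-01.csv, quotes intact.
rows='"2022-07-02","2022-07-02 04:00:00.000","TLS","443","6","1","106"
"2022-07-04","2022-07-04 03:00:00.000","HTTPS","443","134","38","6281"'

# Plain comma splitting: $7 still carries its quotes, so it coerces to 0.
naive_sum=$(printf '%s\n' "$rows" | awk -F, '{ s += $7 } END { print s + 0 }')

# Stripping the quotes by hand recovers the numbers (106 + 6281).
stripped_sum=$(printf '%s\n' "$rows" |
  awk -F, '{ gsub(/"/, "", $7); s += $7 } END { print s + 0 }')

echo "naive: $naive_sum, stripped: $stripped_sum"
```

Hand-stripping is enough for these rows, but it would still split a quoted field that contains a comma, which is the case a real CSV mode is meant to handle.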

We'll do some light processing of that with awk, using a custom {knitr} engine that's included in this qmd file. Use the code tools in the top right corner to see the code chunks. 

In the chunk below, the chunk options give each positional field a variable name (so `$total` resolves to field 7) and list the file(s) to operate on.

```{cawk, awk-block, awk.csv=TRUE}
#| awk.var.dt: 1
#| awk.var.hour: 2
#| awk.var.app_protocol: 3
#| awk.var.destination_port: 4
#| awk.var.unique: 5
#| awk.var.n_sensors: 6
#| awk.var.total: 7
#| awk.file.1: "spike-01.csv"
NR > 1 { 
  cum_total_by[$app_protocol ":" $destination_port] += $total 
} END { 
  for (proto_port in cum_total_by) 
    print proto_port " => " cum_total_by[proto_port] 
}
```
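
The `awk.var.*` options end up as `-v name=index` assignments on the awk command line, and `$name` then means "the field whose number is stored in `name`". The sketch below shows that idea with a stock `awk`; it assumes an unquoted, three-column cut of the sample rows and uses `-F,` instead of `--csv` so it runs on any awk. The scratch directory and file names (`workdir`, `sample.csv`, `sum.awk`) are hypothetical.

```shell
workdir=$(mktemp -d)

# Unquoted cut of the first sample rows: app_protocol, destination_port, total.
cat > "$workdir/sample.csv" <<'EOF'
app_protocol,destination_port,total
TLS,443,106
HTTPS,443,6281
HTTPS,80,326
HTTPS,443,4763
EOF

# Same program as the chunk body; field positions come from -v assignments.
cat > "$workdir/sum.awk" <<'EOF'
NR > 1 {
  cum_total_by[$app_protocol ":" $destination_port] += $total
} END {
  for (proto_port in cum_total_by)
    print proto_port " => " cum_total_by[proto_port]
}
EOF

# for-in order is unspecified in awk, so sort the output to make it stable.
result=$(awk -F, -v app_protocol=1 -v destination_port=2 -v total=3 \
  -f "$workdir/sum.awk" "$workdir/sample.csv" | LC_ALL=C sort)

echo "$result"
rm -rf "$workdir"
```

Because the column positions live in variables rather than in the program text, the same script keeps working if the columns are reordered; only the `-v` assignments change.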