πŸ§ͺ πŸ•ΈοΈ Vite + πŸ¦† DuckDB via Observable's Standard Library

A Toy Modeling Example


Experiment Hypothesis:

We can use DuckDB to wrangle data for us, let R do some "modeling", and let Observable Plot show us the results

Experiment parameters:


When Will GreyNoise Have 1,000 "Tags"

GreyNoise will reach 1,000 tags on or about ❓❓❓❓❓❓.

Adopt, Adapt, And Improve

Building off of the previous experiment, today we will combine DuckDB data ops with WebR, letting R do some trivial modeling with glm on data we load and wrangle with DuckDB.

Let's be super clear, right up front: this data is small enough to load into R, process in R, and then model and plot in R without any other packages (save {svglite}). It is deliberately a toy example to make it easier to work with while showing the core concepts of loading from a database, doing a more than trivial database query, passing data to R, and getting a result back.

At work, one of the core work products from my team are what we call "tags". They are detection rules for vulnerability exploit checks/attempts, good/bad actors, and more. We're coming up on the human-psyche-significant "1,000" value for total number of tags. Today's example predicts when that happens based on the volume time series.

Here are the tables we have:

This is the schema for our tags table:

This is what's in it:

Now, we need to compute the cumulative sum for each day and keep track of days elapsed so we can pass those vectors to our model.

It's not a horrible SQL query, especially if we break it up using common table expressions (ref: duckdb.js):

-- Setup a date range that spans the entire min/max created_at
-- We need this b/c we don't have tags every day so there are
-- gaps in the time series
WITH date_range AS (
  SELECT UNNEST(generate_series(
    (SELECT MIN(created_at) FROM tags),
    (SELECT MAX(created_at) FROM tags),
    INTERVAL '1 day'
  )) AS date
),

-- count number of tags/day
grouped_tags AS (
  SELECT
  created_at,
    COUNT(*) AS daily_count
  FROM
    tags
  GROUP BY
    created_at
),

-- join to the full range and fill in values
joined_dates_counts AS (
  SELECT
    dr.date,
    COALESCE(gt.daily_count, 0) AS filled_daily_count
  FROM
    date_range dr
  LEFT JOIN
    grouped_tags gt
  ON
    dr.date = gt.created_at
)

-- get the cumulative sum and days since the min created_at
SELECT
  date,
  filled_daily_count,
  SUM(filled_daily_count) OVER (ORDER BY date) AS running_cumulative_sum,
  DATEDIFF('day', (SELECT MIN(date) FROM joined_dates_counts), date) AS days_elapsed
FROM
  joined_dates_counts;

Here's what those "tag stats" look like:

We will use R to predict when the tag count will reach a specified value, this is the function we'll be using (ref: r.js):

function(csum, days_elapsed, target_csum) {

  # saddest. model. ever.

  model <- glm(csum ~ days_elapsed, family = "poisson")

  predicted_days_elapsed <- days_elapsed
  predicted_days_elapsed_ret <- c()
  predicted_days_csum_ret <- c()

  while (TRUE) {

    predicted_days_elapsed <- max(predicted_days_elapsed) + 1

    predict(
      model, 
      newdata = data.frame(days_elapsed = predicted_days_elapsed), 
      type = "response"
    ) -> predicted_csum

    predicted_days_csum_ret <- c(predicted_days_csum_ret, predicted_csum)
    predicted_days_elapsed_ret <- c(predicted_days_elapsed_ret, predicted_days_elapsed)

    if (predicted_csum >= target_csum) break

  }

  data.frame(
    days_elapsed = predicted_days_elapsed_ret,
    tagCount = predicted_days_csum_ret
  )

}

Sure, that could be fancier, but we don't need fancy for this example.

We then use the fact that:

await R`function NAME(…) {}`

produces a callable JS function (also in r.js) and we use it with the vectors we made from the database

// call the function
const nDays = await predict(
  tagsCumSum.map(d => d.csum),
  tagsCumSum.map(d => d.days_elapsed),
  1_000
)

// get the last ("1,000" prediction) elapsed day and min date 
const lastDay = nDays.values[0].values[ nDays.values[0].values.length-1]
const minDate = ddbResToArray(
	await db.sql`SELECT min(created_at) AS min_date FROM tags`
)[0].min_date

// …

// display the computed "1,000" date
predictedDate.textContent = addDays(minDate, lastDay).toDateString()

Project Layout

Core files:

β”œβ”€β”€ index.md                  # what we render into index.html via the justfile
β”œβ”€β”€ src
β”‚Β Β  β”œβ”€β”€ components.css        # CSS specific to component styling
β”‚Β Β  β”œβ”€β”€ index.css             # core SSS
β”‚Β Β  β”œβ”€β”€ action-button.js      # Lit component for the button
β”‚Β Β  β”œβ”€β”€ data-frame-view.js    # Lit component for displaying tables
β”‚Β Β  β”œβ”€β”€ ojs-shorthand-plot.js # Lit component for Observable plots
β”‚Β Β  β”œβ”€β”€ simple-message.js     # Lit component for simple output messages/text
β”‚Β Β  β”œβ”€β”€ status-message.js     # Lit component for my WebR status message up top
β”‚Β Β  β”œβ”€β”€ main.js               # main app runner
β”‚Β Β  β”œβ”€β”€ r.js                  # WebR context creation and support functions
β”‚Β Β  β”œβ”€β”€ duckdb.js             # DuckDB context creation and support functions and queries
β”‚Β Β  └── utils.js              # Miscellaneous utilities
└──

FIN

You can find the source on GitHub.

Brought to you by @hrbrmstr

"Carnac" image by The Tonight Show Starring Johnny Carson, Fair use, https://en.wikipedia.org/w/index.php?curid=2560897