
@ted_dunning recently updated the t-digest algorithm he created back in 2013. What is this “t-digest”? Fundamentally, it is a probabilistic data structure for estimating any percentile of distributed/streaming data. Ted explains it quite elegantly in this short video:

Said video has a full transcript as well.
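
The gist of the approach: a t-digest maintains a small, bounded set of weighted centroids that summarize the distribution seen so far, merging neighboring centroids as new values arrive (the real algorithm uses a scale function that keeps centroids tiny near the tails, which is why extreme percentiles stay accurate). Here is a deliberately simplified toy sketch of that core idea in R (this is not the actual t-digest algorithm, just an illustration that merges the closest pair of centroids under an arbitrary size cap):

# toy "digest": a data frame of weighted centroids, capped at 25 entries
toy_add <- function(centroids, x) {
  centroids <- rbind(centroids, data.frame(mean = x, weight = 1))
  if (nrow(centroids) > 25) {
    centroids <- centroids[order(centroids$mean), ]
    i <- which.min(diff(centroids$mean)) # find the two closest centroids
    w <- centroids$weight[i] + centroids$weight[i + 1]
    centroids$mean[i] <- (centroids$mean[i] * centroids$weight[i] +
                          centroids$mean[i + 1] * centroids$weight[i + 1]) / w
    centroids$weight[i] <- w # fold them into one weighted centroid
    centroids <- centroids[-(i + 1), , drop = FALSE]
  }
  centroids
}

cd <- data.frame(mean = numeric(0), weight = numeric(0))
for (v in runif(1000)) cd <- toy_add(cd, v)
nrow(cd) # bounded at 25 centroids no matter how many values streamed in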

T-digests have been baked into many “big data” analytics ecosystems for a while, but I hadn’t seen any R packages for them (ref any in a comment if you do know of some), so I wrapped one of the low-level implementation libraries by ajwerner into a diminutive R package boringly, but appropriately, named tdigest:

There are wrappers for the low-level accumulators and quantile/value extractors along with vectorised functions for creating t-digest objects and retrieving quantiles from them (including a tdigest S3 method for stats::quantile()).

This:

install.packages("tdigest", repos="https://cinc.rud.is/")

will install from source or binaries onto your system(s).

Basic Ops

The low-level interface is more useful in “streaming” operations (i.e. accumulating input over time):

set.seed(2019-04-03)

td <- td_create()

for (i in 1:100000) {
  td_add(td, sample(100, 1), 1) # add one sampled value at a time, each with weight 1
}

quantile(td)
## [1]   1.00000  25.62222  53.09883  74.75522 100.00000
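
If you only need a single value at a time there are also one-shot extractors; a quick sketch using the td built above (td_value_at() and td_quantile_of() are what I believe the package exports for this; double-check against the docs):

td_value_at(td, 0.5)   # value at the 50th percentile
td_value_at(td, 0.99)  # value at the 99th percentile
td_quantile_of(td, 50) # approximate quantile of the value 50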

More R-like Ops

Vectorisation is the name of the game in R and we can use tdigest() to work in a vectorised manner:

set.seed(2019-04-03)

x <- sample(100, 1000000, replace=TRUE)

td <- tdigest(x)

quantile(td)
## [1]   1.00000  25.91914  50.79468  74.76439 100.00000
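
And, because of the tdigest S3 method for stats::quantile() mentioned above, specific probabilities can be requested the usual R way (assuming the method honors the standard probs argument):

quantile(td, probs = c(0.01, 0.25, 0.5, 0.75, 0.99))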

Need for Speed

The t-digest algorithm was designed for both streaming operations and speed. It’s pretty darned fast:

microbenchmark::microbenchmark(
  tdigest = tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1)),
  r_quantile = quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
)
## Unit: microseconds
##        expr      min         lq        mean    median       uq       max neval
##     tdigest    22.81    26.6525    48.70123    53.355    63.31    151.29   100
##  r_quantile 57675.34 59118.4070 62992.56817 60488.932 64731.23 160130.50   100

Note that “accurate” is not the same thing as “precise”, so regular quantile ops in R will be close to what t-digest computes, but not always exactly the same.
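
A quick way to see that “close but not exact” behavior is to line the two up on the same probabilities (outputs omitted since the deltas vary run-to-run):

probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
tquantile(td, probs)       # t-digest approximation
unname(quantile(x, probs)) # exact (type-7) quantiles from base R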

FIN

This was a quick (but complete) wrapper and could use some tyre kicking. I’ve a mind to add serialization to the C implementation so I can then enable [de]serialization on the R side, since that would (IMO) make t-digest ops more useful in an R context, especially since you can merge two different t-digests.
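
For the curious, merging is already exposed at the low level; a hedged sketch (assuming td_merge(from, into) folds the first t-digest into the second, per the underlying C library):

td1 <- tdigest(sample(100, 500000, replace = TRUE))
td2 <- tdigest(sample(100, 500000, replace = TRUE))
td_merge(td1, td2) # accumulate td1's centroids into td2
quantile(td2)      # quantiles now reflect both batches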

As always, code/PR where you want to and file issues with any desired functionality/enhancements.

Also, whoever started the braces notation for package names (e.g. {ggplot2}): brilliant!

5 Comments

  1. How long does the tdigest operation take? Isn’t that the more fair comparison to R’s quantile function?

    • I was really hoping someone would ask that! (I deliberately left it out just to see how carefully folks read it.) I’ll definitely add that, but I’m also going to suggest that the independent metric is useful since R’s quantile() can only operate on vectors and not in a streaming capacity. So both measures are definitely needed for comparison.

      • While it depends on your definition of ‘streaming’, my fromo package supports a running (or ‘rolling’) approximate quantile computation based on moments and the Cornish-Fisher approximation: https://rdrr.io/cran/fromo/man/runningquantiles.html

        I would also not be surprised if some improvements in base::quantile() were possible, based on experiments I did some years ago, but that is another question entirely.

        • (We’ll see how this formats…different machine for the benchmarks, but such is life)

          microbenchmark::microbenchmark(
            tdigest = {
              td <- tdigest(x, 1000)
              tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
              td_add(td, 30, 1)
              tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
              td_add(td, 50, 1)
              tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
            },
            r_quantile = {
              quantile(x, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
              quantile(c(x, 30), c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
              quantile(c(x, 30, 50), c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
            }
          )
          ## Unit: milliseconds
          ##        expr       min       lq      mean    median        uq      max neval cld
          ##     tdigest  62.16809  63.0422  67.10284  64.39257  68.09766 188.4217   100  a 
          ##  r_quantile 141.03567 146.0779 160.58416 149.82952 155.60319 291.3323   100   b
          

          If you read the linked resources in the post, the main context for t-digest is actual accumulation of values (and it makes a lot more sense in a Spark or Hadoop context than it does in R, IMO). I did the above benchmarks because they show that it is indeed much faster. It’s a different use-case and not designed as a substitute for base quantile(); but if you do want to re-query the same (or different) quantiles after the initial t-digest creation, it wins handily over re-running base R’s quantile().

          • I’m kind of amazed WP managed to format that half-decently.


