Measuring & Monitoring Internet Speed with R

Working remotely has many benefits, but if you work remotely in an area like, say, rural Maine, one of those benefits is not massively speedy internet connections. Being able to go fast and furious on the internet is one of the many things I miss about our time in Seattle and it is unlikely that we’ll be seeing Google Fiber in my small town any time soon. One other issue is that residential plans from evil giants like Comcast come with things like “bandwidth caps”. I suspect many WFH-ers can live within those limits, but I work with internet-scale data and often shunt extracts or whole datasets to the DatCave™ server farm for local processing. As such, I pay an extra penalty as a Comcast “Business-class” user that has little benefit besides getting slightly higher QoS and some faster service response times when there are issues.

Why go into all that? Well, I recently decided to bump up the connection from 100 Mb/s to 150 Mb/s (and managed to do so w/o increasing the overall monthly bill at all) but wanted to have a way to regularly verify I am getting what I’m paying for without having to go to an interactive “speed test” site.

There’s a handy speedtest-cli, a Python-based module with a command-line interface that can perform speed tests against Ookla’s legacy servers. (You’ll notice that link forwards you to their new socket-based test service; neither speedtest-cli nor the code in the package shown here uses that yet.)

I run plenty of ruby, python, node and go (et al) programs on the command-line, but I wanted a way to measure this in R-proper (storing individual test results) on a regular basis as well as run something from the command-line whenever I wanted an interactive test. Thus begat speedtest.

Testing the Need for Speed

After you devtools::install_github("hrbrmstr/speedtest") the package, you can either type speedtest::spd_test() at an R console or Rscript --quiet -e 'speedtest::spd_test()' on the command-line to see something like the following:

What you see there is a short-hand version of what’s available in the package:

  • spd_best_servers: Find “best” servers (latency-wise) from master server list
  • spd_closest_servers: Find “closest” servers (geography-wise) from master server list
  • spd_compute_bandwidth: Compute bandwidth from bytes transferred and time taken
  • spd_config: Retrieve client configuration information for the speedtest
  • spd_download_test: Perform a download speed/bandwidth test
  • spd_servers: Retrieve a list of SpeedTest servers
  • spd_upload_test: Perform an upload speed/bandwidth test
  • spd_test: Test your internet speed/bandwidth

The general idiom is to grab the test configuration file, collect the master list of servers, figure out which servers you’re going to test against and perform upload/download tests + collect the resultant statistics:

library(speedtest)
library(stringi)
library(hrbrthemes)
library(ggbeeswarm)
library(tidyverse)

config <- spd_config()

servers <- spd_servers(config=config)
closest_servers <- spd_closest_servers(servers, config=config)
only_the_best_servers <- spd_best_servers(closest_servers, config)

glimpse(spd_download_test(closest_servers[1,], config=config))
## Observations: 1
## Variables: 15
## $ url     <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat     <dbl> 42.3875
## $ lng     <dbl> -71.1
## $ name    <chr> "Somerville, MA"
## $ country <chr> "United States"
## $ cc      <chr> "US"
## $ sponsor <chr> "Axcelx Technologies LLC"
## $ id      <chr> "5960"
## $ host    <chr> "speed0.xcelx.net:8080"
## $ url2    <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min     <dbl> 14.40439
## $ mean    <dbl> 60.06834
## $ median  <dbl> 55.28457
## $ max     <dbl> 127.9436
## $ sd      <dbl> 34.20695

glimpse(spd_upload_test(only_the_best_servers[1,], config=config))
## Observations: 1
## Variables: 18
## $ ping_time      <dbl> 0.02712567
## $ total_time     <dbl> 0.059917
## $ retrieval_time <dbl> 2.3e-05
## $ url            <chr> "http://speed0.xcelx.net/speedtest/upload.php"
## $ lat            <dbl> 42.3875
## $ lng            <dbl> -71.1
## $ name           <chr> "Somerville, MA"
## $ country        <chr> "United States"
## $ cc             <chr> "US"
## $ sponsor        <chr> "Axcelx Technologies LLC"
## $ id             <chr> "5960"
## $ host           <chr> "speed0.xcelx.net:8080"
## $ url2           <chr> "http://speed1.xcelx.net/speedtest/upload.php"
## $ min            <dbl> 6.240858
## $ mean           <dbl> 9.527599
## $ median         <dbl> 9.303148
## $ max            <dbl> 12.56686
## $ sd             <dbl> 2.451778

Spinning More Wheels

The default operation for each test function is to return summary statistics. If you dig into the package source, or look at how the legacy measuring site performs the speed test tasks, you’ll see that the process involves uploading and downloading a series of files (or, more generally, random data) of increasing size and calculating how long each transfer takes. Smaller-size tests will have skewed results on fast connections with low latency, which is why the sites run incremental tests (and why you see the “needles” move the way they do on interactive tests). We can disable the summary statistics and retrieve all of the individual test results (as we’ll do in a bit).
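
To make the arithmetic concrete: bandwidth is just bits moved divided by elapsed time. The numbers below are made up purely for illustration and this is not the package’s internal code (spd_compute_bandwidth() handles this for you), but the idea is:

bytes_xfer <- 12500000  # hypothetical 12.5 MB test payload
secs_taken <- 0.83      # hypothetical elapsed wall-clock time for the transfer

(bytes_xfer * 8) / secs_taken / 1e6  # bits / seconds, scaled to megabits
## [1] 120.4819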

Speed tests are also highly dependent on the processing capability of the target server as well as its network location (hops away), “quality” and load. You’ll often see interactive sites choose the “best” server, and there are many ways to do that. One is geography, which has some merit but shouldn’t be the only factor used. Another is latency: issue a small request to each candidate server and measure which one responds fastest.
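
To make the latency idea a bit more tangible, here’s a rough probe that times a few tiny requests to each candidate and ranks by the median round-trip time. spd_best_servers() already does this (and more) for you; the sketch below is purely illustrative and assumes the url column returned by spd_servers() plus the httr package:

library(httr)

# Time a handful of tiny requests to a server URL and take the median;
# the response itself is ignored, only the round-trip time matters here.
probe_latency <- function(url, tries=3) {
  median(map_dbl(seq_len(tries), ~{
    started <- Sys.time()
    try(HEAD(url, timeout(2)), silent=TRUE)
    as.numeric(Sys.time() - started, units="secs")
  }))
}

closest_servers[1:5,] %>%
  mutate(latency=map_dbl(url, probe_latency)) %>%
  arrange(latency) %>%
  select(sponsor, name, latency)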

It’s straightforward to show the impact of each approach with a simple test harness. We’ll pick the 3 geographically “closest” servers, the 3 “best” (latency-wise) servers (those may overlap with “closest”) and 3 randomly chosen ones:

set.seed(8675309)

bind_rows(

  closest_servers[1:3,] %>%
    mutate(type="closest"),

  only_the_best_servers[1:3,] %>%
    mutate(type="best"),

  filter(servers, !(id %in% c(closest_servers[1:3,]$id, only_the_best_servers[1:3,]$id))) %>%
    sample_n(3) %>%
    mutate(type="random")

) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 9 x 5
##                   sponsor            name       country
##                     <chr>           <chr>         <chr>
## 1 Axcelx Technologies LLC  Somerville, MA United States
## 2                 Comcast      Boston, MA United States
## 3            Starry, Inc.      Boston, MA United States
## 4 Axcelx Technologies LLC  Somerville, MA United States
## 5 Norwood Light Broadband     Norwood, MA United States
## 6       CCI - New England  Providence, RI United States
## 7                 PirxNet         Gliwice        Poland
## 8           Interoute VDC Los Angeles, CA United States
## 9                   UNPAD         Bandung     Indonesia
## # ... with 2 more variables: host <chr>, type <chr>

As predicted, there are some duplicates, but we’ll perform upload and download tests for each of them and compare the results. First, download:

map_df(1:nrow(to_compare), ~{
  spd_download_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> dl_results_full

mutate(dl_results_full, type=stri_trans_totitle(type)) %>%
  ggplot(aes(type, bw, fill=type)) +
  geom_quasirandom(aes(size=size, color=type), width=0.15, shape=21, stroke=0.25) +
  scale_y_continuous(expand=c(0,5), labels=c(sprintf("%s", seq(0,150,50)), "200 Mb/s"), limits=c(0,200)) +
  scale_size(range=c(2,6)) +
  scale_color_manual(values=c(Random="#b2b2b2", Best="#2b2b2b", Closest="#2b2b2b")) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Download bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

I’m going to avoid hammering really remote servers with the upload test (truth-be-told I just don’t want to wait as long as I know it’s going to take):

bind_rows(
  closest_servers[1:3,] %>% mutate(type="closest"),
  only_the_best_servers[1:3,] %>% mutate(type="best")
) %>%
  distinct(.keep_all=TRUE) -> to_compare

select(to_compare, sponsor, name, country, host, type)
## # A tibble: 6 x 5
##                   sponsor           name       country
##                     <chr>          <chr>         <chr>
## 1 Axcelx Technologies LLC Somerville, MA United States
## 2                 Comcast     Boston, MA United States
## 3            Starry, Inc.     Boston, MA United States
## 4 Axcelx Technologies LLC Somerville, MA United States
## 5 Norwood Light Broadband    Norwood, MA United States
## 6       CCI - New England Providence, RI United States
## # ... with 2 more variables: host <chr>, type <chr>

map_df(1:nrow(to_compare), ~{
  spd_upload_test(to_compare[.x,], config=config, summarise=FALSE, timeout=30)
}) -> ul_results_full

ggplot(ul_results_full, aes(x="Upload Test", y=bw)) +
  geom_quasirandom(aes(size=size, fill="col"), width=0.1, shape=21, stroke=0.25, color="#2b2b2b") +
  scale_y_continuous(expand=c(0,0.5), breaks=seq(0,16,4),
                     labels=c(sprintf("%s", seq(0,12,4)), "16 Mb/s"), limits=c(0,16)) +
  scale_size(range=c(2,6)) +
  scale_fill_ipsum() +
  labs(x=NULL, y=NULL, title="Upload bandwidth test by selected server type",
       subtitle="Circle size scaled by size of file used in that speed test") +
  theme_ipsum_rc(grid="Y") +
  theme(legend.position="none")

As indicated, data payload size definitely impacts the speed tests (as does “proximity”), but you’ll also see variance if you run the same tests again against the same target servers. Note, also, that the command-line-ready spd_test() function tries to make the quickest target selection and only uses the max values.
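
For the “monitoring” half of the equation, a scheduled job that appends each run’s summary to a CSV is enough to trend things over time (say, a cron entry that calls Rscript). This is just one way to wire it up; the file name, the columns and the “best server” choice below are mine, not anything baked into the package:

# speed_log.R -- run on a schedule, e.g. hourly via cron + Rscript
library(speedtest)
library(tibble)
library(readr)

config  <- spd_config()
servers <- spd_servers(config=config)
best    <- spd_best_servers(spd_closest_servers(servers, config=config), config)

dl <- spd_download_test(best[1,], config=config)
ul <- spd_upload_test(best[1,], config=config)

log_row <- tibble(
  timestamp = Sys.time(),
  host      = best$host[1],
  dl_median = dl$median,
  ul_median = ul$median
)

# append to the log, writing the header only on the first run
write_csv(log_row, "speedtest_log.csv", append=file.exists("speedtest_log.csv"))

The resulting CSV is trivial to chart with the same ggplot2 tooling used above, which covers the “am I getting what I’m paying for?” question without having to babysit an interactive test page.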

FIN

The GitHub site has a list of TODOs and any/all are welcome to pile on (full credit for any contributions, as usual). So kick the tyres, file issues and start measuring!


Comments

  1. Nick Forrester

    Well, this is depressing… not only do you continue to throw out new and interesting packages like it’s no effort at all, but I now see that I’m struggling along with a measly 24 Mbit/s download speed compared to your 100+ (maybe a good simile for UK vs US).
    On a serious note, really enjoy reading all your stuff.
    Cheers


  2. Michael

    This seems really useful; thanks for creating it. I’ve yet to play with it, but a preliminary question: how intensive are the upload/download speed tests, i.e., while they run, how much bandwidth do they chew up pinging and so forth?

    I ask because my use case might involve using them for frequent tests in a continuously running script over a web socket, but I’m wondering if the test is bandwidth expensive.

    Again, thanks

    1. hrbrmstr

      Thx. It should be fine for regular use and a number of products use similar frameworks with similar d/l & u/l test sizes. I’ve been able to forego using this regularly (I still use it interactively occasionally) as I’ve got a Ubiquiti firewall that has this test functionality baked in, and I run a RIPE Atlas probe which also has similar tests.

      Fundamentally, the R functions perform the same level of testing that the browser-based speed tests do.

      Upload sizes are here: https://github.com/hrbrmstr/speedtest/blob/master/R/upload.r#L34

      Download sizes are here: https://github.com/hrbrmstr/speedtest/blob/master/R/download.r#L44

      If having a parameter to choose sizes would be useful, pls shoot an issue over at GitLab or GitHub and I’ll see when I can get to it.

