Quick Hit: Comparison of “Whole File Reading” Methods

(This is part 1 of n posts using this same data; n will likely be 2-3, and the posts are more around optimization than anything else.)

I recently had to analyze HTTP response headers (generated by a HEAD request) from around 74,000 sites (each response stored in a text file). They look like this:

HTTP/1.1 200 OK
Date: Mon, 08 Jun 2020 14:40:45 GMT
Server: Apache
Last-Modified: Sun, 26 Apr 2020 00:06:47 GMT
ETag: "ace-ec1a0-5a4265fd413c0"
Accept-Ranges: bytes
Content-Length: 967072
X-Frame-Options: SAMEORIGIN
Content-Type: application/x-msdownload

I do this quite a bit in R when we create new studies at work, but I’m usually only working with a few files. In this case I had to go through all these files to determine if a condition hypothesis (more on that in one of the future posts) was accurate.

Reading in a bunch of files (each one into a string) is fairly straightforward in R since readChar() will do the work of reading and we just wrap that in an iterator:

length(fils)
## [1] 73514 

# check file size distribution
summary(
  vapply(
    X = fils,
    FUN = file.size,
    FUN.VALUE = numeric(1),
    USE.NAMES = FALSE
  )
)
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 19.0   266.0   297.0   294.8   330.0  1330.0 

# they're all super small

system.time(
  vapply(
    X = fils, 
    FUN = function(.f) readChar(.f, file.size(.f)), 
    FUN.VALUE = character(1), 
    USE.NAMES = FALSE
  ) -> tmp 
)
##  user  system elapsed 
## 2.754   1.716   4.475 

NOTE: You can use lapply() or sapply() to equal effect as they all come in around 5 seconds on a modern SSD-backed system.

Now, five seconds is completely acceptable (though that brief pause does feel awfully slow for some reason), but can we do better? I mean we do have some choices when it comes to slurping up the contents of a file into a length 1 character vector:

  • base::readChar()
  • readr::read_file()
  • stringi::stri_read_raw() (+ rawToChar())

Do any of them beat {base}? Let’s see (using the largest of the files):

library(stringi)
library(readr)
library(microbenchmark)

largest <- fils[which.max(sapply(fils, file.size))]

file.size(largest)
## [1] 1330

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  79.862  93.5040  98.02751  95.3840 105.0125 161.566  1000
##    readr 163.874 186.3145 190.49073 189.1825 192.1675 421.256  1000
##  stringi  52.113  60.9690  67.17392  64.4185  74.9895 249.427  1000

I had predicted that the {stringi} approach would be slower given that we have to explicitly turn the raw vector into a character vector, but it is modestly faster. ({readr} has quite a bit of functionality baked into it — for good reasons — which doesn’t help it win any performance contests).

I still felt there had to be an even faster method, especially since I knew that the files all had HTTP response headers and that every one of them could easily be read into a string in (pretty much) one file read operation. That knowledge will let us make a C++ function that cuts some corners (more like “sands” some corners, really). We’ll do that right in R via {Rcpp} in this function (annotated in C++ code comments):

library(Rcpp)

cppFunction(code = '
String cpp_read_file(std::string fil) {

  // our input stream
  std::ifstream in(fil, std::ios::in | std::ios::binary);

  if (in) { // we can work with the file

  #ifdef _WIN32
    struct _stati64 st; // gosh i hate windows
    _stati64(fil.c_str(), &st); // narrow-path variant; this shld work but I did not test it
  #else
    struct stat st;
    stat(fil.c_str(), &st);
  #endif

    std::string out(st.st_size, 0); // make a string buffer to hold the data

    in.seekg(0, std::ios::beg); // ensure we are at the beginning
    in.read(&out[0], st.st_size); // read in the file
    in.close();

    return(out);

  } else {
    return(NA_STRING); // file missing or other errors returns NA
  }

}
', includes = c(
  "#include <fstream>",
  "#include <string>",
  "#include <sys/stat.h>"
))
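
For anyone who wants to poke at the same idea outside of R, here is a minimal standalone sketch of the single-read technique. It is not the function above: it assumes a C++17 compiler and swaps the platform-specific stat() calls for std::filesystem::file_size(), and the name read_whole_file (plus the std::optional return standing in for NA) is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical helper (not part of the post): slurp a whole file into a string
// with a single read() call, sizing the buffer up front.
std::optional<std::string> read_whole_file(const std::string& path) {

  // ask the filesystem for the size; the error_code overload does not throw
  std::error_code ec;
  const std::uintmax_t size = std::filesystem::file_size(path, ec);
  if (ec) return std::nullopt; // missing file or other errors

  std::ifstream in(path, std::ios::in | std::ios::binary);
  if (!in) return std::nullopt; // could not open the file

  std::string out(size, '\0'); // pre-sized buffer to hold the data
  in.read(&out[0], static_cast<std::streamsize>(size)); // one read for the whole file
  out.resize(static_cast<std::size_t>(in.gcount())); // keep only what was actually read

  return out;
}
```

Calling read_whole_file() on a path gives back the file contents, or an empty optional when the file cannot be read, which plays the role that NA_STRING plays in the {Rcpp} version. Whether file_size() is any faster or slower than stat() is not something I measured here.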

Is that going to be faster?

microbenchmark(
  base = readChar(largest, file.size(largest)),
  readr = read_file(largest),
  stringi = rawToChar(stri_read_raw(largest)),
  rcpp = cpp_read_file(largest),
  times = 1000,
  control = list(warmup = 100)
)
## Unit: microseconds
##     expr     min       lq      mean   median       uq     max neval
##     base  80.500  91.6910  96.82752  94.3475 100.6945 295.025  1000
##    readr 161.679 175.6110 185.65644 186.7620 189.7930 399.850  1000
##  stringi  51.959  60.8115  66.24508  63.9250  71.0765 171.644  1000
##     rcpp  15.072  18.3485  21.20275  21.0930  22.6360  62.988  1000

It sure looks like it, but let’s put it to the test:

system.time(
  vapply(
    X = fils, 
    FUN = cpp_read_file, 
    FUN.VALUE = character(1), 
    USE.NAMES = FALSE
  ) -> tmp 
)
##  user  system elapsed 
## 0.446   1.244   1.693 

I’ll take a two-second wait over a five-second wait any day!

FIN

I have a few more cases coming up where there will be 3-5x the number of (similar) files that I’ll need to process, and this optimization will shave time off as I iterate through each analysis, so the trivial benefits here will pay off more down the road.

The next post in this particular series will show how to use the {future} family to reduce the time it takes to turn those HTTP headers into data we can use.

If I missed your favorite file slurping function, drop a note in the comments and I’ll update the post with new benchmarks.


Comments

  1. hrbrmstr

    #ty for catching that! I simplified the function to avoid dealing with wide character filename edge cases in Windows and neglected to replace that reference with fil.c_str().

    The line is fixed. It now is:

    _stati64(fil.c_str(), &st);

  2. Hongyuan Jia

    I would like to know whether data.table::fread(sep = NULL) gives promising results. When sep is set to NULL, fread reads each line as a single string and puts the lines in one column.
