Using MonetDB[Lite] with real-world CSV files

MonetDBLite (for R) was announced/released today and, while the examples they provide are compelling there’s a “gotcha” for potential new folks using SQL in general and SQL + MonetDB + R together. The toy example on the site shows dumping mtcars with dbWriteTable and then doing things. Real-world CSV files have headers and commas (MonetDB by default expects no headers and | as a separator). Also, you need to make a MonetDB table (with a schema) before copying your giant CSV file full of data into it. That’s a pain to do by hand.

Here’s another toy example that shows how to:

  • use a specific directory for the embedded MonetDB files
  • auto-generate the CREATE TABLE syntax from a sample of the real-world CSV file
  • load the data from the real-world CSV file (i.e. skipping the header and using a , as a delimiter
  • wire it up to R & dplyr

It’s very similar to the MonetDBLite toy example but may help folks get up and running in the real world with less frustration.

library(MonetDBLite)
library(MonetDB.R)
library(dplyr)
 
# use built-in mtcars to make a CS File
# we're more likely to find a file in this format vs what dbWriteTable produces
# i.e. it has a header and commas for separator
write.csv(add_rownames(mtcars, "auto"), "mtcars.csv", row.names=FALSE)
 
# make a connection and get rid of the old table if it exists since
# we are just playing around. in real life you prbly want to keep
# the giant table there vs recreate it every time
mdb <- dbConnect(MonetDBLite(), "/full/path/to/your/preferred/monetdb/data/dir")
try(invisible(dbSendQuery(mdb, "DROP TABLE mtcars")), silent=TRUE)
 
# now we guess the column types by reading in a small fraction of the rows
guess <- read.csv("mtcars.csv", stringsAsFactors=FALSE, nrows=1000)
create <- sprintf("CREATE TABLE mtcars ( %s )", 
                  paste0(sprintf('"%s" %s', colnames(guess), 
                                 sapply(guess, dbDataType, dbObj=mdb)), collapse=","))
 
# we build the table creation dynamically from what we've learned from guessing
invisible(dbSendQuery(mdb, create))
 
# and then we load the data into the database, skipping the header and specifying a comma
invisible(dbSendQuery(mdb, "COPY OFFSET 2 
                                 INTO mtcars 
                                 FROM '/full/path/to/where/you/wrote/the/csv/to/mtcars.csv' USING  DELIMITERS ','"))
 
# now wire it up to dplyr
mdb_src <- src_monetdb(embedded="/full/path/to/your/preferred/monetdb/data/dir")
mdb_mtcars <- tbl(mdb_src, "mtcars")
 
# and have some fun
count(mdb_mtcars, cyl)
 
## Source: MonetDB  ()
## From: <derived table> [?? x 2]
## 
##      cyl     n
##    (int) (dbl)
## 1      6     7
## 2      4    11
## 3      8    14
## ..   ...   ...
Data-Driven Security Podcast
Episode 29: With Great Power Law…
R World News Podcast — Episode 2
Buy on AmazonDDS Blog
DDS PodcastAmazon Author Page

3 Comments Using MonetDB[Lite] with real-world CSV files

  1. Pingback: Using MonetDBLite with real-world CSV files | Dinesh Ram Kali.

  2. charles

    Thanks very much for this post. I’be been trying to grok the interface between MonetDB and R and this helped quite a bit. I noticed that MonetDB.R has a monetdb.read.csv() function. I tried it on your example and it seems to work.

    Reply
  3. Hannes Mühleisen

    Hi, thanks for your post! Quick hint, explicitly including the MonetDBLite package is not required. Otherwise, I also recommend the monetdb.read.csv function which tries to automate the CSV loading process.

    Reply

Leave a Reply