Recipe 10 Working With Custom Delimited Format Files

10.1 Problem

You have files in a custom delimited format that you want to work with from both Drill and R.

10.2 Solution

Add new entries to the Drill dfs storage plugin’s workspaces and formats sections, then use custom queries to cast columns to sane data types.

10.3 Discussion

A helpful chap—Ben Marwick—provided an example of a truly evil set of delimited files from the Wikimedia Commons that have pagecount data for all Wikimedia properties. The files are huge so we’ll be working with an extract. The link to the sample extract file will be posted soon, but it was created with:

$ bzcat pagecounts-2012-12-01.bz2 | head -1000024 > pagecounts-2012-02-01-excerpt.csvw
$ bzip2 --compress pagecounts-2012-02-01-excerpt.csvw

You’ll find out the reason for the csvw in a bit.

The files all look like this:

# Wikimedia page request counts for 01/12/2012 (dd/mm/yyyy)
#
# Each line shows 'project page daily-total hourly-counts'
#
# Project is 'language-code project-code'
#
# Project-code is
#
# b:wikibooks,
# k:wiktionary,
# n:wikinews,
# q:wikiquote,
# s:wikisource,
# v:wikiversity,
# z:wikipedia (z added by merge script: wikipedia happens to be sorted last in dammit.lt files, but without suffix)
#
# Counts format: only hours with page view count > 0 (or data missing) are represented,
#
# Hour 0..23 shown as A..X (saves up to 22 bytes per line compared to comma separated values), followed by view count.
# If data are missing for some hour (file missing or corrupt) a question mark (?) is shown,
# and a adjusted daily total is extrapolated as follows: for each missing hour the total is incremented with hourly average
#
# Page titles are shown unmodified (preserves sort sequence)
#
Ar.mw Ar 5 M3R2
De.mw De 3 T3
En.mw En 3 E1F2
aa.b ?banner=B12_5C_113020_hover_nohover 11 A11
aa.b File:Broom_icon.svg 2 D1X1
aa.b File:Commons-logo.svg 1 X1

Just with another 65 million lines (or so) per file.

They come bzipped, and while Drill can definitely handle files with the bz2 extension, we are going to rename these files so the delimited encoding type is obvious both when we browse the file system and when constructing queries.
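
The rename can be scripted from R; a sketch (the path is illustrative, adjust it to wherever you downloaded the files):

```r
# Insert "csvw" ahead of ".bz2" so the file name carries both the
# delimited format and the compression codec.
fils <- list.files("~/Data/wikimedia", pattern = "\\.bz2$", full.names = TRUE)
fils <- fils[!grepl("\\.csvw\\.bz2$", fils)]          # skip already-renamed files
file.rename(fils, sub("\\.bz2$", ".csvw.bz2", fils))
```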

The comments in the data files are helpful since they tell us the format:

  • comment character is #
  • space ' ' is the delimiter
  • project is really language-code and project-code separated by a . (which we’ll deal with here)
  • there’s a wonky bit of encoding in the hourly-counts (which we’ll deal with in another recipe)

We’re going to store all these pagecount files (we only have one due to size) in a ~/Data/wikimedia directory so we can apply an input storage format directory-wide (this isn’t 100% necessary but an organized filesystem is a happy filesystem) and create a special storage format for these files.

You’ll need to go to http://localhost:8047/storage and “update” the dfs storage plugin so we can add these new entries. While the editor has syntax checking, it is recommended that you either use a browser that lets you change the size of the input text area box or use an external text editor to do this work.

Find the line that has:

"workspaces": {

and add a new workspace, which is (for this filesystem example) just a dfs shortcut name to a location on the filesystem:

"wikimedia": {
  "location": "/Users/bob/Data/wikimedia",
  "writable": true,
  "defaultInputFormat": "wikicsv",
  "allowAccessOutsideWorkspace": false
},

This is JSON, so the trailing , is necessary unless it’s the last entry in workspaces. You’ll (again) need to change the path to wherever you’ve downloaded the file. We’ve taken a chance and made this directory writable and told Drill that the default format will be “wikicsv” files, which we’ll define next.

Now, find the line that starts with:

"formats": {

and add an entry for our new “wikicsv” format.

"wikicsv": {
  "type": "text",
  "extensions": [
    "csvw"
  ],
  "delimiter": " "
},

The default comment character for delimited files is “#”, so there’s no need to specify it here.

Save the configuration update and head over to http://localhost:8047/query (or use drill-conf on the command-line) to test out the new setup.

The example pagecount file is named pagecounts-2012-02-01-excerpt.csvw.bz2 which lets Drill know it’s a wikimedia delimited file and also that it’s bzip compressed.

Let’s try a query:

0: jdbc:drill:> SELECT * FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10;
+------------------------------------------------------------+
|                          columns                           |
+------------------------------------------------------------+
| ["Ar.mw","Ar","5","M3R2"]                                  |
| ["De.mw","De","3","T3"]                                    |
| ["En.mw","En","3","E1F2"]                                  |
| ["aa.b","?banner=B12_5C_113020_hover_nohover","11","A11"]  |
| ["aa.b","File:Broom_icon.svg","2","D1X1"]                  |
| ["aa.b","File:Commons-logo.svg","1","X1"]                  |
| ["aa.b","File:Incubator-notext.svg","5","C1G1V1X2"]        |
| ["aa.b","File:Wikibooks-logo.svg","1","X1"]                |
| ["aa.b","File:Wikimania.svg","1","X1"]                     |
| ["aa.b","File:Wikimedia-logo.svg","1","X1"]                |
+------------------------------------------------------------+
10 rows selected (0.268 seconds)

While it’s not exactly what we want, the good news is that the comment header was skipped and the file decompressed perfectly. However, we’ll need to use columns[#] to get at the data in each column. Let’s try that again:

0: jdbc:drill:> SELECT columns[0], columns[1], columns[2], columns[3] FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10;
+---------+--------------------------------------+---------+-----------+
| EXPR$0  |                EXPR$1                | EXPR$2  |  EXPR$3   |
+---------+--------------------------------------+---------+-----------+
| Ar.mw   | Ar                                   | 5       | M3R2      |
| De.mw   | De                                   | 3       | T3        |
| En.mw   | En                                   | 3       | E1F2      |
| aa.b    | ?banner=B12_5C_113020_hover_nohover  | 11      | A11       |
| aa.b    | File:Broom_icon.svg                  | 2       | D1X1      |
| aa.b    | File:Commons-logo.svg                | 1       | X1        |
| aa.b    | File:Incubator-notext.svg            | 5       | C1G1V1X2  |
| aa.b    | File:Wikibooks-logo.svg              | 1       | X1        |
| aa.b    | File:Wikimania.svg                   | 1       | X1        |
| aa.b    | File:Wikimedia-logo.svg              | 1       | X1        |
+---------+--------------------------------------+---------+-----------+
10 rows selected (0.18 seconds)

Better! However, those computed column names will have to go, and we also need to do something about the first column since it’s really two columns mashed together. Renaming is straightforward (the sqlline prefixes will be dropped from now on to make it easier to copy/paste the SQL queries; just assume they’re being run in a Drill web query box or at a sqlline prompt):

SELECT 
  columns[0] AS project,
  columns[1] AS page,
  CAST(columns[2] AS DOUBLE) AS daily_total,
  columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10;
+----------+--------------------------------------+--------------+-----------+
| project  |                 page                 | daily_total  |   hr_ct   |
+----------+--------------------------------------+--------------+-----------+
| Ar.mw    | Ar                                   | 5.0          | M3R2      |
| De.mw    | De                                   | 3.0          | T3        |
| En.mw    | En                                   | 3.0          | E1F2      |
| aa.b     | ?banner=B12_5C_113020_hover_nohover  | 11.0         | A11       |
| aa.b     | File:Broom_icon.svg                  | 2.0          | D1X1      |
| aa.b     | File:Commons-logo.svg                | 1.0          | X1        |
| aa.b     | File:Incubator-notext.svg            | 5.0          | C1G1V1X2  |
| aa.b     | File:Wikibooks-logo.svg              | 1.0          | X1        |
| aa.b     | File:Wikimania.svg                   | 1.0          | X1        |
| aa.b     | File:Wikimedia-logo.svg              | 1.0          | X1        |
+----------+--------------------------------------+--------------+-----------+
10 rows selected (1.039 seconds)

Along with having better column names, we also took the opportunity to ensure daily_total is a numeric type instead of a character type. The hr_ct column conversion will be covered in another recipe (we’ll need to make a Drill User Defined Function [UDF] for it), but we can handle separating project out here:

SELECT 
  SUBSTR(columns[0], 1, STRPOS(columns[0], '.')-1) AS language_code,
  SUBSTR(columns[0], STRPOS(columns[0], '.')+1) AS project_code,
  columns[1] AS page,
  CAST(columns[2] AS DOUBLE) AS daily_total,
  columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2` LIMIT 10;
+----------------+---------------+--------------------------------------+--------------+-----------+
| language_code  | project_code  |                 page                 | daily_total  |   hr_ct   |
+----------------+---------------+--------------------------------------+--------------+-----------+
| Ar             | mw            | Ar                                   | 5.0          | M3R2      |
| De             | mw            | De                                   | 3.0          | T3        |
| En             | mw            | En                                   | 3.0          | E1F2      |
| aa             | b             | ?banner=B12_5C_113020_hover_nohover  | 11.0         | A11       |
| aa             | b             | File:Broom_icon.svg                  | 2.0          | D1X1      |
| aa             | b             | File:Commons-logo.svg                | 1.0          | X1        |
| aa             | b             | File:Incubator-notext.svg            | 5.0          | C1G1V1X2  |
| aa             | b             | File:Wikibooks-logo.svg              | 1.0          | X1        |
| aa             | b             | File:Wikimania.svg                   | 1.0          | X1        |
| aa             | b             | File:Wikimedia-logo.svg              | 1.0          | X1        |
+----------------+---------------+--------------------------------------+--------------+-----------+
10 rows selected (0.188 seconds)

Now, we can try that from R with sergeant:

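A sketch of the sergeant call that produces the output below, assuming a Drill instance on localhost and the dplyr interface from sergeant; the wiki_pg name is illustrative:

```r
library(sergeant)
library(dplyr)

db <- src_drill("localhost")  # Drill REST endpoint on the default port

# wrap the projection query from above so dplyr can work with it
wiki_pg <- tbl(db, "(
SELECT
  SUBSTR(columns[0], 1, STRPOS(columns[0], '.')-1) AS language_code,
  SUBSTR(columns[0], STRPOS(columns[0], '.')+1) AS project_code,
  columns[1] AS page,
  CAST(columns[2] AS DOUBLE) AS daily_total,
  columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2`
)")

wiki_pg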
## # Source:   table<( SELECT SUBSTR(columns[0], 1, STRPOS(columns[0],
## #   '.')-1) AS language_code, SUBSTR(columns[0], STRPOS(columns[0],
## #   '.')+1) AS project_code, columns[1] AS page, CAST(columns[2] AS
## #   DOUBLE) AS daily_total, columns[3] AS hr_ct FROM
## #   dfs.wikimedia.`/*.csvw.bz2` )> [?? x 5]
## # Database: DrillConnection
##    language_code daily_total hr_ct    project_code page                   
##    <chr>               <dbl> <chr>    <chr>        <chr>                  
##  1 Ar                     5. M3R2     mw           Ar                     
##  2 De                     3. T3       mw           De                     
##  3 En                     3. E1F2     mw           En                     
##  4 aa                    11. A11      b            ?banner=B12_5C_113020_…
##  5 aa                     2. D1X1     b            File:Broom_icon.svg    
##  6 aa                     1. X1       b            File:Commons-logo.svg  
##  7 aa                     5. C1G1V1X2 b            File:Incubator-notext.…
##  8 aa                     1. X1       b            File:Wikibooks-logo.svg
##  9 aa                     1. X1       b            File:Wikimania.svg     
## 10 aa                     1. X1       b            File:Wikimedia-logo.svg
## # ... with more rows
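
The per-language summary that follows can be computed server-side with a dplyr count; a sketch, re-creating the tbl so the snippet stands alone (again, assuming a local Drill and an illustrative wiki_pg name):

```r
library(sergeant)
library(dplyr)

db <- src_drill("localhost")

wiki_pg <- tbl(db, "( SELECT SUBSTR(columns[0], 1, STRPOS(columns[0], '.')-1) AS language_code, SUBSTR(columns[0], STRPOS(columns[0], '.')+1) AS project_code, columns[1] AS page, CAST(columns[2] AS DOUBLE) AS daily_total, columns[3] AS hr_ct FROM dfs.wikimedia.`/*.csvw.bz2` )")

count(wiki_pg, language_code, sort = TRUE) %>%  # Drill does the counting
  collect()                                     # pull the summary into R
```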
## # A tibble: 23 x 2
##    language_code      n
##  * <chr>          <int>
##  1 ar            576618
##  2 az             88845
##  3 af             67132
##  4 an             46220
##  5 als            28824
##  6 ast            27216
##  7 arz            25133
##  8 am             20915
##  9 bar            19659
## 10 ba             18648
## 11 ang            16531
## 12 as             14390
## 13 bat-smg        11806
## 14 arc             9524
## 15 ace             9301
## 16 av              7092
## 17 ab              6177
## 18 ay              3720
## 19 ak              1522
## 20 aa               724
## 21 En                 1
## 22 De                 1
## 23 Ar                 1

As noted, converting the hr_ct column is complex enough that it needs its own recipe. And, take note of Recipe TBD, which shows how to convert these delimited files to parquet, something that can really speed up processing and reduce memory requirements.
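
The gist of that conversion is a CTAS (CREATE TABLE AS) statement, which we can issue from R with sergeant; a sketch (the output table name is illustrative, and the workspace must be writable, as we configured above):

```r
library(sergeant)

dc <- drill_connection("localhost")

# have CTAS write parquet for this session
drill_query(dc, "ALTER SESSION SET `store.format` = 'parquet'")

drill_query(dc, "
CREATE TABLE dfs.wikimedia.`pagecounts_parquet` AS
SELECT
  SUBSTR(columns[0], 1, STRPOS(columns[0], '.')-1) AS language_code,
  SUBSTR(columns[0], STRPOS(columns[0], '.')+1) AS project_code,
  columns[1] AS page,
  CAST(columns[2] AS DOUBLE) AS daily_total,
  columns[3] AS hr_ct
FROM dfs.wikimedia.`/*.csvw.bz2`
")
```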

NOTE: These Wikimedia Commons pagecount files would work best converted to parquet files and stored in a small Drill cluster, but they can be handled on a single-node system with the right configuration.