Recipe 11 Writing “Simple” Drill Custom Functions (UDFs) For Field Transformations

11.1 Problem

You have a field that requires complex transformations in order to be useful in a Drill context.

11.2 Solution

Write a Drill custom function, also known as a user-defined function (UDF).

11.3 Discussion

In Recipe TBD we came across a field that holds encoded hourly counts of pageviews for Wikimedia property pages. The encoding is a contiguous string that uses single characters A through X to encode hours 00 through 23, each followed by an integer count. This would be an example of such a string:

M3R20A131Q27H53
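
Each letter encodes an hour as its zero-based offset from A, so in the string above M is hour 12 and the 3 that follows is that hour's pageview count (a ? in place of a count means the count is missing, as we'll see in the code below). Here's a throwaway sketch of that letter-to-hour arithmetic (plain Java, not part of the UDF project we're about to build):

public class HourDecode {
  public static void main(String[] args) {
    char hourChar = 'M';
    int hour = Character.toUpperCase(hourChar) - 'A'; // 'A' encodes hour 0
    System.out.println(hour); // prints 12, so "M3" means 3 pageviews in hour 12
  }
}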

It’s possible to handle this with some serious SQL machinations, but we’d likely go insane in the process. An alternative is to write a Drill user-defined function (UDF) that handles some of this for us. One big hurdle for us is that UDFs are written in Java. So, you will either need some Java familiarity to get through this recipe or offer up some adult beverages in exchange for some assistance from someone who can tolerate the verbosity and idioms of Java.

An absolute prerequisite for this recipe is a complete review of the Drill manual chapter on user-defined functions. It doesn’t make much sense to duplicate that content here, but we’ll draw on concepts from that chapter in this solution.

Drill UDF “projects” define naming conventions, extra Java library dependencies, UDF incoming parameters (including types) and the UDF return value(s). We’re going to keep it simple for this example and take in a value such as:

M3R20A131Q27H53

and return back a JSON string like:

[{"hr":12,"hr_ct":3},{"hr":17,"hr_ct":20},{"hr":0,"hr_ct":131},{"hr":16,"hr_ct":27},{"hr":7,"hr_ct":53}]

which will enable us to work with standard Drill functions for further transformations.

Start by cloning the example UDF repo from the tutorial and renaming it for this example (we’ll assume you’re in your “development” directory because you keep things organized, right?):

$ git clone https://github.com/tgrall/drill-simple-mask-function.git
$ mv drill-simple-mask-function/ pagecount-hour-count-split

At the top level of that new directory there’s a pom.xml file, which holds all of the metadata about this project. We’re only going to change a few bits of this metadata since we aren’t adding any other dependencies. Change each of these XML tags to look like this:

<artifactId>pagecount-hour-count-split</artifactId>
<version>1.0</version>
<name>Drill Pagecount Hour Count Splitter Function</name>

We also need to modify some configuration information in src/main/resources/drill-module.conf since we are going to have our own class name for our new UDF. Make that file look like:

drill {

  classpath.scanning {
    base.classes : ${?drill.classpath.scanning.base.classes} [
      org.apache.drill.contrib.function.ConvertPageCountFunc
    ],
    packages : ${?drill.classpath.scanning.packages} [
      org.apache.drill.contrib.function
    ]
  }
}

org.apache.drill.contrib.function.ConvertPageCountFunc is going to be our new function, but first we need to rename the file SimpleMaskFunc.java in src/main/java/org/apache/drill/contrib/function/ to ConvertPageCountFunc.java.
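
From the top level of the project that’s just:

$ mv src/main/java/org/apache/drill/contrib/function/SimpleMaskFunc.java \
     src/main/java/org/apache/drill/contrib/function/ConvertPageCountFunc.java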

We’ll edit the file to look like this:

package org.apache.drill.contrib.function;

import io.netty.buffer.DrillBuf;
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.NullableVarCharHolder;
import org.apache.drill.exec.expr.holders.VarCharHolder;

import javax.inject.Inject;

@FunctionTemplate(
  name = "pagecounts_to_json", // THIS IS WHAT WE'LL USE IN Drill QUERIES TO TRANSFORM THE DATA
  scope = FunctionTemplate.FunctionScope.SIMPLE,
  nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
  )
public class ConvertPageCountFunc implements DrillSimpleFunc {

  @Param
  NullableVarCharHolder input; // ONLY ONE INPUT PARAMETER (i.e. THE COLUMN WE'RE PASSING IN)

  @Output
  VarCharHolder out; // ONLY ONE OUTPUT VALUE (i.e. THE TRANSFORMED DATA TO JSON)

  @Inject
  DrillBuf buffer;

  public void setup() {}

  public void eval() {

    // get the pagecount compacted hour-count string
    String inputValue = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(input.start, input.end, input.buffer);

    // this pattern will match all valid hour-count entries (and deal with "?")
    java.util.regex.Pattern hcList = java.util.regex.Pattern.compile("([A-Xa-x][0-9]+|[A-Xa-x]\\?)");
    java.util.regex.Matcher hcListMatcher = hcList.matcher(inputValue);

    // this pattern will match individual components of hour-count entries
    java.util.regex.Pattern hc = java.util.regex.Pattern.compile("([A-Xa-x])([0-9]+|\\?)");
    java.util.regex.Matcher hc_entry; // used in the while loop

    String outputValue = "["; // start of 'json'
    
    // for each hr-ct match:
    //   - extract hr
    //   - extract ct; if ct is ? then ct is null

    while (hcListMatcher.find()) {
      
      hc_entry = hc.matcher(hcListMatcher.group());
      
      hc_entry.find(); // will always succeed since the entry already matched the list pattern
      outputValue = outputValue + "{\"hr\":" + (Character.toUpperCase(hc_entry.group(1).charAt(0)) - 'A') + ",\"hr_ct\":";
      outputValue = outputValue + ("?".equals(hc_entry.group(2)) ? "null" : hc_entry.group(2)) + "},"; // a "?" count becomes JSON null; note equals(), not ==
      
    }
    
    outputValue = outputValue + "]"; // end of json
    outputValue = outputValue.replace(",]", "]"); // pesky trailing ,

    // put the output value in the out buffer, growing the buffer first in case
    // the JSON string is longer than its current capacity
    byte[] outBytes = outputValue.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    buffer = buffer.reallocIfNeeded(outBytes.length);
    out.buffer = buffer;
    out.start = 0;
    out.end = outBytes.length;
    buffer.setBytes(0, outBytes);

  }

}
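
Before we package it up, we can sanity-check the parsing logic outside of Drill with a small throwaway harness (a hypothetical PagecountSplitTest class, not part of the UDF project) that exercises the core hour-count regex and the JSON assembly against our sample value:

public class PagecountSplitTest {
  public static void main(String[] args) {
    String in = "M3R20A131Q27H53";
    java.util.regex.Matcher m =
        java.util.regex.Pattern.compile("([A-Xa-x])([0-9]+|\\?)").matcher(in);
    StringBuilder out = new StringBuilder("[");
    while (m.find()) {
      int hr = Character.toUpperCase(m.group(1).charAt(0)) - 'A'; // 'A' encodes hour 0
      String ct = "?".equals(m.group(2)) ? "null" : m.group(2);   // "?" means a missing count
      out.append("{\"hr\":").append(hr).append(",\"hr_ct\":").append(ct).append("},");
    }
    System.out.println(out.append("]").toString().replace(",]", "]"));
    // expected: [{"hr":12,"hr_ct":3},{"hr":17,"hr_ct":20},{"hr":0,"hr_ct":131},{"hr":16,"hr_ct":27},{"hr":7,"hr_ct":53}]
  }
}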

Even if you’re not super-familiar with Java, the above code should be pretty readable: we take in a string and return a string. We also made things easier on ourselves since there are no dependencies on third-party Java classes that don’t ship with Drill. One thing folks who are familiar with Java will notice is that we fully qualify virtually all of the named classes inside eval(). That’s because of the context in which Drill runs these functions: Drill folds the function body into generated source at run time, where our file’s import statements aren’t visible. It’s a tiny bit of developer inconvenience in exchange for some performance and execution-assurance benefits.
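
To see why that qualification matters, compare these lines (a contrived fragment, assuming Drill’s documented code-generation behavior):

// fails in Drill's generated code, where our file's import statements are not visible:
// Pattern p = Pattern.compile("[A-Xa-x]");

// always works inside eval(): a fully qualified name needs no import
java.util.regex.Pattern p = java.util.regex.Pattern.compile("[A-Xa-x]");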

Once we’re done editing that file, fire up a command prompt and ensure you’re in the UDF top-level directory. Then run the following:

$ mvn clean package # there should be no errors

$ # make our new functions accessible by Drill
$ cp target/pagecount-hour-count-split*.jar /usr/local/drill/jars/3rdparty/ 

$ drillbit.sh restart # restart Drill

Drill should start up without issue. Now we can run something like the following in the Drill query web interface or from a sqlline prompt:

SELECT 
  language_code, 
  project_code, 
  page, 
  daily_total, 
  b.hc.hr AS hr, 
  b.hc.hr_ct AS hr_ct 
FROM (
  SELECT 
    language_code, 
    project_code, 
    page, 
    daily_total, 
    FLATTEN(a.hr_ct) AS hc 
  FROM (
    SELECT 
      SUBSTR(columns[0], 1, STRPOS(columns[0], '.')-1) AS language_code,
      SUBSTR(columns[0], STRPOS(columns[0], '.')+1) AS project_code,
      columns[1] AS page,
      CAST(columns[2] AS DOUBLE) AS daily_total,
      convert_fromJSON(pagecounts_to_json(columns[3])) AS hr_ct
    FROM dfs.wikimedia.`/*.csvw.bz2` 
    LIMIT 10) a
  ) b;
+----------------+---------------+--------------------------------------+--------------+-----+--------+
| language_code  | project_code  |                 page                 | daily_total  | hr  | hr_ct  |
+----------------+---------------+--------------------------------------+--------------+-----+--------+
| Ar             | mw            | Ar                                   | 5.0          | 12  | 3      |
| Ar             | mw            | Ar                                   | 5.0          | 17  | 2      |
| De             | mw            | De                                   | 3.0          | 19  | 3      |
| En             | mw            | En                                   | 3.0          | 4   | 1      |
| En             | mw            | En                                   | 3.0          | 5   | 2      |
| aa             | b             | ?banner=B12_5C_113020_hover_nohover  | 11.0         | 0   | 11     |
| aa             | b             | File:Broom_icon.svg                  | 2.0          | 3   | 1      |
| aa             | b             | File:Broom_icon.svg                  | 2.0          | 23  | 1      |
| aa             | b             | File:Commons-logo.svg                | 1.0          | 23  | 1      |
| aa             | b             | File:Incubator-notext.svg            | 5.0          | 2   | 1      |
| aa             | b             | File:Incubator-notext.svg            | 5.0          | 6   | 1      |
| aa             | b             | File:Incubator-notext.svg            | 5.0          | 21  | 1      |
| aa             | b             | File:Incubator-notext.svg            | 5.0          | 23  | 2      |
| aa             | b             | File:Wikibooks-logo.svg              | 1.0          | 23  | 1      |
| aa             | b             | File:Wikimania.svg                   | 1.0          | 23  | 1      |
| aa             | b             | File:Wikimedia-logo.svg              | 1.0          | 23  | 1      |
+----------------+---------------+--------------------------------------+--------------+-----+--------+
16 rows selected (0.427 seconds) 

Most of the “magic” happens in this part of the innermost SELECT:

convert_fromJSON(pagecounts_to_json(columns[3])) AS hr_ct

You should recognize pagecounts_to_json() as our new UDF! We pass the last column to it for transformation. We also deliberately returned a valid JSON array so we could use FLATTEN() on it, which turns one record into multiple records, one per array element (our sample value, for instance, becomes five records). From there we can use standard Drill idioms for querying the JSON data in the now-“atomic” records.

SQL queries like that one can really make one appreciate the abstractions afforded by “normal” dplyr use in R.

You can find the working UDF source code tree in the extra directory of this book’s GitHub repository.