New Apache Drill UDF for Processing Twitter Tweet Text

There are many ways to gather Twitter data for analysis and many R and Python (et al) libraries make full use of the Twitter API when building a corpus to extract useful metadata for each tweet along with the text of each tweet. However, many corpus archives are minimal and only retain a small portion of the metadata — often just tweet timestamp, the tweet creator and the tweet text — leaving to the analyst the trudging work of re-extracting hashtags, mentions, URLs (etc).

Twitter provides a tweet-text processing library for many languages. One of these languages is Java. Since it make sense to perform at-scale data operations in Apache Drill, it also seemed to make sense that Apache Drill could use a tweet metadata extraction set of user-defined functions (UDFs). Plus, there just aren’t enough examples of Drill UDFs out there. Thus begat drill-twitter-text?.

What’s Inside the Tin?

There are five UDF functions in the package:

tw_parse_tweet(string): Parses the tweet text and returns a map column with the following named values:
- weightedLength: (int) the overall length of the tweet with code points weighted per the ranges defined in the configuration file
- permillage: (int) indicates the proportion (per thousand) of the weighted length in comparison to the max weighted length. A value > 1000 indicates input text that is longer than the allowable maximum.
- isValid: (boolean) indicates if input text length corresponds to a valid result.
- display_start / display_end: (int) indices identifying the inclusive start and exclusive end of the displayable content of the Tweet.
- valid_start / valid_end: (int) indices identifying the inclusive start and exclusive end of the valid content of the Tweet.
tw_extract_hashtags(string): Extracts all hashtags in the tweet text into a list which can be FLATTEN()ed.
tw_extract_screennames(string): Extracts all screennames in the tweet text into a list which can be FLATTEN()ed.
tw_extract_urls(string): Extracts all URLs in the tweet text into a list which can be FLATTEN()ed.
tw_extract_reply_screenname(): Extracts the reply screenname (if any) from the tweet text into a VARCHAR.

The repo has all the necessary bits and info to help you compile and load the necessary JARs, but those in a hurry can just copy all the files in the target directory to your local jars/3rparty directory and restart Drill.

Usage

Here’s an example of how to call each UDF along with the output:

SELECT 
  tw_extract_screennames(tweetText) AS mentions,
  tw_extract_hashtags(tweetText) AS tags,
  tw_extract_urls(tweetText) AS urls,
  tw_extract_reply_screenname(tweetText) AS reply_to,
  tw_parse_tweet(tweetText) AS tweet_meta
FROM
  (SELECT 
     '@youThere Load data from #Apache Drill to @QlikSense - #Qlik Tuesday Tips and Tricks #ApacheDrill #BigData https://t.co/fkAJokKF5O https://t.co/bxdNCiqdrE' AS tweetText
   FROM (VALUES((1))))

+----------+------+------+----------+------------+
| mentions | tags | urls | reply_to | tweet_meta |
+----------+------+------+----------+------------+
| ["youThere","QlikSense"] | ["Apache","Qlik","ApacheDrill","BigData"] | ["https://t.co/fkAJokKF5O","https://t.co/bxdNCiqdrE"] | youThere | {"weightedLength":154,"permillage":550,"isValid":true,"display_start":0,"display_end":153,"valid_start":0,"valid_end":153} |
+----------+------+------+----------+------------+

FIN

Kick the tyres and file issues and PRs as needed.

rud.is