There are many ways to gather Twitter data for analysis and many R and Python (et al) libraries make full use of the Twitter API when building a corpus to extract useful metadata for each tweet along with the text of each tweet. However, many corpus archives are minimal and only retain a small portion of the metadata — often just tweet timestamp, the tweet creator and the tweet text — leaving to the analyst the trudging work of re-extracting hashtags, mentions, URLs (etc).
Twitter provides a tweet-text processing library for many languages. One of these languages is Java. Since it make sense to perform at-scale data operations in Apache Drill, it also seemed to make sense that Apache Drill could use a tweet metadata extraction set of user-defined functions (UDFs). Plus, there just aren’t enough examples of Drill UDFs out there. Thus begat drill-twitter-text?.
What’s Inside the Tin?
There are five UDF functions in the package:
tw_parse_tweet(string): Parses the tweet text and returns a map column with the following named values:weightedLength: (int) the overall length of the tweet with code points weighted per the ranges defined in the configuration filepermillage: (int) indicates the proportion (per thousand) of the weighted length in comparison to the max weighted length. A value > 1000 indicates input text that is longer than the allowable maximum.isValid: (boolean) indicates if input text length corresponds to a valid result.display_start/display_end: (int) indices identifying the inclusive start and exclusive end of the displayable content of the Tweet.valid_start/valid_end: (int) indices identifying the inclusive start and exclusive end of the valid content of the Tweet.
tw_extract_hashtags(string): Extracts all hashtags in the tweet text into a list which can beFLATTEN()ed.tw_extract_screennames(string): Extracts all screennames in the tweet text into a list which can beFLATTEN()ed.tw_extract_urls(string): Extracts all URLs in the tweet text into a list which can beFLATTEN()ed.tw_extract_reply_screenname(): Extracts the reply screenname (if any) from the tweet text into aVARCHAR.
The repo has all the necessary bits and info to help you compile and load the necessary JARs, but those in a hurry can just copy all the files in the target directory to your local jars/3rparty directory and restart Drill.
Usage
Here’s an example of how to call each UDF along with the output:
SELECT
tw_extract_screennames(tweetText) AS mentions,
tw_extract_hashtags(tweetText) AS tags,
tw_extract_urls(tweetText) AS urls,
tw_extract_reply_screenname(tweetText) AS reply_to,
tw_parse_tweet(tweetText) AS tweet_meta
FROM
(SELECT
'@youThere Load data from #Apache Drill to @QlikSense - #Qlik Tuesday Tips and Tricks #ApacheDrill #BigData https://t.co/fkAJokKF5O https://t.co/bxdNCiqdrE' AS tweetText
FROM (VALUES((1))))
+----------+------+------+----------+------------+
| mentions | tags | urls | reply_to | tweet_meta |
+----------+------+------+----------+------------+
| ["youThere","QlikSense"] | ["Apache","Qlik","ApacheDrill","BigData"] | ["https://t.co/fkAJokKF5O","https://t.co/bxdNCiqdrE"] | youThere | {"weightedLength":154,"permillage":550,"isValid":true,"display_start":0,"display_end":153,"valid_start":0,"valid_end":153} |
+----------+------+------+----------+------------+