There are many ways to gather Twitter data for analysis, and many R and Python (et al.) libraries make full use of the Twitter API when building a corpus, extracting useful metadata for each tweet along with its text. However, many corpus archives are minimal and retain only a small portion of the metadata (often just the tweet timestamp, the tweet creator and the tweet text), leaving the analyst the trudging work of re-extracting hashtags, mentions, URLs, etc.
Twitter provides a tweet-text processing library for many languages, one of which is Java. Since it makes sense to perform at-scale data operations in Apache Drill, it also seemed to make sense that Apache Drill could use a set of user-defined functions (UDFs) for tweet metadata extraction. Plus, there just aren’t enough examples of Drill UDFs out there. Thus begat drill-twitter-text.
What’s Inside the Tin?
There are five UDFs in the package:
tw_parse_tweet(string)
: Parses the tweet text and returns a map column with the following named values:
    weightedLength
    : (int) the overall length of the tweet with code points weighted per the ranges defined in the configuration file.
    permillage
    : (int) indicates the proportion (per thousand) of the weighted length in comparison to the max weighted length. A value > 1000 indicates input text that is longer than the allowable maximum.
    isValid
    : (boolean) indicates if input text length corresponds to a valid result.
    display_start/display_end
    : (int) indices identifying the inclusive start and exclusive end of the displayable content of the Tweet.
    valid_start/valid_end
    : (int) indices identifying the inclusive start and exclusive end of the valid content of the Tweet.
tw_extract_hashtags(string)
: Extracts all hashtags in the tweet text into a list which can be FLATTEN()ed.
tw_extract_screennames(string)
: Extracts all screennames in the tweet text into a list which can be FLATTEN()ed.
tw_extract_urls(string)
: Extracts all URLs in the tweet text into a list which can be FLATTEN()ed.
tw_extract_reply_screenname(string)
: Extracts the reply screenname (if any) from the tweet text into a VARCHAR.
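To make the relationship between weightedLength and permillage concrete, here is a minimal Python sketch of the arithmetic described above. It is not the library's code: the maximum of 280 is an assumption (it is consistent with the sample output in the Usage section, where a weighted length of 154 yields a permillage of 550), the use of integer division is likewise assumed, and the function names are hypothetical.

```python
# Assumption: the active configuration allows a maximum weighted length of 280.
MAX_WEIGHTED_LENGTH = 280


def permillage(weighted_length: int, max_weighted_length: int = MAX_WEIGHTED_LENGTH) -> int:
    """Weighted length expressed in parts per thousand of the maximum.

    Integer division is an assumption made for this sketch.
    """
    return weighted_length * 1000 // max_weighted_length


def exceeds_limit(weighted_length: int, max_weighted_length: int = MAX_WEIGHTED_LENGTH) -> bool:
    """A permillage above 1000 means the text is longer than the allowable maximum."""
    return permillage(weighted_length, max_weighted_length) > 1000


if __name__ == "__main__":
    print(permillage(154))     # weighted length taken from the sample tweet
    print(exceeds_limit(154))
    print(exceeds_limit(300))
```

A permillage of exactly 1000 corresponds to a tweet that just fills the limit; anything above that is over the limit.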
The repo has all the necessary bits and info to help you compile and load the necessary JARs, but those in a hurry can just copy all the files in the target directory to their local jars/3rdparty directory and restart Drill.
Usage
Here’s an example of how to call each UDF along with the output:
SELECT
  tw_extract_screennames(tweetText) AS mentions,
  tw_extract_hashtags(tweetText) AS tags,
  tw_extract_urls(tweetText) AS urls,
  tw_extract_reply_screenname(tweetText) AS reply_to,
  tw_parse_tweet(tweetText) AS tweet_meta
FROM
  (SELECT
     '@youThere Load data from #Apache Drill to @QlikSense - #Qlik Tuesday Tips and Tricks #ApacheDrill #BigData https://t.co/fkAJokKF5O https://t.co/bxdNCiqdrE' AS tweetText
   FROM (VALUES((1))))
+----------+------+------+----------+------------+
| mentions | tags | urls | reply_to | tweet_meta |
+----------+------+------+----------+------------+
| ["youThere","QlikSense"] | ["Apache","Qlik","ApacheDrill","BigData"] | ["https://t.co/fkAJokKF5O","https://t.co/bxdNCiqdrE"] | youThere | {"weightedLength":154,"permillage":550,"isValid":true,"display_start":0,"display_end":153,"valid_start":0,"valid_end":153} |
+----------+------+------+----------+------------+
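The list columns in the output above (mentions, tags, urls) are the ones that can be FLATTEN()ed. As a rough illustration of what flattening does to such a column, the Python sketch below turns each list element into its own row while repeating the other columns. It only mimics the idea and does not call Drill; the flatten helper and the trimmed-down row are hypothetical.

```python
def flatten(rows, column):
    """Yield one output row per element of the list stored in `column`.

    The other columns are copied unchanged into every output row.
    In this sketch, a row whose list is empty yields no output rows.
    """
    for row in rows:
        for item in row[column]:
            flat = dict(row)      # shallow copy so the input row is untouched
            flat[column] = item   # replace the list with a single element
            yield flat


# One result row, reduced to two of the columns from the sample output.
rows = [
    {"reply_to": "youThere", "tags": ["Apache", "Qlik", "ApacheDrill", "BigData"]},
]

flat_rows = list(flatten(rows, "tags"))

if __name__ == "__main__":
    for r in flat_rows:
        print(r)
```

Having one hashtag per row makes the usual follow-up steps, such as grouping and counting by tag, straightforward.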