Skip navigation

Category Archives: SQL

While the future of Bluesky is nowhere near certain, it is most certainly growing. It’s also the largest community of users for the AT Protocol.

Folks are using Bluesky much the same way as any online forum/chat. One of those ways is to share URLs to content.

For the moment, it is possible to eavesdrop on the Bluesky “firehose” sans authentication. I’ve been curious as to what folks are sharing on the platform and decided to do more than poke at it casually in my hacky terminal firehose viewer.

This GitLab project contains all the code necessary to log URLs seen in the firehose to a local SQLite database. As Bluesky grows, this will definitely not scale, but it’s fine for right now, and scaling just means moving the websocket capture client to a more capable environment than my home server and setting up something like a Kafka stream. Might as well move to Postgres while we’re at it.

But, for now, this lightweight script/database is fine.

NOTE: I’m deliberately not tracking any other data, but the code is easy to modify to log whatever you want from the firehose post.

I’m syncing the data to this server every ~30 minutes or so and have created an Observable notebook which keeps track of the most popular domains.

I don’t know what card.syui.ai is (Perplexity had some ideas), but it appears to be some AI-driven “card” game that has AT protocol and ActivityPub integration. Due to the programmatic nature of the posts with URLs containing that domain, I suspect it’ll be in the lead for quite some time.

There are some neat sites in the long tail of the distribution.

I think I’ll set up one to monitor post with CVE’s, soon, too.

There are many ways to gather Twitter data for analysis and many R and Python (et al) libraries make full use of the Twitter API when building a corpus to extract useful metadata for each tweet along with the text of each tweet. However, many corpus archives are minimal and only retain a small portion of the metadata — often just tweet timestamp, the tweet creator and the tweet text — leaving to the analyst the trudging work of re-extracting hashtags, mentions, URLs (etc).

Twitter provides a tweet-text processing library for many languages. One of these languages is Java. Since it make sense to perform at-scale data operations in Apache Drill, it also seemed to make sense that Apache Drill could use a tweet metadata extraction set of user-defined functions (UDFs). Plus, there just aren’t enough examples of Drill UDFs out there. Thus begat drill-twitter-text?.

What’s Inside the Tin?

There are five UDF functions in the package:

  • tw_parse_tweet(string): Parses the tweet text and returns a map column with the following named values:
    • weightedLength: (int) the overall length of the tweet with code points weighted per the ranges defined in the configuration file
    • permillage: (int) indicates the proportion (per thousand) of the weighted length in comparison to the max weighted length. A value > 1000 indicates input text that is longer than the allowable maximum.
    • isValid: (boolean) indicates if input text length corresponds to a valid result.
    • display_start / display_end: (int) indices identifying the inclusive start and exclusive end of the displayable content of the Tweet.
    • valid_start / valid_end: (int) indices identifying the inclusive start and exclusive end of the valid content of the Tweet.
  • tw_extract_hashtags(string): Extracts all hashtags in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_screennames(string): Extracts all screennames in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_urls(string): Extracts all URLs in the tweet text into a list which can be FLATTEN()ed.
  • tw_extract_reply_screenname(): Extracts the reply screenname (if any) from the tweet text into a VARCHAR.

The repo has all the necessary bits and info to help you compile and load the necessary JARs, but those in a hurry can just copy all the files in the target directory to your local jars/3rparty directory and restart Drill.

Usage

Here’s an example of how to call each UDF along with the output:

SELECT 
  tw_extract_screennames(tweetText) AS mentions,
  tw_extract_hashtags(tweetText) AS tags,
  tw_extract_urls(tweetText) AS urls,
  tw_extract_reply_screenname(tweetText) AS reply_to,
  tw_parse_tweet(tweetText) AS tweet_meta
FROM
  (SELECT 
     '@youThere Load data from #Apache Drill to @QlikSense - #Qlik Tuesday Tips and Tricks #ApacheDrill #BigData https://t.co/fkAJokKF5O https://t.co/bxdNCiqdrE' AS tweetText
   FROM (VALUES((1))))

+----------+------+------+----------+------------+
| mentions | tags | urls | reply_to | tweet_meta |
+----------+------+------+----------+------------+
| ["youThere","QlikSense"] | ["Apache","Qlik","ApacheDrill","BigData"] | ["https://t.co/fkAJokKF5O","https://t.co/bxdNCiqdrE"] | youThere | {"weightedLength":154,"permillage":550,"isValid":true,"display_start":0,"display_end":153,"valid_start":0,"valid_end":153} |
+----------+------+------+----------+------------+

FIN

Kick the tyres and file issues and PRs as needed.

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.

If you have enough memory and wanted to work with “flattened” versions of the files in R you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using) and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.

The main reasons for replicating their Yelp example was to both have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then use dplyr verbs to manipulate.

The full tutorial replication is at https://rud.is/rpubs/yelp.html but also iframe’d below.

2021-11-04 UPDATE: Just use {arrow}.


Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone/embedded mode on every modern operating system (i.e. you can get started with or play locally with Drill w/o needing a Hadoop cluster but scale up almost effortlessly). It’s also a bit more lightweight than Spark and a nice alternative to Spark if you only need data wrangling and not the functionality in Spark’s MLlib.

When you’re in this larger-data world, parquet files are one of the core data storage formats. They’re designed to be compact and are optimized for columnar operations. Unlike CSV, JSON files or even R Data files, it’s not necessary to read or scan an entire parquet file to filter, select, aggregate, etc across columns. Unfortunately, parquet files aren’t first-class citizens in R. Well, they aren’t now, but thanks to this project it might not be too difficult to make an R interface to them. But, for now, you have to use some other means to convert or read parquet files.

Spark and sparklyr can help you write parquet files but I don’t need to run Spark all the time.

If you’re already a Drill user, you already know how easy it is to make parquet files with Drill:

CREATE TABLE dfs.tmp.sampleparquet AS 
  (SELECT trans_id, 
   cast(`date` AS date) transdate, 
   cast(`time` AS time) transtime, 
   cast(amount AS double) amountm,
   user_info, marketing_info, trans_info 
   FROM dfs.`/Users/drilluser/sample.json`);

If you’re not used to SQL, that may seem very ugly/foreign/verbose to you and you can thank Hadley for designing a better grammar of tidyness that seamlessly builds SQL queries like that behind the scenes for you. That SQL statement uses a JSON file as a data source (which you can do with Drill) make sure the field data types are correct by explicitly casting them to SQL data types (which is a good habit to get into even if it is verbose) and then tells Drill to make a parquet file (it’s actually a directory of parquet files) from it.

I’ve been working on an R package — sergeant — that provides RJDBC, direct REST and dplyr interfaces to Apache Drill for a while now. There are a number of complexities associated with creating a function to help users make parquet files from R data frames in Drill (which is why said function still does not exist in sergeant):

  • Is Drill installed or does there need to be a helper set of functions for installing and running Drill in embedded mode?
  • Even if there’s a Drill cluster running, does the user — perhaps — want to do the conversion locally in embedded mode? Embedded is way easier since all the files are local. The only real way to convert a data frame to Drill is to save a data frame to a temporary, interim file and them have Drill read it in. In a cluster mode where your local filesystem is not part of the cluster, that would mean finding the right way to get the file to the cluster. Which leads to the next item…
  • Where does the user want the necessary temporary files stored? Local dfs. file system? HDFS?
  • Do we need two different methods? One for quick conversion and one that forces explicit column data type casting?
  • Do we need to support giving the user explicit casting control and column selection capability?
  • Who put the bomp in the bomp, bomp, bomp?

OK, perhaps not that last one (but I think it still remains a mystery despite claims by Jan and Dean).

It’s difficult to wrap something like that up in a simple package that will make 80% of the possible user-base happy (having Drill and Spark operate behind the scenes like “magic” seems like a bad idea to me despite how well sparklyr masks the complexity).

As I continue to work that out (you are encouraged to file an issue with your opines on it at the gh repo) here’s a small R script that you can use it to turn R data frames into parquet files:

library(sergeant)
library(tidyverse)

# make a place to hold our temp files
# this is kinda super destructive. make sure you have the path right
unlink("/tmp/pqtrans", recursive=TRUE, force=TRUE)
dir.create("/tmp/pqtrans", showWarnings=FALSE)

# save off a large-ish tibble
write_csv(nycflights13::flights, "/tmp/pqtrans/flights.csvh")

# connect to drill
db <- src_drill("localhost")

# make the parquet file
dbGetQuery(db$con, "
CREATE TABLE dfs.tmp.`/pqtrans/flights.parquet` AS SELECT * FROM dfs.tmp.`/pqtrans/flights.csvh`
")
## # A tibble: 1 × 2
##   `Number of records written` Fragment
## *                       <int>    <chr>
## 1                      336776      0_0

# prove we did it
list.files("/tmp/pqtrans", recursive=TRUE, include.dirs=TRUE)
## [1] "flights.csvh"                  "flights.parquet"              
## [3] "flights.parquet/0_0_0.parquet"

# prove it again
flights <- tbl(db, "dfs.tmp.`/pqtrans/flights.parquet`")

flights
## Source:   query [?? x 19]
## Database: Drill 1.9.0 [localhost:8047] [8GB direct memory]
## 
##    flight arr_delay distance  year tailnum dep_time sched_dep_time origin
##     <int>     <dbl>    <dbl> <int>   <chr>    <int>          <int>  <chr>
## 1    1545        11     1400  2013  N14228      517            515    EWR
## 2    1714        20     1416  2013  N24211      533            529    LGA
## 3    1141        33     1089  2013  N619AA      542            540    JFK
## 4     725       -18     1576  2013  N804JB      544            545    JFK
## 5     461       -25      762  2013  N668DN      554            600    LGA
## 6    1696        12      719  2013  N39463      554            558    EWR
## 7     507        19     1065  2013  N516JB      555            600    EWR
## 8    5708       -14      229  2013  N829AS      557            600    LGA
## 9      79        -8      944  2013  N593JB      557            600    JFK
## 10    301         8      733  2013  N3ALAA      558            600    LGA
## # ... with more rows, and 11 more variables: sched_arr_time <int>,
## #   dep_delay <dbl>, dest <chr>, minute <dbl>, carrier <chr>, month <int>,
## #   hour <dbl>, arr_time <int>, air_time <dbl>, time_hour <dttm>,
## #   day <int>

# work with the drill parquet file
count(flights, year, origin) %>%
  collect()
## Source: local data frame [3 x 3]
## Groups: year [1]
## 
##    year origin      n
## * <int>  <chr>  <int>
## 1  2013    EWR 120835
## 2  2013    LGA 104662
## 3  2013    JFK 111279

That snippet:

  • assumes Drill is running, which is really as easy as entering drill-embedded at a shell prompt, but try out Drill in 10 Minutes if you don’t believe me
  • dfs.tmp points to /tmp (i.e. you need to modify that if yours doesn’t…see, I told you this wasn’t simple)
  • assumes we’re OK with letting Drill figure out column types
  • assumes we want ALL THE COLUMNS
  • uses the .csvh extension which tells Drill to read the column names from the first line so we don’t have to create the schema from scratch
  • is slow because of ↑ due to the need to create the csvh file first
  • exploits the fact that we can give dplyr the cold shoulder and talk directly to Drill anytime we feel like it with DBI calls by using the $con list field (the dbGetQuery(db$con, …) line).

It’s a naive and destructive snippet, but does provide a means to get your data frames into parquet and into Drill.

Most of my Drill parquet needs are converting ~20-100K JSON files a day into parquet, which is why I haven’t focused on making a nice interface for this particular use case (data frame to parquet) in R. Ultimately, I’ll likely go the “wrap parquet-cpp route” (unless you’re working on that, which — if you are — you should @-ref me in that gh-repo of yours so I can help out). But, if having a sergeant function to do this conversion would help you, drop an issue in the repo.

I recently mentioned that I’ve been working on a development version of an Apache Drill R package called sergeant. Here’s a lifted “TLDR” on Drill:

Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
Drill’s datastore-aware optimizer automatically restructures a query plan to leverage the datastore’s internal processing capabilities. In addition, Drill supports data locality, so it’s a good idea to co-locate Drill and the datastore on the same nodes.

It also supports reading formats such as:

  • Avro
  • [CTP]SV ([C]omma-, [T]ab-, [P]ipe-Separated-Values)
  • Parquet
  • Hadoop Sequence Files

It’s a bit like Spark in that you can run it on a single workstation and scale up to a YUGE cluster. It lacks the ML components of Spark, but it connects to everything without the need to define a schema up front. Said “everything” includes parquet files on local filesystems, so if you need to slice through GBs of parquet data and have a beefy enough Linux workstation (I believe Drill runs on Windows and know it runs on macOS fine, too, but that’s $$$ for a bucket of memory & disk space) you can take advantage of the optimized processing power Drill offers on a single system (while also joining the data with any other data format you can think of). You can also seamlessly move the data to a cluster and barely tweak your code to support said capacity expansion.

Why sergeant?

There’s already an R package on CRAN to work with Drill: DrillR. It’s S4 class-based, has a decent implementation and interfaces with the REST API. However, it sticks httr::verbose() everywhere: https://github.com/cran/DrillR/search?utf8=%E2%9C%93&q=verbose.

The sergeant package interfaces with the REST API as well, but also works with the JDBC driver (the dev version includes the driver with the package, but this will be removed for the eventual CRAN submission) and includes some other niceties around Drill options viewing and setting and some other non-SQL bits. Of note: the REST API version shows an httr progress bar for data downloading and you can wrap the calls with httr::with_verbose(…) if you really like seeing cURL messages.

The other thing sergeant has going for it is a nascent dplyr interface. Presently, this is a hack-ish wrapper around the RJDBC JDBCConnection presented by the Drill JDBC driver. While basic functionality works, I firmly believe Drill needs it’s own DBI driver (like is second-cousin Preso has) to avoid collisions withy any other JDBC connections you might have open, plus more work needs to be done under the covers to deal with quoting properly and exposing more Drill built-in SQL functions.

SQL vs dplyr

For some truly complex data machinations you’re going to want to work at the SQL level and I think it’s important to know SQL if you’re ever going to do data work outside JSON & CSV files just to appreciate how much gnashing of teeth dplyr saves you from. Using SQL for many light-to-medium aggregation tasks that feed data to R can feel like you’re banging rocks together to make fire when you could just be using your R precision welder. What would you rather write:

SELECT  gender ,  marital_status , COUNT(*) AS  n 
FROM  cp.`employee.json` 
GROUP BY  gender ,  marital_status

in a drill-embedded or drill-localhost SQL shell? Or:

library(RJDBC)
library(dplyr)
library(sergeant)

ds <- src_drill("localhost:31010", use_zk=FALSE)

db <- tbl(ds, "cp.`employee.json`") 

count(db, gender, marital_status) %>% collect()

(NOTE: that SQL statement is what ultimately gets sent to Drill from dplyr)

Now, dplyr tbl_df idioms don’t translate 1:1 to all other src_es, but they are much easier on the eyes and more instructive in analysis code (and, I fully admit that said statement is more opinion than fact).

sergeant and dplyr

The src_drill() function uses the JDBC Drill driver and, hence, has an RJDBC dependency. The Presto folks (a “competing” offering to Drill) wrapped a DBI interface around their REST API to facilitate the use of dplyr idioms. I’m not sold on whether I’ll continue with a lightweight DBI wrapper using RJDBC or go the RPresto route, but for now the basic functionality works and changing the back-end implementation should not break anything (much).

You’ve said “parquet” alot…

Yes. Yes, I have. Parquet is a “big data” compressed columnar storage format that is generally used in Hadoop shops. Parquet is different from ‘feather’ (‘feather’ is based on another Apache foundation project: Arrow). Arrow/feather is great for things that fit in memory. Parquet and the idioms that sit on top of it enable having large amounts data available in a cluster for processing with Hadoop / Spark / Drill / Presto (etc). Parquet is great for storing all kinds of data, including log and event data which I have to work with quite a bit and it’s great being able to prototype on a single workstation then move code to hit a production cluster. Plus, it’s super-easy to, say, convert an entire, nested directory tree of daily JSON log files into parquet with Drill:

CREATE TABLE dfs.destination.`source/2016/12/2016_12_source_event_logs.parquet` AS
  SELECT src_ip, dst_ip, src_port, dst_port, event_message, ts 
  FROM dfs.source.`/log/dir/root/2016/12/*/event_log.json`;

Kick the tyres

The REST and JDBC functions are solid (I’ve been using them at work for a while) and the dplyr support has handled some preliminary production work well (though, remember, it’s not fully-baked). There are plenty of examples — including a dplyr::left_join() between parquet and JSON data — in the README and all the exposed functions have documentation.

File an issue with a feature request or bug report.

I expect to have this CRAN-able in January, 2017.