While the future of Bluesky is nowhere near certain, it is most certainly growing. It’s also the largest community of users for the AT Protocol. Folks are using Bluesky much the same way as any online forum/chat. One of those ways is to share URLs to content. For the moment, it is possible to eavesdrop…
New Apache Drill UDF for Processing Twitter Tweet Text
There are many ways to gather Twitter data for analysis, and many R and Python (et al.) libraries make full use of the Twitter API when building a corpus, extracting useful metadata for each tweet along with its text. However, many corpus archives are minimal and only retain a small portion…
Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant
The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested…
Create Parquet Files From R Data Frames With sergeant & Apache Drill (a.k.a. Make Parquet Files Great Again in R)
2021-11-04 UPDATE: Just use {arrow}. Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone/embedded mode on every modern operating system (i.e. you can get started with or play locally with Drill w/o needing a…
sergeant : An R Boot Camp for Apache Drill
I recently mentioned that I’ve been working on a development version of an Apache Drill R package called sergeant. Here’s a lifted “TLDR” on Drill: Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A…