Continuing the blog’s UDF theme of late, there are two new UDF kids in town: drill-url-tools, for slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post), and drill-domain-tools, for slicing & dicing internet domain names (IDNs). Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you… Continue reading
Post Category → Apache Drill
New Apache Drill UDF for Processing Twitter Tweet Text
There are many ways to gather Twitter data for analysis and many R and Python (et al) libraries make full use of the Twitter API when building a corpus to extract useful metadata for each tweet along with the text of each tweet. However, many corpus archives are minimal and only retain a small portion… Continue reading
Painless ODBC + dplyr Connections to Amazon Athena and Apache Drill with R & odbc
I spent some time this morning upgrading the JDBC driver (and changing up some supporting code to account for changes to it) for my metis package, which connects R up to Amazon Athena via RJDBC. I’m used to JDBC and have to deal with Java separately from R so I’m also comfortable with Java, JDBC… Continue reading
R⁶ Series — Random Sampling From Apache Drill Tables With R & sergeant
(For first-timers, R⁶ tagged posts are short & sweet with minimal exposition; R⁶ feed) At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets w/o working across billions of rows. I also use Apache Drill for much of my exploratory work. Here’s how to uniformly sample… Continue reading
Increasing Output Buffer Size in Apache Drill UDFs Custom (Simple) Functions
Putting this here to make it easier for others who try to Google this topic to find it w/o having to find and tediously search through other UDFs (user-defined functions). I was/am making a custom UDF for base64 decoding/encoding and ran into: It’s incredibly easy to “fix” (and, if my Java weren’t so rusty I’d… Continue reading
Reading PCAP Files with Apache Drill and the sergeant R Package
It’s no secret that I’m a fan of Apache Drill. One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hive, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the same SQL syntax. This also… Continue reading
Ten-HUT! The Apache Drill R interface package — sergeant — is now on CRAN
I’m extremely pleased to announce that the sergeant package is now on CRAN or will be hitting your local CRAN mirror soon. sergeant provides JDBC, DBI and dplyr/dbplyr interfaces to Apache Drill. I’ve also wrapped a few goodies into the dplyr custom functions that work with Drill and if you have Drill UDFs that don’t… Continue reading
Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant
The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested… Continue reading