Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.

If you have enough memory and want to work with “flattened” versions of the files in R, you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one, corpus::read_ndjson, is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check the query profiles after the fact to see how much each worker component ended up using). I’m only mentioning that “R can do this w/o Drill” to stave off comments to that effect.

The main reasons for replicating their Yelp example were to have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter is also to show how to use SQL calls to create a tbl that you can then manipulate with dplyr verbs.

The full tutorial replication is at https://rud.is/rpubs/yelp.html.
