Replicating the Apache Drill 'Yelp' Academic Dataset Analysis with sergeant

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.

If you have enough memory and want to work with “flattened” versions of the files in R, you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one, corpus::read_ndjson, is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check the query profiles after the fact to see how much each worker component ended up using). I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.

The main reasons for replicating their Yelp example were to have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then manipulate with dplyr verbs.

The full tutorial replication is at https://rud.is/rpubs/yelp.html.
