

{"id":4929,"date":"2017-01-22T09:53:35","date_gmt":"2017-01-22T14:53:35","guid":{"rendered":"https:\/\/rud.is\/b\/?p=4929"},"modified":"2021-12-04T11:20:28","modified_gmt":"2021-12-04T16:20:28","slug":"create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/","title":{"rendered":"Create Parquet Files From R Data Frames With sergeant &#038; Apache Drill (a.k.a. Make Parquet Files Great Again in R)"},"content":{"rendered":"<p>2021-11-04 UPDATE: Just use {arrow}.<\/p>\n<hr \/>\n<p><a href=\"https:\/\/drill.apache.org\/\">Apache Drill<\/a> is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in <a href=\"http:\/\/drill.apache.org\/docs\/embedded-mode-prerequisites\/\">standalone\/embedded mode<\/a> on every modern operating system (i.e. you can get started with or play locally with Drill w\/o needing a Hadoop cluster but scale up almost effortlessly). It&#8217;s also a bit more lightweight than Spark and a nice alternative to Spark if you only need data wrangling and not the functionality in Spark&#8217;s MLlib.<\/p>\n<p>When you&#8217;re in this larger-data world, <a href=\"https:\/\/parquet.apache.org\/\">parquet files<\/a> are one of the core data storage formats. They&#8217;re designed to be compact and are optimized for columnar operations. Unlike CSV, JSON files or even R Data files, it&#8217;s not necessary to read or scan an entire parquet file to filter, select, aggregate, etc across columns. Unfortunately, parquet files aren&#8217;t first-class citizens in R. Well, they aren&#8217;t <em>now<\/em>, but thanks to <a href=\"https:\/\/github.com\/apache\/parquet-cpp\">this project<\/a> it might not be too difficult to make an R interface to them. 
But, for now, you have to use some other means to convert or read parquet files.<\/p>\n<p>Spark and <code>sparklyr<\/code> can <a href=\"https:\/\/rdrr.io\/cran\/sparklyr\/man\/spark_write_parquet.html\">help you write parquet files<\/a> but I don&#8217;t need to run Spark all the time.<\/p>\n<p>If you&#8217;re already a Drill user, you already know how easy it is to make parquet files with Drill:<\/p>\n<pre id=\"parquet-sql-01\"><code class=\"language-sql\">CREATE TABLE dfs.tmp.sampleparquet AS \r\n  (SELECT trans_id, \r\n   cast(`date` AS date) transdate, \r\n   cast(`time` AS time) transtime, \r\n   cast(amount AS double) amountm,\r\n   user_info, marketing_info, trans_info \r\n   FROM dfs.`\/Users\/drilluser\/sample.json`);<\/code><\/pre>\n<p>If you&#8217;re not used to SQL, that may seem very ugly\/foreign\/verbose to you, and you can thank Hadley for designing a better grammar of tidyness that seamlessly builds SQL queries like that behind the scenes for you. That SQL statement uses a JSON file as a data source (which you can do with Drill), makes sure the field data types are correct by explicitly casting them to SQL data types (which is a good habit to get into even if it is verbose), and then tells Drill to make a parquet file (it&#8217;s actually a directory of parquet files) from it.<\/p>\n<p>For a while now, I&#8217;ve been working on an R package \u2014 <a href=\"https:\/\/github.com\/hrbrmstr\/sergeant\"><code>sergeant<\/code><\/a> \u2014 that provides RJDBC, direct REST and <code>dplyr<\/code> interfaces to Apache Drill. 
There are a number of complexities associated with creating a function to help users make parquet files from R data frames in Drill (which is why said function still does not exist in <code>sergeant<\/code>):<\/p>\n<ul>\n<li>Is Drill installed or does there need to be a helper set of functions for installing and running Drill in embedded mode?<\/li>\n<li>Even if there&#8217;s a Drill cluster running, does the user \u2014\u00a0perhaps \u2014 want to do the conversion locally in embedded mode? Embedded is way easier since all the files are local. The only real way to get a data frame into Drill is to save the data frame to a temporary, interim file and then have Drill read it in. In cluster mode, where your local filesystem is not part of the cluster, that would mean finding the right way to get the file to the cluster. Which leads to the next item&hellip;<\/li>\n<li>Where does the user want the necessary temporary files stored? Local <code>dfs.<\/code> file system? HDFS?<\/li>\n<li>Do we need two different methods? 
One for quick conversion and one that forces explicit column data type casting?<\/li>\n<li>Do we need to support giving the user explicit casting control and column selection capability?<\/li>\n<li><em>Who put the bomp in the bomp, bomp, bomp<\/em>?<\/li>\n<\/ul>\n<p>OK, perhaps not that last one (but I think it remains a mystery despite claims by <a href=\"https:\/\/www.youtube.com\/watch?v=BwSD_xSzwMI\">Jan and Dean<\/a>).<\/p>\n<p>It&#8217;s difficult to wrap something like that up in a simple package that will make 80% of the possible user-base happy (having Drill and Spark operate behind the scenes like &#8220;magic&#8221; seems like a bad idea to me despite how well <code>sparklyr<\/code> masks the complexity).<\/p>\n<p>As I continue to work that out (you are encouraged to file an issue with your opines on it at the gh repo), here&#8217;s a small R script that you can use to turn R data frames into parquet files:<\/p>\n<pre id=\"parquet-drill-01\"><code class=\"language-r\">library(sergeant)\r\nlibrary(tidyverse)\r\n\r\n# make a place to hold our temp files\r\n# this is kinda super destructive. 
make sure you have the path right\r\nunlink(&quot;\/tmp\/pqtrans&quot;, recursive=TRUE, force=TRUE)\r\ndir.create(&quot;\/tmp\/pqtrans&quot;, showWarnings=FALSE)\r\n\r\n# save off a large-ish tibble\r\nwrite_csv(nycflights13::flights, &quot;\/tmp\/pqtrans\/flights.csvh&quot;)\r\n\r\n# connect to drill\r\ndb &lt;- src_drill(&quot;localhost&quot;)\r\n\r\n# make the parquet file\r\ndbGetQuery(db$con, &quot;\r\nCREATE TABLE dfs.tmp.`\/pqtrans\/flights.parquet` AS SELECT * FROM dfs.tmp.`\/pqtrans\/flights.csvh`\r\n&quot;)\r\n## # A tibble: 1 \u00d7 2\r\n##   `Number of records written` Fragment\r\n## *                       &lt;int&gt;    &lt;chr&gt;\r\n## 1                      336776      0_0\r\n\r\n# prove we did it\r\nlist.files(&quot;\/tmp\/pqtrans&quot;, recursive=TRUE, include.dirs=TRUE)\r\n## [1] &quot;flights.csvh&quot;                  &quot;flights.parquet&quot;              \r\n## [3] &quot;flights.parquet\/0_0_0.parquet&quot;\r\n\r\n# prove it again\r\nflights &lt;- tbl(db, &quot;dfs.tmp.`\/pqtrans\/flights.parquet`&quot;)\r\n\r\nflights\r\n## Source:   query [?? 
x 19]\r\n## Database: Drill 1.9.0 [localhost:8047] [8GB direct memory]\r\n## \r\n##    flight arr_delay distance  year tailnum dep_time sched_dep_time origin\r\n##     &lt;int&gt;     &lt;dbl&gt;    &lt;dbl&gt; &lt;int&gt;   &lt;chr&gt;    &lt;int&gt;          &lt;int&gt;  &lt;chr&gt;\r\n## 1    1545        11     1400  2013  N14228      517            515    EWR\r\n## 2    1714        20     1416  2013  N24211      533            529    LGA\r\n## 3    1141        33     1089  2013  N619AA      542            540    JFK\r\n## 4     725       -18     1576  2013  N804JB      544            545    JFK\r\n## 5     461       -25      762  2013  N668DN      554            600    LGA\r\n## 6    1696        12      719  2013  N39463      554            558    EWR\r\n## 7     507        19     1065  2013  N516JB      555            600    EWR\r\n## 8    5708       -14      229  2013  N829AS      557            600    LGA\r\n## 9      79        -8      944  2013  N593JB      557            600    JFK\r\n## 10    301         8      733  2013  N3ALAA      558            600    LGA\r\n## # ... 
with more rows, and 11 more variables: sched_arr_time &lt;int&gt;,\r\n## #   dep_delay &lt;dbl&gt;, dest &lt;chr&gt;, minute &lt;dbl&gt;, carrier &lt;chr&gt;, month &lt;int&gt;,\r\n## #   hour &lt;dbl&gt;, arr_time &lt;int&gt;, air_time &lt;dbl&gt;, time_hour &lt;dttm&gt;,\r\n## #   day &lt;int&gt;\r\n\r\n# work with the drill parquet file\r\ncount(flights, year, origin) %&gt;%\r\n  collect()\r\n## Source: local data frame [3 x 3]\r\n## Groups: year [1]\r\n## \r\n##    year origin      n\r\n## * &lt;int&gt;  &lt;chr&gt;  &lt;int&gt;\r\n## 1  2013    EWR 120835\r\n## 2  2013    LGA 104662\r\n## 3  2013    JFK 111279<\/code><\/pre>\n<p>That snippet:<\/p>\n<ul>\n<li>assumes Drill is running, which is really as easy as entering <code>drill-embedded<\/code> at a shell prompt, but try out <a href=\"https:\/\/drill.apache.org\/docs\/drill-in-10-minutes\/\">Drill in 10 Minutes<\/a> if you don&#8217;t believe me<\/li>\n<li><code>dfs.tmp<\/code> points to <code>\/tmp<\/code> (i.e. you need to modify that if yours doesn&#8217;t&hellip;see, I told you this wasn&#8217;t simple)<\/li>\n<li>assumes we&#8217;re OK with letting Drill figure out column types<\/li>\n<li>assumes we want ALL THE COLUMNS<\/li>\n<li>uses the <code>.csvh<\/code> extension which tells Drill to read the column names from the first line so we don&#8217;t have to create the schema from scratch<\/li>\n<li>is slow because of \u2191 due to the need to create the <code>csvh<\/code> file first<\/li>\n<li>exploits the fact that we can give <code>dplyr<\/code> the cold shoulder and talk directly to Drill anytime we feel like it with DBI calls by using the <code>$con<\/code> list field (the <code>dbGetQuery(db$con, \u2026)<\/code> line).<\/li>\n<\/ul>\n<p>It&#8217;s a naive and destructive snippet, but does provide a means to get your data frames into parquet and into Drill.<\/p>\n<p>Most of my Drill parquet needs are converting ~20-100K JSON files a day into parquet, which is why I haven&#8217;t focused on making a 
nice interface for this particular use case (data frame to parquet) in R. Ultimately, I&#8217;ll likely go the &#8220;wrap <code>parquet-cpp<\/code> route&#8221; (unless <em>you&#8217;re<\/em> working on that, which \u2014 if you are \u2014 you should @-ref me in that gh-repo of yours so I can help out). But, if having a <code>sergeant<\/code> function to do this conversion would help you, drop an issue <a href=\"https:\/\/github.com\/hrbrmstr\/sergeant\">in the repo<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>2021-11-04 UPDATE: Just use {arrow}. Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone\/embedded mode on every modern operating system (i.e. you can get started with or play locally with Drill w\/o needing a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[819,781,91,778],"tags":[810],"class_list":["post-4929","post","type-post","status-publish","format-standard","hentry","category-apache-drill","category-drill","category-r","category-sql","tag-post"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.2 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Create Parquet Files From R Data Frames With sergeant &amp; Apache Drill (a.k.a. 
Make Parquet Files Great Again in R) - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Create Parquet Files From R Data Frames With sergeant &amp; Apache Drill (a.k.a. Make Parquet Files Great Again in R) - rud.is\" \/>\n<meta property=\"og:description\" content=\"2021-11-04 UPDATE: Just use {arrow}. Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone\/embedded mode on every modern operating system (i.e. you can get started with or play locally with Drill w\/o needing a [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2017-01-22T14:53:35+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2021-12-04T16:20:28+00:00\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Create Parquet Files From R Data Frames With sergeant &#038; Apache Drill (a.k.a. Make Parquet Files Great Again in R)\",\"datePublished\":\"2017-01-22T14:53:35+00:00\",\"dateModified\":\"2021-12-04T16:20:28+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/\"},\"wordCount\":958,\"commentCount\":5,\"publisher\":{\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"keywords\":[\"post\"],\"articleSection\":[\"Apache Drill\",\"drill\",\"R\",\"SQL\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/\",\"url\":\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/\",\"name\":\"Create Parquet Files From R Data Frames With sergeant & Apache 
Drill (a.k.a. Make Parquet Files Great Again in R) - rud.is\",\"isPartOf\":{\"@id\":\"https:\/\/rud.is\/b\/#website\"},\"datePublished\":\"2017-01-22T14:53:35+00:00\",\"dateModified\":\"2021-12-04T16:20:28+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/rud.is\/b\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Create Parquet Files From R Data Frames With sergeant &#038; Apache Drill (a.k.a. Make Parquet Files Great Again in R)\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/rud.is\/b\/#website\",\"url\":\"https:\/\/rud.is\/b\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. 
All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/rud.is\/b\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\/\/rud.is\"],\"url\":\"https:\/\/rud.is\/b\/author\/hrbrmstr\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Create Parquet Files From R Data Frames With sergeant & Apache Drill (a.k.a. 
Make Parquet Files Great Again in R) - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/","og_locale":"en_US","og_type":"article","og_title":"Create Parquet Files From R Data Frames With sergeant & Apache Drill (a.k.a. Make Parquet Files Great Again in R) - rud.is","og_description":"2021-11-04 UPDATE: Just use {arrow}. Apache Drill is a nice tool to have in the toolbox as it provides a SQL front-end to a wide array of database and file back-ends and runs in standalone\/embedded mode on every modern operating system (i.e. you can get started with or play locally with Drill w\/o needing a [&hellip;]","og_url":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/","og_site_name":"rud.is","article_published_time":"2017-01-22T14:53:35+00:00","article_modified_time":"2021-12-04T16:20:28+00:00","author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Create Parquet Files From R Data Frames With sergeant &#038; Apache Drill (a.k.a. 
Make Parquet Files Great Again in R)","datePublished":"2017-01-22T14:53:35+00:00","dateModified":"2021-12-04T16:20:28+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/"},"wordCount":958,"commentCount":5,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"keywords":["post"],"articleSection":["Apache Drill","drill","R","SQL"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/","url":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/","name":"Create Parquet Files From R Data Frames With sergeant & Apache Drill (a.k.a. 
Make Parquet Files Great Again in R) - rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"datePublished":"2017-01-22T14:53:35+00:00","dateModified":"2021-12-04T16:20:28+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2017\/01\/22\/create-parquet-files-from-r-data-frames-with-sergeant-apache-drill-a-k-a-make-parquet-files-great-again-in-r\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Create Parquet Files From R Data Frames With sergeant &#038; Apache Drill (a.k.a. Make Parquet Files Great Again in R)"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p23idr-1hv","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":4753,"url":"https:\/\/rud.is\/b\/2016\/12\/20\/sergeant-a-r-boot-camp-for-apache-drill\/","url_meta":{"origin":4929,"position":0},"title":"sergeant : An R Boot Camp for Apache Drill","author":"hrbrmstr","date":"2016-12-20","format":false,"excerpt":"I recently mentioned that I've been working on a development version of an Apache Drill R package called sergeant. 
Here's a lifted \"TLDR\" on Drill: Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift,\u2026","rel":"","context":"In &quot;Apache Drill&quot;","block_context":{"text":"Apache Drill","link":"https:\/\/rud.is\/b\/category\/apache-drill\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6046,"url":"https:\/\/rud.is\/b\/2017\/05\/31\/drilling-into-csvs-teaser-trailer\/","url_meta":{"origin":4929,"position":1},"title":"Drilling Into CSVs \u2014 Teaser Trailer","author":"hrbrmstr","date":"2017-05-31","format":false,"excerpt":"I used reading a directory of CSVs as the foundational example in my recent post on idioms. During my exchange with Matt, Hadley and a few others -- in the crazy Twitter thread that spawned said post -- I mentioned that I'd personally \"just use Drill\u201d. I'll use this post\u2026","rel":"","context":"In &quot;Apache Drill&quot;","block_context":{"text":"Apache Drill","link":"https:\/\/rud.is\/b\/category\/apache-drill\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6146,"url":"https:\/\/rud.is\/b\/2017\/08\/01\/r%e2%81%b6-reticulating-parquet-files\/","url_meta":{"origin":4929,"position":2},"title":"R\u2076 \u2014 Reticulating Parquet Files","author":"hrbrmstr","date":"2017-08-01","format":false,"excerpt":"The reticulate package provides a very clean & concise interface bridge between R and Python which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). 
This post shows how to use reticulate to create\u2026","rel":"","context":"In &quot;data wrangling&quot;","block_context":{"text":"data wrangling","link":"https:\/\/rud.is\/b\/category\/data-wrangling\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6127,"url":"https:\/\/rud.is\/b\/2017\/07\/27\/reading-pcap-files-with-apache-drill-and-the-sergeant-r-package\/","url_meta":{"origin":4929,"position":3},"title":"Reading PCAP Files with Apache Drill and the sergeant R Package","author":"hrbrmstr","date":"2017-07-27","format":false,"excerpt":"It's no secret that I'm a fan of Apache Drill. One big strength of the platform is that it normalizes the access to diverse data sources down to ANSI SQL calls, which means that I can pull data from parquet, Hie, HBase, Kudu, CSV, JSON, MongoDB and MariaDB with the\u2026","rel":"","context":"In &quot;Apache Drill&quot;","block_context":{"text":"Apache Drill","link":"https:\/\/rud.is\/b\/category\/apache-drill\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6091,"url":"https:\/\/rud.is\/b\/2017\/06\/17\/replicating-the-apache-drill-yelp-academic-dataset-with-sergeant\/","url_meta":{"origin":4929,"position":4},"title":"Replicating the Apache Drill &#8216;Yelp&#8217; Academic Dataset Analysis with sergeant","author":"hrbrmstr","date":"2017-06-17","format":false,"excerpt":"The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. 
It's a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it's a great example of how\u2026","rel":"","context":"In &quot;Apache Drill&quot;","block_context":{"text":"Apache Drill","link":"https:\/\/rud.is\/b\/category\/apache-drill\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11369,"url":"https:\/\/rud.is\/b\/2018\/08\/11\/connecting-apache-zeppelin-and-apache-drill-postgresql-etc\/","url_meta":{"origin":4929,"position":5},"title":"Connecting Apache Zeppelin and Apache Drill, PostgreSQL, etc.","author":"hrbrmstr","date":"2018-08-11","format":false,"excerpt":"A previous post showed how to use a different authentication provider to wire up Apache Zeppelin and Amazon Athena. As noted in that post, Zeppelin is a \"notebook\" alternative to Jupyter (and other) notebooks. Unlike Jupyter, I can tolerate Zeppelin and it's got some nifty features like plug-and-play JDBC access.\u2026","rel":"","context":"In &quot;Apache Drill&quot;","block_context":{"text":"Apache Drill","link":"https:\/\/rud.is\/b\/category\/apache-drill\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/08\/z-drill-2.png?fit=1200%2C542&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/08\/z-drill-2.png?fit=1200%2C542&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/08\/z-drill-2.png?fit=1200%2C542&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/08\/z-drill-2.png?fit=1200%2C542&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/08\/z-drill-2.png?fit=1200%2C542&ssl=1&resize=1050%2C600 
3x"},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/4929","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=4929"}],"version-history":[{"count":0,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/4929\/revisions"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=4929"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=4929"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=4929"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}