

{"id":6046,"date":"2017-05-31T10:12:33","date_gmt":"2017-05-31T15:12:33","guid":{"rendered":"https:\/\/rud.is\/b\/?p=6046"},"modified":"2018-10-05T11:02:40","modified_gmt":"2018-10-05T16:02:40","slug":"drilling-into-csvs-teaser-trailer","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2017\/05\/31\/drilling-into-csvs-teaser-trailer\/","title":{"rendered":"Drilling Into CSVs \u2014 Teaser Trailer"},"content":{"rendered":"<p>I used reading a directory of CSVs as the foundational example in my recent post on <a href=\"https:\/\/rud.is\/b\/2017\/05\/23\/r%e2%81%b6-idiomatic-for-the-people\/\">idioms<\/a>.<\/p>\n<p>During my exchange with Matt, Hadley and a few others &#8212; in the crazy Twitter thread that spawned said post &#8212; I mentioned that I&#8217;d personally <em>&#8220;just use <a href=\"https:\/\/drill.apache.org\/\">Drill<\/a>\u201d<\/em>.<\/p>\n<p>I&#8217;ll use this post as a bit of a teaser trailer for the actual post (or, more likely, series of posts) that goes into detail on where to get Apache Drill, basic setup of Drill for standalone workstation use and then organizing data with it.<\/p>\n<p>You can get ahead of those posts by doing two things:<\/p>\n<ol>\n<li><a href=\"https:\/\/drill.apache.org\/docs\/drill-in-10-minutes\/\">Download, install and test your Apache Drill setup<\/a> (it&#8217;s literally 10 minutes on any platform)<\/li>\n<li>Review the U.S. 
EPA <a href=\"https:\/\/aqsdr1.epa.gov\/aqsweb\/aqstmp\/airdata\/download_files.html#Annual\">annual air quality data archive<\/a> (they have individual, annual CSVs that are perfect for the example)<\/li>\n<\/ol>\n<p>My goals for this post are really just to pique your interest enough in Drill and parquet files (yes, I&#8217;m ultimately trying to socially engineer you into using parquet files) to convince you to read the future post(s) and show that it&#8217;s worth your time to do Step #1 above.<\/p>\n<h3>Getting EPA Air Quality Data<\/h3>\n<p>The EPA has air quality data going back to 1990 (so, 27 files as of this post). They&#8217;re ~1-4MB ZIP compressed and ~10-30MB uncompressed.<\/p>\n<p>You can use the following code to grab them all, with the caveat that the <code>libcurl<\/code> method of performing simultaneous downloads caused some pretty severe issues &#8212; like R crashing &#8212; for some of my students who use Windows. There are plenty of examples of doing sequential downloads of a list of URLs out there, so folks should be able to get all the files even if this succinct method does not work on your platform.<\/p>\n<pre id=\"drill-teaser-01\"><code class=\"language-r\">dir.create(&quot;airq&quot;)\r\n\r\nurls &lt;- sprintf(&quot;https:\/\/aqsdr1.epa.gov\/aqsweb\/aqstmp\/airdata\/annual_all_%d.zip&quot;, 1990L:2016L)\r\nfils &lt;- sprintf(&quot;airq\/%s&quot;, basename(urls))\r\n\r\ndownload.file(urls, fils, method = &quot;libcurl&quot;)<\/code><\/pre>\n<p>I normally shy away from this particular method since it really hammers the remote server, but this is a beefy U.S. government server, the files are relatively small in number and size, and I&#8217;ve got a super-fast internet connection (no long-lived sockets), so it should be fine.<\/p>\n<p>Putting all those files under the &#8220;control&#8221; of Drill is what the next post is for. 
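<\/p>\n<p>If the simultaneous <code>libcurl<\/code> route misbehaves on your platform, a plain sequential loop will also do the job. This is only a minimal sketch of the fallback approach mentioned above (the one-second pause between requests is my own politeness, not a requirement):<\/p>\n<pre id=\"drill-teaser-01a\"><code class=\"language-r\">dir.create(&quot;airq&quot;, showWarnings = FALSE)\r\n\r\nurls &lt;- sprintf(&quot;https:\/\/aqsdr1.epa.gov\/aqsweb\/aqstmp\/airdata\/annual_all_%d.zip&quot;, 1990L:2016L)\r\nfils &lt;- sprintf(&quot;airq\/%s&quot;, basename(urls))\r\n\r\n# download one file at a time, skipping any that are already on disk\r\nfor (i in seq_along(urls)) {\r\n  if (!file.exists(fils[i])) {\r\n    download.file(urls[i], fils[i])\r\n    Sys.sleep(1)\r\n  }\r\n}<\/code><\/pre>\n<p>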
For now, I&#8217;m going to show the basic code and benchmarks for reading in all those files and performing a basic query for all the distinct years. Yes, we know that information already, but it&#8217;s a nice, compact task that&#8217;s easy to walk through and illustrates the file reading and querying in all three idioms: Drill, tidyverse and data.table.<\/p>\n<h3>Data Setup<\/h3>\n<p>I&#8217;ve converted the EPA annual ZIP files into <code>bzip2<\/code> format. ZIP is fine for storage and downloads but it&#8217;s not a great format for data analysis tasks. <code>gzip<\/code> would be slightly faster but it&#8217;s not <a href=\"http:\/\/comphadoop.weebly.com\/\">easily splittable<\/a> and &#8212; even though I&#8217;m not using the data in a Hadoop context &#8212; I think it&#8217;s wiser to not have to re-process data later on if I ever have to move raw CSV or JSON data into Hadoop. Uncompressed CSVs are the most portable, but there&#8217;s no need to waste space.<\/p>\n<p>All the following files are in a regular filesystem directory accessible to both Drill and R:<\/p>\n<pre id=\"drill-teaser-02\"><code class=\"language-r\">&gt; (epa_annual_fils &lt;- dir(&quot;~\/Data\/csv\/epa\/annual&quot;, &quot;*.csv.bz2&quot;))\r\n [1] &quot;annual_all_1990.csv.bz2&quot; &quot;annual_all_1991.csv.bz2&quot; &quot;annual_all_1992.csv.bz2&quot;\r\n [4] &quot;annual_all_1993.csv.bz2&quot; &quot;annual_all_1994.csv.bz2&quot; &quot;annual_all_1995.csv.bz2&quot;\r\n [7] &quot;annual_all_1996.csv.bz2&quot; &quot;annual_all_1997.csv.bz2&quot; &quot;annual_all_1998.csv.bz2&quot;\r\n[10] &quot;annual_all_1999.csv.bz2&quot; &quot;annual_all_2000.csv.bz2&quot; &quot;annual_all_2001.csv.bz2&quot;\r\n[13] &quot;annual_all_2002.csv.bz2&quot; &quot;annual_all_2003.csv.bz2&quot; &quot;annual_all_2004.csv.bz2&quot;\r\n[16] &quot;annual_all_2005.csv.bz2&quot; &quot;annual_all_2006.csv.bz2&quot; &quot;annual_all_2007.csv.bz2&quot;\r\n[19] &quot;annual_all_2008.csv.bz2&quot; 
&quot;annual_all_2009.csv.bz2&quot; &quot;annual_all_2010.csv.bz2&quot;\r\n[22] &quot;annual_all_2011.csv.bz2&quot; &quot;annual_all_2012.csv.bz2&quot; &quot;annual_all_2013.csv.bz2&quot;\r\n[25] &quot;annual_all_2014.csv.bz2&quot; &quot;annual_all_2015.csv.bz2&quot; &quot;annual_all_2016.csv.bz2&quot;<\/code><\/pre>\n<p>Drill can directly read plain or compressed JSON, CSV and Apache web server log files, plus it can treat a directory tree of them as a single data source. It can also read parquet &amp; avro files (both are used frequently in distributed &#8220;big data&#8221; setups) and access MySQL, MongoDB and other JDBC resources, as well as query data stored in Amazon S3 and HDFS (I&#8217;ve already mentioned it works fine in plain ol&#8217; filesystems, too).<\/p>\n<p>I&#8217;ve tweaked my Drill configuration to support reading column header info from <code>.csv<\/code> files (which I&#8217;ll show in the next post). In environments like Drill or even Spark, CSV columns are usually queried with some type of column index (e.g. <code>COLUMN[0]<\/code>), so having named columns makes for less verbose query code.<\/p>\n<p>I turned those individual <code>bzip2<\/code> files into parquet format with one Drill query:<\/p>\n<pre id=\"drill-teaser-03\"><code class=\"language-sql\">CREATE TABLE dfs.pq.`\/epa\/annual.parquet` AS \r\n  SELECT * FROM dfs.csv.`\/epa\/annual\/*.csv.bz2`<\/code><\/pre>\n<p>Future posts will explain the <code>dfs...<\/code> component, but these are likely familiar path specifications for folks used to Spark and are pretty straightforward. The first bit (up to the back-tick) is an internal Drill shortcut to the actual storage path (which is a plain directory in this test), followed by the tail-end path spec to the subdirectories and\/or target files. 
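<\/p>\n<p>For the impatient: those <code>dfs<\/code> workspaces live in Drill&#8217;s storage plugin configuration (editable from the Drill web console). The following is only a sketch of what such a config can look like &#8212; the locations are example paths, not my actual setup &#8212; with <code>extractHeader<\/code> being the option that enables the column-header reading mentioned above:<\/p>\n<pre id=\"drill-teaser-03a\"><code class=\"language-json\">{\r\n  &quot;type&quot;: &quot;file&quot;,\r\n  &quot;enabled&quot;: true,\r\n  &quot;connection&quot;: &quot;file:\/\/\/&quot;,\r\n  &quot;workspaces&quot;: {\r\n    &quot;csv&quot;: { &quot;location&quot;: &quot;\/data\/csv&quot;, &quot;writable&quot;: false },\r\n    &quot;pq&quot;: { &quot;location&quot;: &quot;\/data\/parquet&quot;, &quot;writable&quot;: true, &quot;defaultInputFormat&quot;: &quot;parquet&quot; }\r\n  },\r\n  &quot;formats&quot;: {\r\n    &quot;csv&quot;: { &quot;type&quot;: &quot;text&quot;, &quot;extensions&quot;: [&quot;csv&quot;], &quot;delimiter&quot;: &quot;,&quot;, &quot;extractHeader&quot;: true }\r\n  }\r\n}<\/code><\/pre>\n<p>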
That one statement said &#8220;take all the CSV files in that directory and make one big table out of them&#8221;.<\/p>\n<p>The nice thing about parquet files is that they work much like R data frames in that they can be processed on the column level. We&#8217;ll see how that speeds things up in a bit.<\/p>\n<h3>Benchmark Setup<\/h3>\n<p>The tests were performed on a maxed out 2016 13&#8243; MacBook Pro.<\/p>\n<p>There are 55 columns of data in the EPA annual summary files.<\/p>\n<p>To give both <code>read_csv<\/code> and <code>fread<\/code> some benchmark boosts, we&#8217;ll define the columns up-front and pass those in to each function on data ingestion. I&#8217;ll leave the definitions out of this post for brevity (they&#8217;re just a <code>cols()<\/code> specification and a <code>colClasses<\/code> vector). Drill gets no similar help, at least when it comes to CSV processing.<\/p>\n<p>I&#8217;m also disabling progress &amp; verbose reporting in both <code>fread<\/code> and <code>read_csv<\/code>, though I&#8217;m not stopping Drill from writing out log messages.<\/p>\n<p>Now, we need some setup code to connect to Drill and read in the list of files, plus we&#8217;ll set up the five benchmark functions to read in all the files and get the list of distinct years from each.<\/p>\n<pre id=\"drill-teaser-04\"><code class=\"language-r\">library(sergeant)\r\nlibrary(data.table)\r\nlibrary(tidyverse)\r\nlibrary(microbenchmark)\r\n\r\n(epa_annual_fils &lt;- dir(&quot;~\/Data\/csv\/epa\/annual&quot;, &quot;*.csv.bz2&quot;, full.names = TRUE))\r\n\r\ndb &lt;- src_drill(&quot;localhost&quot;)\r\n\r\n# Remember: the definitions of ct &amp; ct_dt - the column type specifications - have been left out for brevity\r\n\r\nmb_drill_csv &lt;- function() {\r\n  epa_annual &lt;- tbl(db, &quot;dfs.csv.`\/epa\/annual\/*.csv.bz2`&quot;)\r\n  select(epa_annual, Year) %&gt;% \r\n    distinct(Year) %&gt;% \r\n    collect()\r\n}\r\n\r\nmb_drill_parquet &lt;- function() {\r\n  epa_annual_pq &lt;- tbl(db, 
&quot;dfs.pq.`\/epa\/annual.parquet`&quot;)\r\n  select(epa_annual_pq, Year) %&gt;% \r\n    distinct(Year) %&gt;% \r\n    collect()\r\n}\r\n\r\nmb_tidyverse &lt;- function() {\r\n  map_df(epa_annual_fils, read_csv, col_types = ct, progress = FALSE) -&gt; tmp\r\n  unique(tmp$Year)\r\n}\r\n\r\nmb_datatable &lt;- function() {\r\n  rbindlist(\r\n    lapply(\r\n      epa_annual_fils, function(x) { \r\n        fread(sprintf(&quot;bzip2 -c -d %s&quot;, x), \r\n              colClasses = ct_dt, showProgress = FALSE, \r\n              verbose = FALSE) })) -&gt; tmp\r\n  unique(tmp$Year)\r\n}\r\n\r\nmb_rda &lt;- function() {\r\n  read_rds(&quot;~\/Data\/rds\/epa\/annual.rds&quot;) -&gt; tmp\r\n  unique(tmp$Year)\r\n}\r\n\r\nmicrobenchmark(\r\n  csv = { mb_drill_csv()     },\r\n   pq = { mb_drill_parquet() },\r\n   df = { mb_tidyverse()     },\r\n   dt = { mb_datatable()     },\r\n  rda = { mb_rda()           },\r\n  times = 5\r\n) -&gt; mb<\/code><\/pre>\n<p>Yep, it&#8217;s really as simple as:<\/p>\n<pre id=\"drill-teaser-06\"><code class=\"language-r\">tbl(db, &quot;dfs.csv.`\/epa\/annual\/*.csv.bz2`&quot;)<\/code><\/pre>\n<p>to have Drill treat a directory tree as a single table. It&#8217;s also not necessary for all the columns to be in all the files (i.e. you get the <code>bind_rows\/map_df\/rbindlist<\/code> behaviour for &#8220;free&#8221;).<\/p>\n<p>I&#8217;m only doing 5 evaluations here since I don&#8217;t want to make you wait if you&#8217;re going to try this at home now or after the Drill series. 
I&#8217;ve run it with a more robust benchmark configuration and the results are aligned with this one.<\/p>\n<pre id=\"drill-teaser-05\"><code class=\"language-text\">Unit: milliseconds\r\n expr        min         lq       mean     median         uq        max neval\r\n  csv 15473.5576 16851.0985 18445.3905 19586.1893 20087.1620 20228.9450     5\r\n   pq   493.7779   513.3704   616.2634   550.5374   732.6553   790.9759     5\r\n   df 41666.1929 42361.1423 42701.2682 42661.9521 43110.3041 43706.7498     5\r\n   dt 37500.9351 40286.2837 41509.0078 42600.9916 43105.3040 44051.5247     5\r\n  rda  9466.6506  9551.7312 10012.8560  9562.9114  9881.8351 11601.1517     5<\/code><\/pre>\n<p>The R data route, which is the closest to the parquet route, is definitely better than slurping up CSVs all the time. Both parquet and R data files require pre-processing, so they&#8217;re not as flexible as having individual CSVs (that may get added hourly or daily to a directory).<\/p>\n<p>Drill&#8217;s CSV slurping handily beats the other R methods, even though it carried some handicaps the others did not have.<\/p>\n<p>This particular example is gamed a bit, which helped parquet to ultimately &#8220;win&#8221;. Since Drill can target the single column (<code>Year<\/code>) that was asked for, it doesn&#8217;t need to read all the extra columns just to compute the final product (the distinct list of years).<\/p>\n<p>IMO both the Drill CSV ingestion and Drill parquet access provide compelling enough use cases to prefer them over the other three methods, especially since they are easily transferable to remote Drill servers or clusters with virtually no code changes. 
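<\/p>\n<p>&#8220;Virtually no code changes&#8221; is close to literal: with <code>sergeant<\/code>, only the connection line needs to change (the hostname below is a made-up example):<\/p>\n<pre id=\"drill-teaser-07\"><code class=\"language-r\">db &lt;- src_drill(&quot;drill.example.com&quot;)  # instead of src_drill(&quot;localhost&quot;)\r\n\r\ntbl(db, &quot;dfs.csv.`\/epa\/annual\/*.csv.bz2`&quot;) %&gt;% \r\n  distinct(Year) %&gt;% \r\n  collect()<\/code><\/pre>\n<p>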
A single-node Drillbit (like R) is constrained by the memory on that individual system, so it&#8217;s not going to get you out of a memory jam, but it may make it easier to organize and streamline your core data operations before other analysis and visualization tasks.<\/p>\n<h3>FIN<\/h3>\n<p>I&#8217;m sure some member of some other tribe will come up with an example that proves the superiority of their particular tribal computations. I&#8217;m hoping one of those tribes is the R\/Spark tribe, so that Spark can get added into the mix (using Spark standalone is much like using Drill, but with more stats\/ML functions directly available).<\/p>\n<p>I&#8217;m hopeful that this post has showcased enough of Drill&#8217;s utility to general R users that you&#8217;ll give it a go and consider adding it to your R data analysis toolbox. It can be beneficial to keep both precision tools and a Swiss Army knife &#8212; which is what Drill really is &#8212; handy.<\/p>\n<p>You can find the <code>sergeant<\/code> package <a href=\"https:\/\/github.com\/hrbrmstr\/sergeant\">on GitHub<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I used reading a directory of CSVs as the foundational example in my recent post on idioms. During my exchange with Matt, Hadley and a few others &#8212; in the crazy Twitter thread that spawned said post &#8212; I mentioned that I&#8217;d personally &#8220;just use Drill\u201d. 
I&#8217;ll use this post as a bit of a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","categories":[819,781,91],"tags":[810]}