rud.is

Replicating the Apache Drill ‘Yelp’ Academic Dataset Analysis with sergeant

The Apache Drill folks have a nice walk-through tutorial on how to analyze the Yelp Academic Dataset with Drill. It’s a bit out of date (the current Yelp data set structure is different enough that the tutorial will error out at various points), but it’s a great example of how to work with large, nested JSON files as a SQL data source. By ‘large’ I mean around 4GB of JSON data spread across 5 files.

If you have enough memory and wanted to work with “flattened” versions of the files in R you could use my ndjson package (there are other JSON “flattener” packages as well, and a new one — corpus::read_ndjson — is even faster than mine, but it fails to read this file). Drill doesn’t necessarily load the entire JSON structure into memory (you can check out the query profiles after the fact to see how much each worker component ended up using) and I’m only mentioning that “R can do this w/o Drill” to stave off some of those types of comments.

The main reasons for replicating their Yelp example were to both have a more robust test suite for sergeant (it’s hitting CRAN soon now that dplyr 0.7.0 is out) and to show some Drill SQL to R conversions. Part of the latter reason is also to show how to use SQL calls to create a tbl that you can then use dplyr verbs to manipulate.
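For a feel of the "SQL call creates a tbl, dplyr verbs do the rest" pattern without a running Drill instance, here is a minimal sketch. It swaps in an in-memory SQLite database via DBI/RSQLite (with dbplyr installed) purely as a stand-in for Drill, and the tiny `business` table and its values are made up for illustration. It is not the code from the tutorial replication.

```r
library(DBI)
library(dplyr)

# Stand-in for Drill: an in-memory SQLite database (an assumption for this sketch)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical miniature of a "business" table
business <- data.frame(
  state = c("NV", "NV", "AZ", "AZ", "PA"),
  stars = c(4.5, 3.0, 4.0, 5.0, 2.5),
  review_count = c(120L, 8L, 45L, 300L, 15L),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "business", business)

# The SQL statement defines the tbl; nothing is pulled into R yet
busy <- tbl(con, sql("SELECT state, stars FROM business WHERE review_count > 10"))

# dplyr verbs are translated to SQL and run by the database until collect()
by_state <- busy %>%
  group_by(state) %>%
  summarise(avg_stars = mean(stars, na.rm = TRUE)) %>%
  arrange(state) %>%
  collect()

dbDisconnect(con)
```

With sergeant the connection object would come from a Drill source instead, but the shape of the pipeline should look much the same.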

The full tutorial replication is at https://rud.is/rpubs/yelp.html but also iframe’d below.

Keeping Users Safe While Collecting Data

2017-06-13 – 15:48

I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap.

The project goal is noble: crowdsource and make a repository of open speech data for researchers to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words.

As I meandered over the blog post I looked in horror on the URL for the application that did the recording: https://open-speech-commands.appspot.com/.

Why would the goal of the project combined with that URL give pause? Read on!

You’ve Got Scams!

Picking up the phone and saying something as simple as ‘Yes’ has been a major scam this year. Once attackers have a recording of your voice, they can replay it at phone prompts. Because it really is your voice, the evidence is harder to refute, and the recording can foil recognition systems that listen for your actual voice.

As the chart above shows, the Better Business Bureau has logged over 5,000 of these scams this year (searching for ‘phishing’ and ‘yes’). You can play with the data (a bit — the package needs work) in R with scamtracker.

Now, these are “analog” attacks (i.e. a human spends time socially engineering a human). Bookmark this as you peruse section 2.

Integrity Challenges in 2017

I “trust” Pete’s intentions, but I sure don’t trust open-speech-commands.appspot.com (and, you shouldn’t either). Why? Go visit https://totally-harmless-app.appspot.com. It’s a Google App Engine app I made for this post. Anyone can make an appspot app and the https is meaningless as far as integrity & authenticity goes since I’m running on Google’s infrastructure but I’m not Google.

You can’t really trust most SSL/TLS sessions as far as site integrity goes anyway. Let’s Encrypt put the final nail in the coffin with their Certs Gone Wild! initiative. With super-recent browser updates you can almost trust your eyes again when it comes to URLs, but you should be very wary of entering your info — especially uploading voice, prints or eye/face images — into any input box on any site if you aren’t 100% sure it’s a legit site that you trust.

Tracking the Trackers

If you don’t know that you’re being tracked 100% of the time on the internet then you really need to read up on the modern internet.

In many cases your IP address can directly identify you. In most cases your device & browser profile — which most commercial sites log — can directly identify you. So, just visiting a web site means that it’s highly likely that web site can know that you are both not a dog and are in fact you.

Still Waiting for the “So, What?”

Many states and municipalities have engaged in awareness campaigns to warn citizens about the “Say ‘Yes'” scam. Asking someone to record themselves saying ‘Yes’ into a random web site pretty much negates that advice.

Folks like me regularly warn about trust on the internet. I could have cloned the functionality of the original site to open-speech-commmands.appspot.com. Did you even catch the 3rd ‘m’ there? Even without that, it’s an appspot.com domain. Anyone can set one up.

Even if the site doesn’t ask for your name or other info and just asks for your ‘Yes’, it can know who you are. In fact, when you’re enabling the microphone to do the recording, it could even take a picture of you if it wanted to (and you’d likely not know or not object since it’s for SCIENCE!).

So, in the worst case scenario a malicious entity could be asking you for your ‘Yes’, tying it right to you and then executing the post-scam attacks that were being performed in the analog version.

But, go so far as to assume this is a legit site with good intentions. Do you really know what’s being logged when you commit your voice info? If the data was mishandled, it would be just as easy to tie the voice files back to you (assuming a certain level of data logging).

The “so what” is not really a warning to users but a message to researchers: You need to threat model your experiments and research initiatives, especially when innocent end users are potentially being put at risk. Data is the new gold, diamonds and other precious bits that attackers are after. You may think you’re not putting folks at risk and aren’t even a hacker target, but how you design data gathering can reinforce good or bad behaviour on the part of users. It can solidify solid security messages or tear them down. And, you and your data may be more of a target than you really know.

Reach out to interdisciplinary colleagues to help threat model your data collection, storage and dissemination methods to ensure you aren’t putting yourself or others at risk.

FIN

Pete did the right thing:

and, I’m sure the site will be on a “proper” domain soon. When it is, I’ll be one of the first in line to help make a much-needed open data set for research purposes.

Engaging the tidyverse Clean Slate Protocol

I caught the 0.7.0 release of dplyr on my home CRAN server early Friday morning and immediately set out to install it since I’m eager to finish up my sergeant package and get it on CRAN. “Tidyverse” upgrades aren’t trivial for me as I tinker quite a bit with the tidyverse and create packages that depend on various components. The sergeant package provides — amongst other things — a dplyr back-end for Apache Drill, so it has more tidyverse tendrils than other bits of code I maintain.

macOS binaries weren’t available yet (it generally takes 24-48 hrs for that) so I did an install.packages("dplyr", type="source") and was immediately hit with gcc 7 compilation errors. This seemed odd, but switching back to clang worked fine.

I, then, proceeded to run chunks in an Rmd I’m working on and hit “Encoding” errors on mutate() calls. Not having time to debug further I reverted to 0.5.0 of dplyr and went about my day and promised the tidyverse maintainers that I’d work on a reproducible example after work.

I made R data files from the data frames that were tossing errors and extracted & tweaked a code snippet that consistently generated the error and created a rocker container on one of my linux boxes to validate that this was an error and a cross-platform one. The rocker container used a full fresh-from-source copy of the tidyverse including dplyr 0.7.0. The code worked and no error was generated, so I immediately suspected package rot on my main dev macOS box.

Now, my situation is complicated by an insanely hasty migration to macOS 10.13β1 (I refuse to use the Apple macOS catchy names anymore since the most recent one is just silly) and a move to the gcc 7 toolchain (initially prompted to both get rJava working nicely and reproduce some CRAN noted errors with some packages). Further complications were also created by many invocations of install_github() of various packages regularly overwriting bits of the tidyverse over the past few weeks since the R 3.4.0 release. In other words, the integrity of the “tidyverse” was in serious question on my system and it was time for the Clean Slate Protocol.

Rather than itemize package versions and surgically nipping and tucking, I opted to use packrat to get to my desired end-state of a full-integrity tidyverse install. There are many ways to do this. Feel free to “one-up” me and show your l33t method in the comments. This one will likely be accessible to most — if not all — R users.

I started a new RStudio project in a new session and told it to use packrat. In the new project console, I did install.packages("tidyverse", type="source") and let it go for many minutes. I, then, navigated to the packrat subdirectory where the 3.4 package binaries are housed (just follow the project packrat tree down to the R version directory) and moved all 51 packages (yes, 51 O_o) to the main R library path (which you can figure out by running .libPaths() in any non-packrat-maintained project).
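If you would rather script that last step than drag folders around, the following is a minimal base R sketch. It copies rather than moves, and `copy_packages()` plus both directory names are hypothetical. In the real scenario `from` would be the packrat library directory for your R version and `to` the first entry of .libPaths(). The demonstration only touches throwaway directories under tempdir().

```r
# Copy every package directory found in `from` into library `to`.
# Both paths are placeholders supplied by the caller.
copy_packages <- function(from, to) {
  pkgs <- list.dirs(from, full.names = TRUE, recursive = FALSE)
  # keep only directories that look like installed packages
  pkgs <- pkgs[file.exists(file.path(pkgs, "DESCRIPTION"))]
  ok <- vapply(pkgs, function(p) file.copy(p, to, recursive = TRUE), logical(1))
  setNames(ok, basename(pkgs))
}

# Demonstration on throwaway directories (nothing real is modified)
from <- file.path(tempdir(), "packrat-lib")
to   <- file.path(tempdir(), "main-lib")
dir.create(file.path(from, "fakepkg"), recursive = TRUE)
dir.create(to)
writeLines("Package: fakepkg", file.path(from, "fakepkg", "DESCRIPTION"))

copied <- copy_packages(from, to)
```

The named logical vector that comes back makes it easy to spot any package that failed to copy.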

After doing that, I fired up the originally failing Rmd and everything worked fine.

I don’t do the Clean Slate Protocol too often (we all get to for new R dot-releases) but it came in handy this time. If you run into errors when trying to get the new dplyr working, you may benefit from the Clean Slate Protocol as well.

If you haven’t seen the changes in 0.6.0/0.7.0 you should check them out and give it a go.

R⁶ — Scraping Images To PDFs

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that contain all the Radio Shack impending store closings. It’s the most comprehensive list I’ve found, but the format is terrible and there’s no easy, in-browser way to download them all.

CNN has ToS that prevent automated data gathering from CNN-proper. But, they used Adobe Document Cloud for these images which has no similar restrictions from a quick glance at their ToS. That means you get an R⁶ post on how to grab the individual 38 images and combine them into one PDF. I did this all with the hopes of OCRing the text, which has not panned out too well since the image quality and font was likely deliberately set to make it hard to do precisely what I’m trying to do.

If you work through the example, you’ll get a feel for:

using sprintf() to take a template and build a vector of URLs
using dplyr progress bars
customizing httr verb options to ensure you can get to content
using purrr to iterate through a process of turning raw image bytes into image content (via magick) and turning a list of images into a PDF

library(httr)
library(magick)
library(tidyverse)

url_template <- "https://assets.documentcloud.org/documents/1657793/pages/radioshack-convert-p%s-large.gif"

pb <- progress_estimated(38)

sprintf(url_template, 1:38) %>% 
  map(~{
    pb$tick()$print()
    GET(url = .x, 
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8", 
          referer = "http://money.cnn.com/interactive/technology/radio-shack-closure-list/index.html", 
          authority = "assets.documentcloud.org"))    
  }) -> store_list_pages

map(store_list_pages, content) %>% 
  map(image_read) %>% 
  reduce(image_join) %>% 
  image_write("combined_pages.pdf", format = "pdf")

I figured out the Document Cloud links and necessary httr::GET() options by using Chrome Developer Tools and my curlconverter package.
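The sprintf() step in the snippet above leans on the fact that sprintf() is vectorised over its arguments. A tiny base R sketch of just that piece, using a made-up example.com template rather than the Document Cloud one:

```r
# Hypothetical template; %s is filled in once per element of the vector
url_template <- "https://example.com/pages/doc-p%s-large.gif"
urls <- sprintf(url_template, 1:3)
urls

# If a site zero-pads its page numbers, a width specifier handles that
padded <- sprintf("page-%03d.gif", 1:3)
padded
```

One template and one integer vector give you the whole list of targets, which is what makes the map() call that follows so compact.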

If any academic-y folks have a ~~test subject~~summer intern with a free hour and would be willing to have them transcribe this list and stick it on GitHub, you’d have my eternal thanks.

Drilling Into CSVs — Teaser Trailer

I used reading a directory of CSVs as the foundational example in my recent post on idioms.

During my exchange with Matt, Hadley and a few others — in the crazy Twitter thread that spawned said post — I mentioned that I’d personally “just use Drill”.

I’ll use this post as a bit of a teaser trailer for the actual post (or, more likely, series of posts) that goes into detail on where to get Apache Drill, basic setup of Drill for standalone workstation use and then organizing data with it.

You can get ahead of those posts by doing two things:

Download, install and test your Apache Drill setup (it’s literally 10 minutes on any platform)
Review the U.S. EPA annual air quality data archive (they have individual, annual CSVs that are perfect for the example)

My goals for this post are really just to pique your interest enough in Drill and parquet files (yes, I’m ultimately trying to socially engineer you into using parquet files) to convince you to read the future post(s) and show that it’s worth your time to do Step #1 above.

Getting EPA Air Quality Data

The EPA has air quality data going back to 1990 (so, 27 files as of this post). They’re ~1-4MB ZIP compressed and ~10-30MB uncompressed.

You can use the following code to grab them all, with the caveat that the libcurl method of performing simultaneous downloads caused some pretty severe issues (like R crashing) for some of my students who use Windows. There are plenty of examples for doing sequential downloads of a list of URLs out there, so folks should be able to get all the files even if this succinct method does not work on your platform.

dir.create("airq")

urls <- sprintf("https://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/annual_all_%d.zip", 1990L:2016L)
fils <- sprintf("airq/%s", basename(urls))

download.file(urls, fils, method = "libcurl")

I normally shy away from this particular method since it really hammers the remote server, but this is a beefy U.S. government server, the files are relatively small in number and size and I’ve got a super-fast internet connection (no long-lived sockets) so it should be fine.
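For anyone on a platform where the simultaneous libcurl download misbehaves, a sequential fallback can be sketched in base R along these lines. `download_sequentially()` and its `pause` argument are names I made up for this sketch, and it has not been tested against the EPA server.

```r
# Download each URL to its matching destination, one at a time.
# Returns a named logical vector: TRUE where the download succeeded.
download_sequentially <- function(urls, dests, pause = 1) {
  stopifnot(length(urls) == length(dests))
  ok <- logical(length(urls))
  for (i in seq_along(urls)) {
    ok[i] <- tryCatch({
      # mode = "wb" keeps binary files (such as ZIPs) intact on Windows
      download.file(urls[i], dests[i], mode = "wb", quiet = TRUE) == 0
    }, error = function(e) FALSE)
    Sys.sleep(pause)  # be polite to the remote server
  }
  setNames(ok, basename(dests))
}

# With the vectors defined above this would be:
# download_sequentially(urls, fils)
```

It is slower than the one-liner, but a failed file no longer takes the whole batch (or the R session) down with it.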

Putting all those files under the “control” of Drill is what the next post is for. For now, I’m going to show the basic code and benchmarks for reading in all those files and performing a basic query for all the distinct years. Yes, we know that information already, but it’s a nice, compact task that’s easy to walk through and illustrates the file reading and querying in all three idioms: Drill, tidyverse and data.table.

Data Setup

I’ve converted the EPA annual ZIP files into bzip2 format. ZIP is fine for storage and downloads but it’s not a great format for data analysis tasks. gzip would be slightly faster but it’s not easily splittable and — even though I’m not using the data in a Hadoop context — I think it’s wiser to not have to re-process data later on if I ever had to move raw CSV or JSON data into Hadoop. Uncompressed CSVs are the most portable, but there’s no need to waste space.

All the following files are in a regular filesystem directory accessible to both Drill and R:

> (epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2"))
 [1] "annual_all_1990.csv.bz2" "annual_all_1991.csv.bz2" "annual_all_1992.csv.bz2"
 [4] "annual_all_1993.csv.bz2" "annual_all_1994.csv.bz2" "annual_all_1995.csv.bz2"
 [7] "annual_all_1996.csv.bz2" "annual_all_1997.csv.bz2" "annual_all_1998.csv.bz2"
[10] "annual_all_1999.csv.bz2" "annual_all_2000.csv.bz2" "annual_all_2001.csv.bz2"
[13] "annual_all_2002.csv.bz2" "annual_all_2003.csv.bz2" "annual_all_2004.csv.bz2"
[16] "annual_all_2005.csv.bz2" "annual_all_2006.csv.bz2" "annual_all_2007.csv.bz2"
[19] "annual_all_2008.csv.bz2" "annual_all_2009.csv.bz2" "annual_all_2010.csv.bz2"
[22] "annual_all_2011.csv.bz2" "annual_all_2012.csv.bz2" "annual_all_2013.csv.bz2"
[25] "annual_all_2014.csv.bz2" "annual_all_2015.csv.bz2" "annual_all_2016.csv.bz2"

Drill can directly read plain or compressed JSON, CSV and Apache web server log files plus can treat a directory tree of them as a single data source. It can also read parquet & avro files (both are used frequently in distributed “big data” setups) and access MySQL, MongoDB and other JDBC resources as well as query data stored in Amazon S3 and HDFS (I’ve already mentioned it works fine in plain ‘ol filesystems, too).

I’ve tweaked my Drill configuration to support reading column header info from .csv files (which I’ll show in the next post). In environments like Drill or even Spark, CSV columns are usually queried with some type of column index (e.g. COLUMN[0]) so having named columns makes for less verbose query code.

I turned those individual bzip2 files into parquet format with one Drill query:

CREATE TABLE dfs.pq.`/epa/annual.parquet` AS 
  SELECT * FROM dfs.csv.`/epa/annual/*.csv.bz2`

Future posts will explain the dfs... component but they are likely familiar path specifications for folks used to Spark and are pretty straightforward. The first bit (up to the back-tick) is an internal Drill shortcut to the actual storage path (which is a plain directory in this test) followed by the tail end path spec to the subdirectories and/or target files. That one statement said “take all the CSV files in that directory and make one big table out of them”.

The nice thing about parquet files is that they work much like R data frames in that they can be processed on the column level. We’ll see how that speeds up things in a bit.

Benchmark Setup

The tests were performed on a maxed out 2016 13″ MacBook Pro.

There are 55 columns of data in the EPA annual summary files.

To give both read_csv and fread some benchmark boosts, we’ll define the columns up-front and pass those in to each function on data ingestion. I’ll leave them out of this post for brevity (they’re just a cols() specification and colClasses vector). Drill gets no similar help for this, at least when it comes to CSV processing.

I’m also disabling progress & verbose reporting in both fread and read_csv despite not stopping Drill from writing out log messages.

Now, we need some setup code to connect to Drill and read in the list of files, plus we’ll set up the five benchmark functions to read in all the files and get the list of distinct years from each.

library(sergeant)
library(data.table)
library(microbenchmark)
library(tidyverse)

(epa_annual_fils <- dir("~/Data/csv/epa/annual", "*.csv.bz2", full.names = TRUE))

db <- src_drill("localhost")

# Remember, defining ct & ct_dt - the column types specifications - have been left out for brevity

mb_drill_csv <- function() {
  epa_annual <- tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")
  select(epa_annual, Year) %>% 
    distinct(Year) %>% 
    collect()
}

mb_drill_parquet <- function() {
  epa_annual_pq <- tbl(db, "dfs.pq.`/epa/annual.parquet`")
  select(epa_annual_pq, Year) %>% 
    distinct(Year) %>% 
    collect()
}

mb_tidyverse <- function() {
  map_df(epa_annual_fils, read_csv, col_types = ct, progress = FALSE) -> tmp
  unique(tmp$Year)
}

mb_datatable <- function() {
  rbindlist(
    lapply(
      epa_annual_fils, function(x) { 
        fread(sprintf("bzip2 -c -d %s", x), 
              colClasses = ct_dt, showProgress = FALSE, 
              verbose = FALSE) })) -> tmp
  unique(tmp$Year)
}

mb_rda <- function() {
  read_rds("~/Data/rds/epa/annual.rds") -> tmp
  unique(tmp$Year)
}

microbenchmark(
  csv = { mb_drill_csv()     },
   pq = { mb_drill_parquet() },
   df = { mb_tidyverse()     },
   dt = { mb_datatable()     },
  rda = { mb_rda()           },
  times = 5
) -> mb

Yep, it’s really as simple as:

tbl(db, "dfs.csv.`/epa/annual/*.csv.bz2`")

to have Drill treat a directory tree as a single table. It’s also not necessary for all the columns to be in all the files (i.e. you get the bind_rows/map_df/rbindlist behaviour for “free”).
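If the bind_rows behaviour mentioned there is unfamiliar, here is a small dplyr sketch of what it looks like on the R side. The two tibbles are hypothetical stand-ins for files whose columns only partly overlap. This shows the R behaviour Drill is being compared to, not Drill itself.

```r
library(dplyr)

# Two hypothetical "files" whose columns only partly overlap
y1990 <- tibble(Year = 1990L, State = "ME")
y1991 <- tibble(Year = 1991L, State = "NH", County = "Strafford")

# Columns are matched by name; anything missing is filled with NA
combined <- bind_rows(y1990, y1991)
combined  # County is NA for the 1990 row
```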

I’m only doing 5 evaluations here since I don’t want to make you wait if you’re going to try this at home now or after the Drill series. I’ve run it with a more robust benchmark configuration and the results are aligned with this one.

Unit: milliseconds
 expr        min         lq       mean     median         uq        max neval
  csv 15473.5576 16851.0985 18445.3905 19586.1893 20087.1620 20228.9450     5
   pq   493.7779   513.3704   616.2634   550.5374   732.6553   790.9759     5
   df 41666.1929 42361.1423 42701.2682 42661.9521 43110.3041 43706.7498     5
   dt 37500.9351 40286.2837 41509.0078 42600.9916 43105.3040 44051.5247     5
  rda  9466.6506  9551.7312 10012.8560  9562.9114  9881.8351 11601.1517     5

The R data route, which is the closest to the parquet route, is definitely better than slurping up CSVs all the time. Both parquet and R data files require pre-processing, so they’re not as flexible as having individual CSVs (that may get added hourly or daily to a directory).

Drill’s CSV slurping handily beats the other R methods even with some handicaps the others did not have.
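To put rough numbers on those comparisons, the median column of the benchmark output above can be turned into ratios with a couple of lines of base R. The values are copied from that table, and the variable names are just for this sketch.

```r
# Median timings (milliseconds) copied from the benchmark table above
med <- c(csv = 19586.1893, pq = 550.5374, df = 42661.9521,
         dt = 42600.9916, rda = 9562.9114)

# How many times slower each approach is than the parquet query
vs_pq <- round(med / med[["pq"]], 1)
vs_pq

# How many times slower the two R CSV readers are than Drill's CSV read
vs_csv <- round(med[c("df", "dt")] / med[["csv"]], 1)
vs_csv
```

With only five evaluations per expression these ratios are indicative rather than precise.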

This particular example is gamed a bit, which helped parquet to ultimately “win”. Since Drill can target the singular column (Year) that was asked for, it doesn’t need to read all the extra columns just to compute the final product (the distinct list of years).
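The closest the R CSV readers get to that kind of column targeting is asking for a single column at read time. A minimal data.table sketch on a made-up three-column file follows. This variant was not part of the benchmark above, and since CSV is row-oriented the whole file still has to be scanned, so it only avoids building the unused columns.

```r
library(data.table)

# A small stand-in CSV (hypothetical data, three columns)
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(Year  = c(1990L, 1990L, 1991L, 1992L),
                  State = c("ME", "NH", "ME", "VT"),
                  Mean  = c(1.2, 3.4, 5.6, 7.8)), tmp)

# Ask for one column only; the other columns are never materialised in R
years <- unique(fread(tmp, select = "Year")$Year)
years
```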

IMO both the Drill CSV ingestion and Drill parquet access provide compelling enough use-cases to use them over the other three methods, especially since they are easily transferrable to remote Drill servers or clusters with virtually no code changes. A single node Drillbit (like R) is constrained by the memory on that individual system, so it’s not going to get you out of a memory jam, but it may make it easier to organize and streamline your core data operations before other analysis and visualization tasks.

FIN

I’m sure some member of some other tribe will come up with an example that proves superiority of their particular tribal computations. I’m hoping one of those tribes is the R/Spark tribe so that can get added into the mix (using Spark standalone is much like using Drill, but with more stats/ML functions directly available).

I’m hopeful that this post has showcased enough of Drill’s utility to general R users that you’ll give it a go and consider adding it to your R data analysis toolbox. It can be beneficial to have both precision tools and a Swiss Army knife (which is what Drill really is) handy.

You can find the sergeant package on GitHub.

L.A. Unconf-idential : a.k.a. an rOpenSci #runconf17 Retrospective

Last year, I was able to sit back and lazily “RT” Julia Silge’s excellent retrospective on her 2016 @rOpenSci “unconference” experience. Since Julia was not there this year, and the unconference experience is still in primary storage (LMD v2.0 was a success!) I thought this would be the perfect time for a mindful look-back.

And Now, A Word From…

Hosting a conference is an expensive endeavour. These organizations made the event possible:

At most “conferences” you are inundated with advertising from event sponsors. These folks provided resources and said “do good work”. That makes them all pretty amazing but is also an indicator of the awesomeness of this particular unconference.

All For “Un” and “Un” For All

Over the years, I’ve become much less appreciative of “talking heads” events. Don’t get me wrong. There’s great benefit in being part of a larger group experiencing the same message(s) and getting inspired to understand and investigate new ideas, concepts and technologies. Shining examples of what great “conferences” look like include OpenVis Conf and RStudio’s inaugural self-titled event.

The @rOpenSci “unconference” model is incredibly refreshing.

It has the “get’er done” feel of a hackathon but places less importance on the competitive aspect that is usually paramount in hackathons and increases emphasis on forging links, partnerships and creativity across the diverse R community. It’s really like the Three Musketeers saying “all for one and one for all” since we were all there to help each other build great things to enable R users to build even greater things.

What We Going To Do Tonight, BrainKarthik?

I’ll let you peruse the rOpenSci member list and #runconf17 attendee list at your leisure. Those folks came to Los Angeles to work — not just listen — for two days.

In the grand scheme of things, two days is not much time. It takes many organizations two days to just agree on what conference room they’re going to use for an upcoming internal meeting let alone try to get something meaningful accomplished. In two days, the unconference participants cranked out ~20 working projects. No project had every “i” dotted and every “t” crossed but the vast majority were at Minimum Viable Product status by presentation time on Day 2, and none were “trivial”.

You can read all of the projects at the aforementioned link. Any that I fail to mention here is not a reflection on the project but more a factor of needing to keep this post to a reasonable length. To that end, I’m not even elaborating on the project I mainly worked on with Rich, Steph, Oliver & Jeroen (though it is getting a separate blog post soon).

Want to inspire Minecraft enthusiasts to learn R? There’s an app for that. The vast functional programming power that’s enabled the modern statistics and machine learning revolution is now at the fingertips of any player. On the flip side, you now have tools to create 3D models in a world you can literally walk through — as in, literally stand and watch models of migratory patterns of laden swallows that you’ve developed. Or, make a 3D scatterblock™ diagram and inspect — or destroy with an obsidian axe — interesting clusters. Eliminating data set outliers never felt so cathartic! Or, even create mazes algorithmically and see if your AI-controlled avatar can find its way out.

Want to connect up live sensor (or other live stream) data into an R Shiny project? There’s an app for that. Websockets are a more efficient & versatile way to wire up clients and servers. Amazon’s IoT platform even uses it as one way to push data out from your connected hairbrush. R now has a lightweight way to grab this data.

The team even live-demoed how to pick up accelerometer data from a mobile device and collect + plot it live.

Want vastly improved summaries of your data frames so you can find errors, normalize columns and get to visualization and model development faster? There’s an app for that.

Yes, I — too — SQUEEd at in-console & in-data frame histograms.

There are many more projects for you to investigate and U.S. folks should be thankful for a long weekend so they have time to dive into each of them.

It’s never about the technology. It’s about the people.

(I trust Doctor Who fans will forgive me for usurping Clara’s best line from the Bells of Saint John)

Stefanie, Karthik, Scott & the rest of the rOpenSci team did a phenomenal job organizing and running the unconference. Their efforts ensured it was an open and safe environment for folks to just be themselves.

I got to “see” individuals I’ve only ever previously digitally interacted or collaborated with. Their IRL smiles (a very familiar expression on the faces of attendees during the two-day event) are even wider and brighter than those that come through in their tweets and blog posts.

Each and every attendee I met brought fresh perspectives, unique knowledge, incredible talent and unwavering enthusiasm to the event. Teams and individuals traded ideas and code snippets and provided inspiration and encouragement when not hammering out massive quantities of R code.

You can actually get a mini-unconf experience at any time from the comfort of your own glowing rectangle nesting spot. Pick or start a project, connect with the team and dive in.

FIN

It was great meeting new folks, hanging with familiar faces and having two days to just focus on making things for the R community. I hope more conferences or groups explore the “un” model and look forward to seeing the 2017 projects become production-ready and more folks jumping on board rOpenSci.

R⁶ — Idiomatic (for the People)

NOTE: I’ll do my best to ensure the next post will have nothing to do with Twitter, and this post might not completely meet my R⁶ criteria.

A single, altruistic, nigh exuberant R tweet about slurping up a directory of CSVs devolved quickly — at least in my opinion, and partly (sadly) with my aid — into a thread that ultimately strayed from a crucial point: idiomatic is in the eye of the beholder.

I’m not linking to the twitter thread, but there are enough folks with sufficient Klout scores on it (is Klout even still a thing?) that you can easily find it if you feel so compelled.

I’ll take a page out of the U.S. High School “write an essay” playbook and start with a definition of idiomatic:

using, containing, or denoting expressions that are natural to a native speaker

That comes from idiom:

a form of expression natural to a language, person, or group of people

I usually joke with my students that a strength (and weakness) of R is that there are ~twelve ways to do any given task. While the statement is deliberately hyperbolic, the core message is accurate: there’s more than one way to do most things in R. A cascading truth is: what makes one way more “correct” over another often comes down to idiom.

My rstudio::conf 2017 presentation included an example of my version of using purrr for idiomatic CSV/JSON directory slurping. There are lots of ways to do this in R (the point of the post is not really to show you how to do the directory slurping and it is unlikely that I’ll approve comments with code snippets about that task). Here are three. One from base R tribe, one from the data.table tribe and one from the tidyverse tribe:

# We need some files and we'll use base R to make some
dir.create("readings")
for (i in 1970:2010) write.csv(mtcars, file.path("readings", sprintf("%s.csv", i)), row.names=FALSE)

fils <- list.files("readings", pattern = "\\.csv$", full.names=TRUE)

do.call(rbind, lapply(fils, read.csv, stringsAsFactors=FALSE))

data.table::rbindlist(lapply(fils, data.table::fread))

purrr::map_df(fils, readr::read_csv)

You get data for all the “years” into a data.frame, data.table and tibble (respectively) with those three “phrases”.

However, what if you want the year as a column? Many of these “datalogger” CSV data sets do not have a temporal “grouping” variable as they let the directory structure & naming conventions embed that bit of metadata. That information would be nice, though:

do.call(rbind, lapply(fils, function(x) {
  f <- read.csv(x, stringsAsFactors=FALSE)
  f$year <- gsub("^readings/|\\.csv$", "", x)
  f
}))

dt <- data.table::rbindlist(lapply(fils, data.table::fread), idcol="year")
dt[, year := gsub("^readings/|\\.csv$", "", fils[year])]

purrr::map_df(fils, readr::read_csv, .id = "year") %>% 
  dplyr::mutate(year = stringr::str_replace_all(fils[as.numeric(year)],
                                                "^readings/|\\.csv$", ""))

All three versions do the same thing, and each tribe understands each idiom.
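In the spirit of "more than one way", the year-from-filename step itself has at least two idioms in base R. A short sketch comparing the regex used above with the path helpers, on a hypothetical three-file vector:

```r
fils <- file.path("readings", sprintf("%s.csv", 1970:1972))

# Regex approach used in the snippets above
via_gsub <- gsub("^readings/|\\.csv$", "", fils)

# Path-helper approach: strip the directory, then the extension
via_tools <- tools::file_path_sans_ext(basename(fils))

identical(via_gsub, via_tools)
```

The helper version does not need to know the directory name, which may matter if the files move.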

The data.table and tidyverse versions get you much faster file reading and the ability to “fill” missing columns — another common slurping task. You can hack something together in base R to do column fills (you’ll find a few StackOverflow answers that accomplish such a task) but you will likely decide to choose one of the other idioms for that and become equally as comfortable in that new idiom.
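For the curious, one way such a base R column-fill hack could look is sketched below. `rbind_fill()` is a made-up helper name and the two data frames are toy inputs. It is not taken from any particular StackOverflow answer.

```r
# Bind data frames whose columns differ, filling the gaps with NA
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    # add any columns this data frame is missing
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[all_cols]  # put columns in a common order
  })
  do.call(rbind, padded)
}

a <- data.frame(year = 1970, mpg = 21.0)
b <- data.frame(year = 1971, mpg = 22.8, cyl = 4)
filled <- rbind_fill(list(a, b))
filled
```

It works, but it is noticeably more ceremony than passing fill = TRUE to rbindlist() or letting bind_rows() do the matching.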

There are multiple ways to further extend the slurping example, but that’s not the point of the post.

Each set of snippets contains 100% valid R code. They accomplish the task and are idiomatic for each tribe. Despite what any “mil gun feos turrach na latsa” experts’ exchange would try to tell you, the best idiom is the one that works for you/you & your collaborators and the one that gets you to the real work — data analysis — in the most straightforward & reproducible way possible (for you).

Idiomatic does not mean there’s only a singular One, True Way™, and I think a whole host of us forget that at times.

Write good, clean, error-free, reproducible code.

Choose idioms that work best for you and your collaborators.

Adapt when necessary.

A Very Palette-able Post

UPDATE: I was reminded that I made a more generic version of adobecolor to handle many types of swatch files which you can find on github.

Many of my posts seem to begin with a link to a tweet, and this one falls into that pattern:

And @_inundata is already working on a #rstats palette. https://t.co/bNfpL7OmVl

— Timothée Poisot (@tpoi) May 21, 2017

I’d seen the Ars Tech post about the named color palette derived from some training data. I could tell at a glance of the resultant palette:

that it would not be ideal for visualizations (use this site to test the final image in this post and verify that on your own) but this was a neat, quick project to take on, especially since it let me dust off an old GH package, adobecolor, and it was likely I could beat Karthik to creating a palette ;-)

The “B+” goal is to get a color palette that “matches” the one in the Tumblr post. The “A” goal is to get a named palette.

These are all the packages we end up using:

library(tesseract)
library(magick)
library(stringi)
library(adobecolor) # hrbrmstr/adobecolor - may not be Windows friendly
library(tidyverse)

Attempt #1 (B+!!)

I’m a macOS user, so I’ve got great tools like xScope at my disposal. I’m really handy with that app and the Loupe tool makes it easy to point at a color, save it to a palette board and export an ACO palette file.

That whole process took ~18 seconds (first try). I’m not saying that to brag. But we often get hung up on both speed and programmatic reproducibility. I ultimately — as we’ll see in a bit — really went for speed vs programmatic reproducibility.

It’s dead simple to get the palette into R:

aco_fil <- "ml_cols.aco"
aco_hex <- rev(read_aco(aco_fil))

col2rgb(aco_hex)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## red    112  203   97  191  120  221  169  233  177   216    62   178   199
## green  112  198   92  174  114  196  167  191  138   200    63   184   172
## blue    85  166   73  156  124  199  171  143  109   185    67   196   146
##       [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## red      48   172   177   203   219   162   152   232   197   191
## green    94   152   100   205   210    98   165   177   161   161
## blue     83   145   107   192   179   106   158   135   171   124

IIRC there may still be a byte-order issue (PRs welcome) I need to deal with on Windows in adobecolor but you likely will never need to use the package again.

A quick eyeball comparison between the Tumblr list and that matrix indicates the colors are off. That could be for many reasons, from the way they were encoded in the PNG by whatever programming language was used to train the neural net and make the image (likely Python) to Tumblr degrading it to something on my end. You’ll see that the colors are close enough that human eyes are unlikely to notice the difference.

There, I’ve got a B+ with about a total of 60s of work! Plenty of time left to try shooting for an A!

Attempt #2 (FAIL)

We’ve got the PNG from the Tumblr post and the tesseract package in R. Perhaps this will be super-quick, too:

pal_img_fil <- "tumblr_inline_opgsh0UI6N1rl9zu7_400.png"

pal_ocr <- ocr(pal_img_fil)
stri_split_lines(pal_ocr)
## [[1]]
##  [1] "-ClaniicFug112113 84"      "-Snowhnn.k 201 199165"    
##  [3] "- Cmbabcl 97 93 68"        "-Bunﬂuw 190 174 155"      
##  [5] "-an:hing Blue 121 114125"  "Bank Bun 221 196199"      
##  [7] "- Caring Tan 171 166170"   "-Smrguun 233191 141"      
##  [9] "-Sink 176 131; 110"        "Slummy Beige 216 200135"  
## [11] "- Durkwumi 61 63 66"       "Flow/£1178 1114 196"      
## [13] "- Sand Dan 2111 172143"    "- Grade 136: 41; 94 x3"   
## [15] "-Ligh[OfBlasll75150147"    "-Grass 13m 176 99108"     
## [17] "Sindis Poop 204 205 194"   "Dupe 219 2119179"         
## [19] "-'n:sling156101 106"       "-SloncrElu13152165 159"   
## [21] "- Buxblc Simp 226 1x1 132" "-Sl.mky 13m197162171"     
## [23] "-'J\\milyl90164116"        ""                         
## [25] ""

Ugh.

Perhaps if we crop out the colors:

image_read(pal_img_fil) %>%
  image_crop("+57") %>%
  ocr() %>%
  stri_split_lines()
## [[1]]
##  [1] "Clanﬁc Fug112113 84"       "Snowhunk 201 199 165"     
##  [3] "Cmbabcl 97 93 as"          "Bunﬂuwl90174155"          
##  [5] "Kunming Blue 121 114 125"  "Bank Bun 221196199"       
##  [7] "Caring Tan 171 ms 170"     "Slarguun 233 191 141"     
##  [9] "Sinkl76135110"             ""                         
## [11] "SIIImmy Beige 216 200 135" "Durkwuud e1 63 66"        
## [13] "Flower 175 154 196"        ""                         
## [15] "Sand Dan 201 172 143"      "Grade 1m AB 94: 53"       
## [17] ""                          "Light 0mm 175 150 147"    
## [19] "Grass Ba! 17a 99 ms"       "sxndis Poop 204 205 194"  
## [21] "Dupe 219 209 179"          ""                         
## [23] "Tesling 156 101 106"       "SloncrEluc 152 165 159"   
## [25] "Buxblc Simp 226 131 132"   "Sumky Bean 197 162 171"   
## [27] "1\\mﬂy 190 164 11a"        ""                         
## [29] ""

Ugh.

I’m woefully unfamiliar with how to use the plethora of tesseract options to try to get better performance and this is taking too much time for a toy post, so we’ll call this attempt a failure :-(

Attempt #3 (A-!!)

I’m going to go outside of R again to New OCR and upload the Tumblr palette there and crop out the colors (it lets you do that in-browser). NOTE: Never use any free site for OCR’ing sensitive data as most are run by content thieves.

Now we’re talkin’:

ocr_cols <- "Clardic Fug 112 113 84
Snowbonk 201 199 165
Catbabel 97 93 68
Bunfiow 190 174 155
Ronching Blue 121 114 125
Bank Butt 221 196 199
Caring Tan 171 166 170
Stargoon 233 191 141
Sink 176 138 110
Stummy Beige 216 200 185
Dorkwood 61 63 66
Flower 178 184 196
Sand Dan 201 172 143
Grade Bat 48 94 83
Light Of Blast 175 150 147
Grass Bat 176 99 108
Sindis Poop 204 205 194
Dope 219 209 179
Testing 156 101 106
Stoncr Blue 152 165 159
Burblc Simp 226 181 132
Stanky Bean 197 162 171
Thrdly 190 164 116"

We can get that into a more useful form pretty quickly:

stri_match_all_regex(ocr_cols, "([[:alpha:] ]+) ([[:digit:]]+) ([[:digit:]]+) ([[:digit:]]+)") %>%
  print() %>%
  .[[1]] -> col_mat
## [[1]]
##       [,1]                         [,2]             [,3]  [,4]  [,5] 
##  [1,] "Clardic Fug 112 113 84"     "Clardic Fug"    "112" "113" "84" 
##  [2,] "Snowbonk 201 199 165"       "Snowbonk"       "201" "199" "165"
##  [3,] "Catbabel 97 93 68"          "Catbabel"       "97"  "93"  "68" 
##  [4,] "Bunfiow 190 174 155"        "Bunfiow"        "190" "174" "155"
##  [5,] "Ronching Blue 121 114 125"  "Ronching Blue"  "121" "114" "125"
##  [6,] "Bank Butt 221 196 199"      "Bank Butt"      "221" "196" "199"
##  [7,] "Caring Tan 171 166 170"     "Caring Tan"     "171" "166" "170"
##  [8,] "Stargoon 233 191 141"       "Stargoon"       "233" "191" "141"
##  [9,] "Sink 176 138 110"           "Sink"           "176" "138" "110"
## [10,] "Stummy Beige 216 200 185"   "Stummy Beige"   "216" "200" "185"
## [11,] "Dorkwood 61 63 66"          "Dorkwood"       "61"  "63"  "66" 
## [12,] "Flower 178 184 196"         "Flower"         "178" "184" "196"
## [13,] "Sand Dan 201 172 143"       "Sand Dan"       "201" "172" "143"
## [14,] "Grade Bat 48 94 83"         "Grade Bat"      "48"  "94"  "83" 
## [15,] "Light Of Blast 175 150 147" "Light Of Blast" "175" "150" "147"
## [16,] "Grass Bat 176 99 108"       "Grass Bat"      "176" "99"  "108"
## [17,] "Sindis Poop 204 205 194"    "Sindis Poop"    "204" "205" "194"
## [18,] "Dope 219 209 179"           "Dope"           "219" "209" "179"
## [19,] "Testing 156 101 106"        "Testing"        "156" "101" "106"
## [20,] "Stoncr Blue 152 165 159"    "Stoncr Blue"    "152" "165" "159"
## [21,] "Burblc Simp 226 181 132"    "Burblc Simp"    "226" "181" "132"
## [22,] "Stanky Bean 197 162 171"    "Stanky Bean"    "197" "162" "171"
## [23,] "Thrdly 190 164 116"         "Thrdly"         "190" "164" "116"

The print() is in the pipe as I can never remember where each stringi function sticks its lists (though I usually guess right), plus I wanted to check the output.

Making those into colors is super-simple:

y <- apply(col_mat[,3:5], 2, as.numeric)

ocr_cols <- rgb(y[,1], y[,2], y[,3], names=col_mat[,2], maxColorValue = 255)
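If stringi isn’t handy, the same parse-then-convert step can be sketched in base R with regexec() and regmatches(). This is just an illustration on three of the OCR’d lines from above; `ocr_lines`, `chan` and `pal` are names made up for the sketch.

```r
# Three of the OCR'd lines from above, one palette entry per element.
ocr_lines <- c("Clardic Fug 112 113 84",
               "Snowbonk 201 199 165",
               "Light Of Blast 175 150 147")

# Same pattern as the stringi call, anchored to each whole line:
# a name made of letters and spaces, then three integers.
pat <- "^([[:alpha:] ]+) ([[:digit:]]+) ([[:digit:]]+) ([[:digit:]]+)$"

# regexec() + regmatches() return one character vector per line
# (full match, then the four capture groups); stack them into a matrix.
col_mat <- do.call(rbind, regmatches(ocr_lines, regexec(pat, ocr_lines)))

# Columns 3:5 hold red, green and blue as strings.
chan <- apply(col_mat[, 3:5, drop = FALSE], 2, as.numeric)

pal <- rgb(chan[, 1], chan[, 2], chan[, 3],
           names = col_mat[, 2], maxColorValue = 255)
pal
```

The resulting hex values should agree with the corresponding entries produced by the stringi version.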

If we look at Attempt #1 and Attempt #3 together:

ocr_cols
##    Clardic Fug       Snowbonk       Catbabel        Bunfiow  Ronching Blue 
##      "#707154"      "#C9C7A5"      "#615D44"      "#BEAE9B"      "#79727D" 
##      Bank Butt     Caring Tan       Stargoon           Sink   Stummy Beige 
##      "#DDC4C7"      "#ABA6AA"      "#E9BF8D"      "#B08A6E"      "#D8C8B9" 
##       Dorkwood         Flower       Sand Dan      Grade Bat Light Of Blast 
##      "#3D3F42"      "#B2B8C4"      "#C9AC8F"      "#305E53"      "#AF9693" 
##      Grass Bat    Sindis Poop           Dope        Testing    Stoncr Blue 
##      "#B0636C"      "#CCCDC2"      "#DBD1B3"      "#9C656A"      "#98A59F" 
##    Burblc Simp    Stanky Bean         Thrdly 
##      "#E2B584"      "#C5A2AB"      "#BEA474"

aco_hex
##  [1] "#707055" "#CBC6A6" "#615C49" "#BFAE9C" "#78727C" "#DDC4C7" "#A9A7AB"
##  [8] "#E9BF8F" "#B18A6D" "#D8C8B9" "#3E3F43" "#B2B8C4" "#C7AC92" "#305E53"
## [15] "#AC9891" "#B1646B" "#CBCDC0" "#DBD2B3" "#A2626A" "#98A59E" "#E8B187"
## [22] "#C5A1AB" "#BFA17C"

we can see they’re really close to each other, and I doubt all but the most egregiously picky color snobs can tell the difference visually, too:

par(mfrow=c(1,2))
scales::show_col(ocr_cols)
scales::show_col(aco_hex)
par(mfrow=c(1,1))

(OK, #3E3F43 is definitely hitting my OCD as being annoyingly different than #3D3F42 on my MacBook Pro so count me in as a color snob.)
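Rather than trusting eyeballs alone, the gap between the two palettes can be put into numbers. A rough sketch, assuming both vectors list the colors in the same order (which the outputs above suggest): convert each to RGB with col2rgb() and take the absolute per-channel difference on the 0-255 scale. Both palettes are copied from the output shown earlier; `chan_diff` and `worst` are names invented here.

```r
# Named palette from the OCR'd values (Attempt #3).
ocr_cols <- c(
  "Clardic Fug" = "#707154", "Snowbonk" = "#C9C7A5", "Catbabel" = "#615D44",
  "Bunfiow" = "#BEAE9B", "Ronching Blue" = "#79727D", "Bank Butt" = "#DDC4C7",
  "Caring Tan" = "#ABA6AA", "Stargoon" = "#E9BF8D", "Sink" = "#B08A6E",
  "Stummy Beige" = "#D8C8B9", "Dorkwood" = "#3D3F42", "Flower" = "#B2B8C4",
  "Sand Dan" = "#C9AC8F", "Grade Bat" = "#305E53", "Light Of Blast" = "#AF9693",
  "Grass Bat" = "#B0636C", "Sindis Poop" = "#CCCDC2", "Dope" = "#DBD1B3",
  "Testing" = "#9C656A", "Stoncr Blue" = "#98A59F", "Burblc Simp" = "#E2B584",
  "Stanky Bean" = "#C5A2AB", "Thrdly" = "#BEA474"
)

# Color-picked palette from the ACO file (Attempt #1), assumed same order.
aco_hex <- c(
  "#707055", "#CBC6A6", "#615C49", "#BFAE9C", "#78727C", "#DDC4C7", "#A9A7AB",
  "#E9BF8F", "#B18A6D", "#D8C8B9", "#3E3F43", "#B2B8C4", "#C7AC92", "#305E53",
  "#AC9891", "#B1646B", "#CBCDC0", "#DBD2B3", "#A2626A", "#98A59E", "#E8B187",
  "#C5A1AB", "#BFA17C"
)

# col2rgb() returns a 3 x 23 matrix (rows red/green/blue) per palette;
# the absolute difference is the per-channel gap for each color.
chan_diff <- abs(col2rgb(ocr_cols) - col2rgb(aco_hex))

# Worst channel per color, largest gaps first.
worst <- sort(apply(chan_diff, 2, max), decreasing = TRUE)
head(worst)
```

Raw RGB distance is only a crude proxy for perceived difference, but it is enough to show which names drifted the most between the two attempts.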

Here’s the final palette:

structure(c("#707154", "#C9C7A5", "#615D44", "#BEAE9B", "#79727D", 
"#DDC4C7", "#ABA6AA", "#E9BF8D", "#B08A6E", "#D8C8B9", "#3D3F42", 
"#B2B8C4", "#C9AC8F", "#305E53", "#AF9693", "#B0636C", "#CCCDC2", 
"#DBD1B3", "#9C656A", "#98A59F", "#E2B584", "#C5A2AB", "#BEA474"
), .Names = c("Clardic Fug", "Snowbonk", "Catbabel", "Bunfiow", 
"Ronching Blue", "Bank Butt", "Caring Tan", "Stargoon", "Sink", 
"Stummy Beige", "Dorkwood", "Flower", "Sand Dan", "Grade Bat", 
"Light Of Blast", "Grass Bat", "Sindis Poop", "Dope", "Testing", 
"Stoncr Blue", "Burblc Simp", "Stanky Bean", "Thrdly"))

This third attempt took ~5 minutes vs 60s.

FIN

Why “A-”? Well, I didn’t completely verify that the colors and values matched 100% in the final submission. They are likely the same, but the best way to get something corrected by others is to put it on the internet, so there it is :-)

I’d be a better human and coder if I took the time to learn tesseract more, but I don’t have much need for OCR’ing text. It is likely worth the time to brush up on tesseract after you read this post.

Don’t use this palette! I created it mostly to beat Karthik to making the palette (I have no idea if I succeeded), to also show that you should not forego your base R roots (I could have let that be subliminal but I wasn’t trying to socially engineer you in this post) and to bring up the speed/reproducibility topic. I see no issues with manually doing tasks (like uploading an image to a web site) in certain circumstances, but it’d be an interesting topic of debate to see just what “rules” folks use to determine how much effort one should put into 100% programmatic reproducibility.

You can find the ACO file and an earlier, alternate attempt at making the palette in this gist.