R⁶ — Idiomatic (for the People)

NOTE: I’ll do my best to ensure the next post will have nothing to do with Twitter, and this post might not completely meet my R⁶ criteria.

A single, altruistic, nigh exuberant R tweet about slurping up a directory of CSVs devolved quickly — at least in my opinion, and partly (sadly) with my aid — into a thread that ultimately strayed from a crucial point: idiomatic is in the eye of the beholder.

I’m not linking to the Twitter thread, but there are enough folks with sufficient Klout scores on it (is Klout even still a thing?) that you can easily find it if you feel so compelled.

I’ll take a page out of the U.S. High School “write an essay” playbook and start with a definition of idiomatic:

using, containing, or denoting expressions that are natural to a native speaker

That comes from idiom:

a form of expression natural to a language, person, or group of people

I usually joke with my students that a strength (and weakness) of R is that there are ~twelve ways to do any given task. While the statement is deliberately hyperbolic, the core message is accurate: there’s more than one way to do most things in R. A cascading truth is: what makes one way more “correct” than another often comes down to idiom.

My rstudio::conf 2017 presentation included an example of my version of using purrr for idiomatic CSV/JSON directory slurping. There are lots of ways to do this in R (the point of this post is not really to show you how to do the directory slurping, and it is unlikely that I’ll approve comments with code snippets about that task). Here are three: one from the base R tribe, one from the data.table tribe, and one from the tidyverse tribe:

# We need some files and we'll use base R to make some
dir.create("readings")
for (i in 1970:2010) write.csv(mtcars, file.path("readings", sprintf("%s.csv", i)), row.names=FALSE)

# Escape the dot so the pattern matches only a literal ".csv" extension
fils <- list.files("readings", pattern = "\\.csv$", full.names=TRUE)

# base R
do.call(rbind, lapply(fils, read.csv, stringsAsFactors=FALSE))

# data.table
data.table::rbindlist(lapply(fils, data.table::fread))

# tidyverse
purrr::map_df(fils, readr::read_csv)

You get data for all the “years” into a data.frame, data.table and tibble (respectively) with those three “phrases”.

However, what if you want the year as a column? Many of these “datalogger” CSV data sets do not have a temporal “grouping” variable as they let the directory structure & naming conventions embed that bit of metadata. That information would be nice, though:

do.call(rbind, lapply(fils, function(x) {
  f <- read.csv(x, stringsAsFactors=FALSE)
  f$year <- gsub("^readings/|\\.csv$", "", x)
  f
}))

# With an unnamed list, idcol= records each file's position in fils
dt <- data.table::rbindlist(lapply(fils, data.table::fread), idcol="year")
dt[, year := gsub("^readings/|\\.csv$", "", fils[year])]

library(magrittr) # attaches %>%, which the dplyr:: prefix alone does not

purrr::map_df(fils, readr::read_csv, .id = "year") %>%
  dplyr::mutate(year = stringr::str_replace_all(fils[as.numeric(year)],
                                                "^readings/|\\.csv$", ""))

All three versions do the same thing, and each tribe understands each idiom.
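
All three also lean on a regular expression that assumes the directory is named exactly "readings". As a minimal sketch (not something from my talk), here is a base R variant that takes the year from the file name alone, assuming each file is named for its year. The helper name `read_with_year` and the temporary `readings-demo` directory are made up for illustration:

```r
# Hypothetical helper (not from the original post): read one CSV and tag it
# with the year taken from its file name, whatever directory it lives in.
read_with_year <- function(path) {
  f <- read.csv(path, stringsAsFactors = FALSE)
  # basename() drops the directory; file_path_sans_ext() drops ".csv"
  f$year <- tools::file_path_sans_ext(basename(path))
  f
}

# Build a small throwaway directory so the sketch is self-contained
demo_dir <- file.path(tempdir(), "readings-demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1970:1972) {
  write.csv(mtcars, file.path(demo_dir, sprintf("%s.csv", i)), row.names = FALSE)
}

demo_fils <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
all_years <- do.call(rbind, lapply(demo_fils, read_with_year))

table(all_years$year)  # counts rows per source file
```

The same two function calls could replace the `gsub()` or `str_replace_all()` step in the data.table and tidyverse phrases; which one reads better is, again, a matter of idiom.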

The data.table and tidyverse versions get you much faster file reading and the ability to “fill” missing columns, another common slurping task. You can hack something together in base R to do column fills (you’ll find a few StackOverflow answers that accomplish such a task), but you will likely decide to choose one of the other idioms for that and become equally comfortable in that new idiom.
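
For the curious, a base R column fill might look something like the sketch below. It is one possible hack, not a canonical answer: `rbind_fill` is a hypothetical helper name, and it assumes plain (non-factor) columns. In the other idioms, `data.table::rbindlist()` takes `fill = TRUE` and `dplyr::bind_rows()` pads missing columns with `NA` on its own.

```r
# Hypothetical helper (not from the original post): row-bind data frames
# whose columns differ, padding the missing columns with NA.
rbind_fill <- function(dfs) {
  # Every column name seen in any input, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- rep(NA, nrow(d))  # add each absent column as NA
    }
    d[all_cols]                     # put columns in a common order
  })
  do.call(rbind, padded)
}

a <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
b <- data.frame(x = 3L, z = 9.5)

filled <- rbind_fill(list(a, b))
filled
```

It works, but it is more typing than a single argument, which is exactly the kind of trade-off that nudges people toward another idiom.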

There are multiple ways to further extend the slurping example, but that’s not the point of the post.

Each set of snippets contains 100% valid R code. They accomplish the task and are idiomatic for each tribe. Despite what any “mil gun feos turrach na latsa” experts’ exchange would try to tell you, the best idiom is the one that works for you/you & your collaborators and the one that gets you to the real work — data analysis — in the most straightforward & reproducible way possible (for you).

Idiomatic does not mean there’s only a singular One, True Way™, and I think a whole host of us forget that at times.

Write good, clean, error-free, reproducible code.

Choose idioms that work best for you and your collaborators.

Adapt when necessary.


Comments

  1. spacedman

    Python has done well at making its programmers a single tribe, thanks to “There should be one-- and preferably only one --obvious way to do it.” from the Zen of Python. Perl, however, has made a point of making every person a tribe of their own with the TMTOWTDI attitude prevailing. Your mention of 12 ways to do something would be an understatement for Perl. R is getting towards that, and what’s more it’s being driven by add-on packages rather than features in the core language. By the time we can type a pipe symbol with fewer than three keystrokes (keyboard macros notwithstanding) we’ll have moved on to something else. How many idiom tribes are there for Julia code?

  4. groundwalkergmb

    Interesting post. With respect, though, I’m not sure that’s the most relevant definition of idiomatic for the topic. The way I hear/see it used in a programming setting (e.g., see the accepted comment here https://stackoverflow.com/questions/84102/what-is-idiomatic-code) it means adhering to the conventions of “the language”, rather than the intuition of the programmer.

    When I use the term non-idiomatic in the R context, I’m generally talking about things like for loops, or objects with reference mechanics (e.g., ReferenceClass or R6 objects). These things, particularly the latter, are non-idiomatic (in R) in a way that’s actually important (imho). Note also that for loops might be more intuitive than apply/map statements to many novice R programmers, but are still non-idiomatic (when not actually necessary) in an R setting.

    Generally I think anyone would be pretty hard-pressed to claim that reasonably written base R code is non-idiomatic under that definition. That said, I think it’s easy to consider tidyverse code idiomatic as well, particularly since that is the flavor of R code that R users tend to think in. The one exception to this would be the heavy use of non-standard evaluation in parts of the tidyverse, which could be considered at least mildly non-idiomatic essentially by definition (thus the non-standard).

    Ultimately, though, I agree with what I think your main point was. Tribal warfare within the R community has just as much value as language warfare between R and Python. I.e., none at all.

    1. Matt Dowle (@MattDowle)

      There is value. Calling it ‘tribal warfare’ is unhelpful. I see it more as informing about and defending prior work, experimentation and comparison. That knowledge sharing needs to happen. Put it another way: if the thread didn’t happen, we wouldn’t have found out about sergeant or readbulk and the original tweeter wouldn’t have discovered that full.names= exists. That new knowledge has value.

      1. hrbrmstr

        I’ve seen the exchanges before. It’s most certainly tribal. I’ll also note many folks who keep up with the broader R universe and not just one part of it knew about sergeant well before today. I’m not devaluing the discovery component, but it would take me less than an hour to show the tribal zealotry on at least the last two idioms. Passion for a project is one thing. Said passion at the expense of acknowledgement of the significant efficacy of others is not.

        1. groundwalkergmb

          Indeed, I think it is easy to present a “new/alternate idiom” if that’s what we’d like to call it, in a way that isn’t tribal, but as with Bob I often see it … not presented that way.

          Comparisons are good; dismissing idioms out-of-hand that clearly work for many people (whether it’s the base R API on one side, or the pipes on another, as two examples) is less so. The out-of-hand is important there, though. There are reasons I don’t use the pipes, and I’m happy to discuss them, but that doesn’t mean “pipes are bad”. Even in my personal view of them it’s more nuanced than that, and many people love them. I would hope for the same consideration regarding things I do that are not particularly “in vogue”, rather than a “that is stupid/bad/the wrong way to do it”.

          1. groundwalkergmb

            An example of this, from someone I like and respect very much (who will remain unnamed), is the “stringsAsFactors=HELLNO” ribbon that was going around JSM a couple years ago. That’s a very tribal way of approaching that issue…

  5. Joris Meys

    Nice blogpost, and I have generally the same idea. I also stress there are multiple ways to do it, and I even show at least parts of the tidyverse. It does make a lot of data preparation work a lot faster.

    But I always add that “if you want your code to work 2 years from now, base R is the safer route. If you want to write a package, be very, very careful with tidyverse.” After having to rewrite code several times because Hadley and friends had some marvellous new ideas – including moving functions to entirely new packages – I realized that “reproducible” isn’t their main concern. Unless you use packrat, obviously. But that doesn’t beat R code that’s (almost) guaranteed to work 10 years from now.
