

NOTE: The waffle package (sans JavaScript-y goodness) is up on CRAN so you can do an install.packages("waffle") and library(waffle) vs the devtools dance.

My disdain for pie charts is fairly well-known, but I do concede that there are times one needs to communicate parts of a whole graphically versus using just words or a table. When that need arises, I’m partial to “waffle charts” or “square pie charts”. @eagereyes did a great post a while ago on them (make sure to read the ‘debate’ between Robert and @hadleywickham in the comments, too), so head there for the low-down. Rather than have every waffle chart I make be a one-off creation, I made an R package for them.

There is currently one function in the package — waffle — and said function doesn’t mimic all the goodness of these charts as described in Robert’s post (yet). It does, however, do a pretty decent job covering the basics. Let’s take the oft-cited New York Times “debt” graphic:

The New York Times household debt graphic

We can replicate that pretty closely in R. To make it as simple as possible, the waffle function takes a named numeric vector. If no names are specified, or you leave some names out, LETTERS will be used to fill in the gaps. The function takes your data quite literally, so if you give it a vector that sums to, say, 10,000, then the function will try to create a ggplot object with 10,000 geom_rect elements. Needless to say, that’s a bad idea. So, I suggest keeping the raw numbers in the vector and passing in a scaled version of the vector to the function. That way, you can play with the values to get the desired look. Here’s the R version of the NYT graphic:

# devtools::install_github("hrbrmstr/waffle")
library(waffle)
savings <- c(`Mortgage ($84,911)`=84911, `Auto and\ntuition loans ($14,414)`=14414, 
             `Home equity loans ($10,062)`=10062, `Credit Cards ($8,565)`=8565)
waffle(savings/392, rows=7, size=0.5, 
       colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"), 
       title="Average Household Savings Each Year", 
       xlab="1 square == $392")

Average Household Savings Each Year

This package evolved from a teensy gist I made earlier this year to help communicate the scope of the Anthem data breach in the US. Since then, a breach at Premera occurred and added to the tally. Here are two views of that data, one with one square equaling one million people and another with one square equaling ten million people (using the blue shade from each company’s logo):

parts <- c(`Un-breached\nUS Population`=(318-11-79), `Premera`=11, `Anthem`=79)
 
waffle(parts, rows=8, size=1, colors=c("#969696", "#1879bf", "#009bda"), 
       title="Health records breaches as fraction of US Population", 
       xlab="One square == 1m ppl")

Health records breaches as fraction of US Population (one square == 1m ppl)

waffle(parts/10, rows=3, colors=c("#969696", "#1879bf", "#009bda"), 
       title="Health records breaches as fraction of US Population", 
       xlab="One square == 10m ppl"

Health records breaches as fraction of US Population (one square == 10m ppl)

I’m betting that gets a lot bluer by the end of the year.

The function returns a ggplot object, so fonts, sizes, etc. can all be customized, and the source is up on GitHub for all to play with and contribute to.
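For instance, here’s a quick sketch of tweaking the title styling (the theme values are just examples; it reuses the `savings` vector from above):

library(ggplot2)
 
# waffle() hands back a ggplot object, so standard ggplot2 themes/layers apply
chart <- waffle(savings/392, rows=7,
                colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"),
                title="Average Household Savings Each Year")
chart + theme(plot.title=element_text(size=16, face="bold"))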

Along with adding support for filling in the chart as shown in the @eagereyes post, an htmlwidget version is coming as well. Standard drill applies: issues/enhancements to github issues, feedback and your own examples in the comments.

UPDATE

Thanks to a PR by @timelyportfolio, there is now a widget option in the package.

I’ve been seeing an uptick in static US “lower 48” maps with “meh” projections this year, possibly caused by a flood of new folks resolving to learn R but using pretty old documentation or tutorials. I’ve also been seeing an uptick in folks needing to geocode US city/state to lat/lon. I thought I’d tackle both in a quick post to show how to (simply) use a decent projection for lower 48 US maps and then how to use a _very_ basic package I wrote – [localgeo](http://github.com/hrbrmstr/localgeo) to avoid having to use an external API/service for basic city/state geocoding.

### Albers All The Way

I could just plot an Albers projected map, but it’s more fun to add some data. We’ll start with some setup libraries and then read in some recent earthquake data, then filter it for our map display:

library(ggplot2)
library(dplyr)
library(readr) # devtools::install_github("hadley/readr")
 
# Earthquakes -------------------------------------------------------------
 
# get quake data ----------------------------------------------------------
quakes <- read_csv("http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_month.csv")
 
# filter all but lower 48 US ----------------------------------------------
quakes %>%
  filter(latitude>=24.396308, latitude<=49.384358,
         longitude>=-124.848974, longitude<=-66.885444) -> quakes
 
# bin by .5 ---------------------------------------------------------------
quakes$Magnitude <- as.numeric(as.character(cut(quakes$mag, breaks=c(2.5, 3, 3.5, 4, 4.5, 5),
    labels=c(2.5, 3, 3.5, 4, 4.5), include.lowest=TRUE)))

Many of my mapping posts use quite a few R geo libraries, but this one just needs `ggplot2`. We extract the US map data, turn it into something `ggplot` can work with, then plot our quakes on the map:

us <- map_data("state")
us <- fortify(us, region="region")
 
# theme_map ---------------------------------------------------------------
devtools::source_gist("33baa3a79c5cfef0f6df")
 
# plot --------------------------------------------------------------------
gg <- ggplot()
gg <- gg + geom_map(data=us, map=us,
                    aes(x=long, y=lat, map_id=region, group=group),
                    fill="#ffffff", color="#7f7f7f", size=0.25)
gg <- gg + geom_point(data=quakes,
                      aes(x=longitude, y=latitude, size=Magnitude),
                      color="#cb181d", alpha=1/3)
gg <- gg + coord_map("albers", lat0=39, lat1=45)
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg

2.5+ mag quakes in Lower US 48 in past 30 days


### Local Geocoding

There are many APIs with corresponding R packages/functions to perform geocoding (one really spiffy recent one is [geocodeHERE](http://cran.r-project.org/web/packages/geocodeHERE/)). While Nokia’s service is less restrictive than Google’s, most of these sites are going to have some kind of restriction on the number of calls per second/minute/day. You could always install the [Data Science Toolkit](http://www.datasciencetoolkit.org/) locally (note: it was down as of the original posting of this blog) and perform the geocoding locally, but it does take some effort (and space/memory) to setup and get going.

If you have relatively clean data and only need city/state resolution, you can use a package I made – [localgeo](http://github.com/hrbrmstr/localgeo) as an alternative. I took a US Gov census shapefile and extracted city, state, lat, lon into a data frame and put a lightweight function shim over it (it’s doing nothing more than `dplyr::left_join`). It won’t handle nuances like “St. Paul, MN” == “Saint Paul, MN” and, for now, it requires you to do the city/state splitting, but I’ll be tweaking it over the year to be a bit more forgiving.
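If you’re curious, the shim amounts to something like this sketch (the function name and the `lookup` data frame here are illustrative, not the actual package internals):

# a sketch of the idea: just a left join against a bundled
# city/state/lat/lon lookup table (not the real package code)
geocode_sketch <- function(city, state, lookup) {
  dplyr::left_join(data.frame(city=city, state=state, stringsAsFactors=FALSE),
                   lookup, by=c("city", "state"))
}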

We can give this a go and map the [greenest cities in the US in 2014](http://www.nerdwallet.com/blog/cities/greenest-cities-america/) as crowned by, er, Nerd Wallet. I went for “small data file with city/state in it”, so if you know of a better source I’ll gladly use it instead. Nerd Wallet used DataWrapper, so getting the actual data was easy and here’s a small example of how to get the file, perform the local geocoding and use an Albers projection for plotting the points. The code below assumes you’re still in the R session that used some of the `library` calls earlier in the post.

library(httr)
library(localgeo) # devtools::install_github("hrbrmstr/localgeo")
library(tidyr)
 
# greenest cities ---------------------------------------------------------
# via: http://www.nerdwallet.com/blog/cities/greenest-cities-america/
 
url <- "https://gist.githubusercontent.com/hrbrmstr/1078fb798e3ab17556d2/raw/53a9af8c4e0e3137a0a8d4d6332f7a6073d93fb5/greenest.csv"
greenest <- read.table(text=content(GET(url), as="text"), sep=",", header=TRUE, stringsAsFactors=FALSE)
 
greenest %>%
  separate(City, c("city", "state"), sep=", ") %>%
  filter(!state %in% c("AK", "HI")) -> greenest
 
greenest_geo <- geocode(greenest$city, greenest$state)
 
gg <- ggplot()
gg <- gg + geom_map(data=us, map=us,
                    aes(x=long, y=lat, map_id=region, group=group),
                    fill="#ffffff", color="#7f7f7f", size=0.25)
gg <- gg + geom_point(data=greenest_geo,
                      aes(x=lon, y=lat),
                      shape=21, color="#006d2c", fill="#a1d99b", size=4)
gg <- gg + coord_map("albers", lat0=39, lat1=45)
gg <- gg + labs(title="Greenest Cities")
gg <- gg + theme_map()
gg <- gg + theme(legend.position="right")
gg

Nerd Wallet’s Greenest US (Lower 48) Cities 2014


Let me reinforce that the `localgeo` package will most assuredly fail to geocode some city/state combinations. I’m looking for a more comprehensive shapefile to ensure I have the complete list of cities and I’ll be adding some code to help make the lookups more forgiving. It may at least help when you bump into an API limit and need to crank out something in a hurry.
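In the meantime, a little client-side normalization before calling `geocode` can help with the obvious variants; something like this (purely a sketch, not part of the package):

# sketch: normalize common city-name variants before the lookup
normalize_city <- function(x) {
  x <- tolower(trimws(x))
  gsub("^st\\.? ", "saint ", x)
}
 
normalize_city(c("St. Paul", "Saint Paul"))  # both become "saint paul"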

In preparation for using some of our streamgraphs for production (PDF/print) graphics, I ended up having to hand-edit labels on one of the graphics in an Adobe product. This bumped up the priority on adding annotation functions to the streamgraph package (you really don’t want to have to hand-edit graphics if at all possible, trust me). To illustrate them, I’ll use unemployment data that I started gathering for a course I’m teaching in the Fall.

We’ll start with the setup and initial data gathering:

library(dplyr)
library(streamgraph)
library(pbapply)
 
url <- "http://www.bls.gov/lau/ststdsadata.txt"
dat <- readLines(url)

This data is not exactly in a happy format (hit the URL in your browser and you’ll see what I mean). It was definitely made for line printers/human consumption and I feel bad for any human who has to stare at it. The function I’m using to extract the data isn’t how I’d read in the whole file for real work (it’s geared more toward teaching than optimization), but it’ll do for our purposes here:

get_state_data <- function(state) {
 
  section <- paste("^%s|    (", paste0(month.name, sep="", collapse="|"), ")\ +[[:digit:]]{4}", sep="", collapse="")
  section <- sprintf(section, state)
  vals <- gsub("^\ +|\ +$", "", grep(section, dat, value=TRUE))
 
  state_vals <- gsub("^.* \\.+", "", vals[seq(from=2, to=length(vals), by=2)])
 
  cols <- read.table(text=state_vals)
  cols$month <- as.Date(sprintf("01 %s", vals[seq(from=1, to=length(vals), by=2)]),
                        format="%d %B %Y")
  cols$state <- state
 
  cols %>%
    select(8:9, 1:7) %>%   # month, state, then the seven value columns
    mutate(V1=as.numeric(gsub(",", "", V1)),
           V2=as.numeric(gsub(",", "", V2)),
           V4=as.numeric(gsub(",", "", V4)),
           V6=as.numeric(gsub(",", "", V6)),
           V3=V3/100,
           V5=V5/100,
           V7=V7/100) %>%
    rename(civ_pop=V1,
           labor_force=V2, labor_force_pct=V3,
           employed=V4, employed_pct=V5,
           unemployed=V6, unemployed_pct=V7)
 
}
 
state_unemployment <- bind_rows(pblapply(state.name, get_state_data))

This will give us a data frame of employment/unemployment rates for all the (US) states. I only wanted to focus on New England and a few others for the course example, so this bit filters the data down to just those states:

state_unemployment %>%
  filter(state %in% c("California", "Ohio", "Rhode Island", "Maine",
                      "Massachusetts", "Connecticut", "Vermont",
                      "New Hampshire", "Nebraska")) -> some

With that setup out of the way, let me introduce the two new functions: `sg_add_marker` and `sg_annotate`. `sg_add_marker` adds a vertical, dotted line that spans the height of the graph and is placed at the designated spot on the x axis. You can add an optional label for the marker by specifying the y position, label text, color, size, space away from the line and how it’s anchored: start (left), middle (center) or end (right). The anchor is primarily useful for placing the label on either side of the line.

`sg_annotate` is for adding text anywhere on the streamgraph. The original use for it was to label streams, but you can use it any way you think would add meaning to your streamgraph. You can see them both in action below, where I plot the streamgraph for unemployment (%) for the selected states, then label the start of each recession since 1980 (with the peak national unemployment rate) with a marker and also label each stream:

streamgraph(some, "state", "unemployed_pct", "month") %>%
  sg_axis_x(tick_interval=10, tick_units = "year", tick_format="%Y") %>%
  sg_axis_y(0) %>%
  sg_add_marker(x=as.Date("1981-07-01"), "1981 (10.8%)", anchor="end") %>%
  sg_add_marker(x=as.Date("1990-07-01"), "1990 (7.8%)", anchor="start") %>%
  sg_add_marker(x=as.Date("2001-03-01"), "2001 (6.3%)", anchor="end") %>%
  sg_add_marker(x=as.Date("2007-12-01"), "2007 (10.1%)", anchor="end") %>%
  sg_annotate(label="Vermont", x=as.Date("1978-04-01"), y=0.6, color="#ffffff") %>%
  sg_annotate(label="Maine", x=as.Date("1978-03-01"), y=0.30, color="#ffffff") %>%
  sg_annotate(label="Nebraska", x=as.Date("1977-06-01"), y=0.41, color="#ffffff") %>%
  sg_annotate(label="Massachusetts", x=as.Date("1977-06-01"), y=0.36, color="#ffffff") %>%
  sg_annotate(label="New Hampshire", x=as.Date("1978-03-01"), y=0.435, color="#ffffff") %>%
  sg_annotate(label="California", x=as.Date("1978-02-01"), y=0.175, color="#ffffff") %>%
  sg_annotate(label="Rhode Island", x=as.Date("1977-11-01"), y=0.55, color="#ffffff") %>%
  sg_annotate(label="Ohio", x=as.Date("1978-06-01"), y=0.485, color="#ffffff") %>%
  sg_annotate(label="Connecticut", x=as.Date("1978-01-01"), y=0.235, color="#ffffff") %>%
  sg_fill_tableau() %>%
  sg_legend(show=TRUE)

Selected State Unemployment Figures Since 1976

I probably could have positioned the annotations a bit better, but this should be a good enough example to get the general idea. I may add an option to place the marker vertical lines behind the streamgraph and will be adding some toggle options to the interactive version (to hide/show markers and/or annotations).

As usual, the package is up [on github](https://github.com/hrbrmstr/streamgraph) and a contiguous copy of the above snippets is in [this gist](https://gist.github.com/hrbrmstr/4e181ae045807ca3a858).

Three final notes. First, I suggest enabling the y axis when you’re trying to figure out where the y position for a label should be (since the y axis range is calculated by the summed span of the data). Second, the x axis works with both dates and continuous values, but you need to match what you set up the streamgraph with. Finally, just a tip: I’ve found [SVG Crowbar 2](http://nytimes.github.io/svg-crowbar/) to be super-helpful when I need to extract these streamgraphs out for non-interactive reproduction. Just yank the SVG out with it and hand it (or a converted form of it) to whomever is handling final production and they should be able to work with it.

I noticed that the @rOpenSci folks had an interface to [ip-api.com](http://ip-api.com/) on their [ToDo](https://github.com/ropensci/webservices/wiki/ToDo) list so I whipped up a small R package to fill said gap.

Their IP Geolocation API will take an IPv4 or IPv6 address or an FQDN and kick back an ASN, lat/lon, address and more. The [ipapi package](https://github.com/hrbrmstr/ipapi) exposes one function – `geolocate` – which takes in a character vector of any mixture of IPv4/6 addresses and domains and returns a `data.table` of results. Since `ip-api.com` has a restriction of 250 requests per minute, the package also tries to help ensure you don’t get your own IP address banned (there’s a form on their site you can fill in to get it unbanned if you do happen to hit the limit). Overall, there’s nothing fancy in the package, but it gets the job done.
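If you do need more than 250 lookups in one go, a simple client-side throttle (independent of whatever safeguards the package applies internally, and assuming `library(ipapi)` is loaded) could look like:

# chunk the requests and sleep between chunks to stay under the cap
geolocate_throttled <- function(hosts, per_minute=240) {
  chunks <- split(hosts, ceiling(seq_along(hosts)/per_minute))
  do.call(rbind, lapply(seq_along(chunks), function(i) {
    if (i > 1) Sys.sleep(61)  # pause before every chunk after the first
    geolocate(chunks[[i]])
  }))
}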

I notified the rOpenSci folks about it, so hopefully it’ll be one less thing on that particular to-do list.

You can see it in action in combination with the super-spiffy [leaflet](http://www.htmlwidgets.org/showcase_leaflet.html) htmlwidget:

library(leaflet)
library(ipapi)
library(maps)
 
# get top 500 domains
sites <- read.csv("http://moz.com/top500/domains/csv", stringsAsFactors=FALSE)
 
# make reproducible
set.seed(1492)
 
# pick out a random 50
sites <- sample(sites$URL, 50) 
sites <- gsub("/", "", sites)
locations <- geolocate(sites)
 
# take a quick look
dplyr::glimpse(locations)
 
## Observations: 50
## Variables:
## $ as          (fctr) AS2635 Automattic, Inc, AS15169 Google Inc., AS3561...
## $ city        (fctr) San Francisco, Mountain View, Chesterfield, Mountai...
## $ country     (fctr) United States, United States, United States, United...
## $ countryCode (fctr) US, US, US, US, US, US, JP, US, US, IT, US, US, US,...
## $ isp         (fctr) Automattic, Google, Savvis, Google, Level 3 Communi...
## $ lat         (dbl) 37.7484, 37.4192, 38.6631, 37.4192, 38.0000, 33.7516...
## $ lon         (dbl) -122.4156, -122.0574, -90.5771, -122.0574, -97.0000,...
## $ org         (fctr) Automattic, Google, Savvis, Google, AddThis, Peer 1...
## $ query       (fctr) 192.0.80.242, 74.125.227.239, 206.132.6.134, 74.125...
## $ region      (fctr) CA, CA, MO, CA, , GA, 13, MA, TX, , MA, TX, CA, , ,...
## $ regionName  (fctr) California, California, Missouri, California, , Geo...
## $ status      (fctr) success, success, success, success, success, succes...
## $ timezone    (fctr) America/Los_Angeles, America/Los_Angeles, America/C...
## $ zip         (fctr) 94110, 94043, 63017, 94043, , 30303, , 02142, 78218...
 
# all i want is the world!
world <- map("world", fill = TRUE, plot = FALSE) 
 
# kick out a widget
leaflet(data=world) %>% 
  addTiles() %>% 
  addCircleMarkers(locations$lon, locations$lat, 
                   color = '#ff0000', popup=sites)

50 Random Top Sites

A post on [StackOverflow](http://stackoverflow.com/questions/28725604/streamgraphs-dataviz-in-r-wont-plot) asked about using a continuous variable for the x-axis (vs dates) in my [streamgraph package](http://github.com/hrbrmstr/streamgraph). While I provided a workaround for the question, it helped me bump up the priority for adding support for continuous x axis scales. With the [DBIR](http://www.verizonenterprise.com/DBIR/) halfway behind me now, I kicked out a new rev of the package/widget that has support for continuous scales.

Using the data from the SO post, you can see there’s not much difference in how you use continuous vs date scales:

library(streamgraph)
 
dat <- read.table(text="week variable value
40     rev1  372.096
40     rev2  506.880
40     rev3 1411.200
40     rev4  198.528
40     rev5   60.800
43     rev1  342.912
43     rev2  501.120
43     rev3  132.352
43     rev4  267.712
43     rev5   82.368
44     rev1  357.504
44     rev2  466.560", header=TRUE)
 
dat %>% 
  streamgraph("variable","value","week", scale="continuous") %>% 
  sg_axis_x(tick_format="d")

Product Revenue

I’ll be adding support for using a categorical variable on the x axis soon. Once that’s done, it’ll be time to do the CRAN dance.

We were looking for a different type of visualization for a project at work this past week and my thoughts immediately gravitated towards [streamgraphs](http://www.leebyron.com/else/streamgraph/). The TLDR on streamgraphs is that they are generalized versions of stacked area graphs with free baselines across the x axis. They are somewhat [controversial](http://www.visualisingdata.com/index.php/2010/08/making-sense-of-streamgraphs/) but have a “draw you in” aesthetic appeal (which is what we needed for our visualization).

You can make streamgraphs/stacked area charts pretty easily in D3, and since we needed to try many different sets of data in the streamgraph style, it made sense to make this an R [htmlwidget](http://www.htmlwidgets.org/). Thus, the [streamgraph package](https://github.com/hrbrmstr/streamgraph) was born.

### Making a streamgraph

The package isn’t on CRAN yet, so you have to do the `devtools` dance:

devtools::install_github("hrbrmstr/streamgraph")

Streamgraphs require a continuous variable for the x axis, and the `streamgraph` widget/package works with years or dates (support for `xts` objects and `POSIXct` types coming soon). Since they display categorical values in the area regions, the data in R needs to be in [long format](http://blog.rstudio.org/2014/07/22/introducing-tidyr/) which is easy to do with `dplyr` & `tidyr`.
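If your data starts out wide, the conversion is a one-liner; here’s a toy example (the data is made up purely for illustration):

library(tidyr)
 
# toy wide data: one column per category
wide <- data.frame(year=2010:2012, comedy=c(5, 7, 6), drama=c(9, 8, 11))
long <- gather(wide, genre, n, -year)  # long format: year, genre, n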

The package recognizes when years are being used and does all the necessary conversions for you. It also uses a technique similar to `expand.grid` to ensure all categories are represented at every observation (not doing so makes `d3.stack` unhappy).
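The gap-filling works roughly like the following sketch (illustrative only, not the package’s actual code; it reuses the toy `long` frame from above):

# drop a row to simulate a missing (year, genre) observation
gappy <- long[-1, ]
 
# ensure every (year, genre) pair exists, filling the gap with 0
all_pairs <- expand.grid(year=unique(gappy$year), genre=unique(gappy$genre))
filled <- merge(all_pairs, gappy, by=c("year", "genre"), all.x=TRUE)
filled$n[is.na(filled$n)] <- 0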

Let’s start by making a `streamgraph` of the number of movies made per year by genre using the `ggplot2` `movies` dataset:

library(streamgraph)
library(dplyr)
 
ggplot2::movies %>%
  select(year, Action, Animation, Comedy, Drama, Documentary, Romance, Short) %>%
  tidyr::gather(genre, value, -year) %>%
  group_by(year, genre) %>%
  tally(wt=value) %>%
  streamgraph("genre", "n", "year") %>%
  sg_axis_x(20) %>%
  sg_fill_brewer("PuOr") %>%
  sg_legend(show=TRUE, label="Genres: ")

Movie count by genre by year

We can also mimic an example from the [Name Voyager](http://www.bewitched.com/namevoyager.html) project (using the `babynames` R package) but change some of the aesthetics, just to give an idea of how some of the options work:

library(dplyr)
library(babynames)
library(streamgraph)
 
babynames %>%
 filter(grepl("^(Alex|Bob|Jay|David|Mike|Jason|Stephen|Kymberlee|Lane|Sophie|John|Andrew|Thibault|Russell)$", name)) %>%
  group_by(year, name) %>%
  tally(wt=n) %>%
  streamgraph("name", "n", "year", offset="zero", interpolate="linear") %>%
  sg_legend(show=TRUE, label="DDSec names: ")

Data-Driven Security Podcast guest+host names by year

There are more examples over at [RPubs](http://rpubs.com/hrbrmstr/streamgraph04) and [github](http://hrbrmstr.github.io/streamgraph/), but I’ll close with a streamgraph of housing data originally made by [Alex Bresler](http://asbcllc.com/blog/2015/february/cre_stream_graph_test/):

dat <- read.csv("http://asbcllc.com/blog/2015/february/cre_stream_graph_test/data/cre_transaction-data.csv")
 
dat %>%
  streamgraph("asset_class", "volume_billions", "year") %>%
  sg_axis_x(1, "year", "%Y") %>%
  sg_fill_brewer("PuOr") %>%
  sg_legend(show=TRUE, label="Assets: ")

Commercial Real Estate Transaction Volume by Asset Class Since 2006

While the radical volume change would have been noticeable in almost any graph style, it’s especially noticeable with the streamgraph version as your eyes tend to naturally follow the curves of the flow.

### Fin

While I wouldn’t have these replace my trusty ggplot2 faceted bar charts for regular EDA and reporting, streamgraphs can add a bit of color and flair, and may be an especially good choice when you need to view many categorical variables over time.

As usual, issues/feature requests on [github](http://github.com/hrbrmstr/streamgraph) and showcase/general feedback in the comments.

I felt compelled to dust off my 2013 [Valentine’s Day](http://rud.is/b/2013/02/14/happy-valentines-day-mrshrbrmstr/) `#rstats` post and make it all Shiny and new again. I used the same math from that post, but made the polygon a bit sharper and used ggplot2 for the plotting.
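For reference, a common parametric heart (not necessarily the exact math from the old post) plots in ggplot2 like so:

library(ggplot2)
 
# a common parametric heart curve, drawn as a polygon
t <- seq(0, 2*pi, length.out=200)
heart <- data.frame(x=16*sin(t)^3,
                    y=13*cos(t) - 5*cos(2*t) - 2*cos(3*t) - cos(4*t))
ggplot(heart, aes(x, y)) +
  geom_polygon(fill="#f77fbe") +
  coord_equal() +
  theme_void()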

To kick it up a bit, I decided to pay homage to a local company (Necco) that makes the venerable [Sweethearts candies](http://www.necco.com/candy/sweethearts.aspx) that are popular this time of year (at least in the U.S.). The heart color is randomized to take on one of the signature pastels of the candy and I took the various sayings that have been on their hearts over the years (except 2006, which was a strange year) and randomized them as well. The plot also displays the year of the saying.

I wrapped it all up in a Shiny bow, so all you have to do to get a new heart/saying is refresh the page!

To play with it, you can:

– refresh this post (since the `iframe` below is pointing to the shiny app)
– refresh the target [shinyapps.io app](https://hrbrmstr.shinyapps.io/SweetheaRstats/)
– run it locally via `shiny::runGist("cf34f7230c88bd99b153")`

NOTE: You’ll probably want/need to run it locally. I only have a free ShinyApps account and it’ll probably run out of free CPU time depending on when you’re reading this post.

The surprisingly small bit of code is in [this gist](https://gist.github.com/hrbrmstr/cf34f7230c88bd99b153).

Now there’s even one more way to have a data-driven romance!

Oh, and Happy Valentine’s Day @mrshrbrmstr!!

I received an out-of-band question on the use of `%<>%` in my [CDC FluView](rud.is/b/2015/01/10/new-r-package-cdcfluview-retrieve-flu-data-from-cdcs-fluview-portal/) post, and took the opportunity to address it in a broader, public fashion.

Anyone using R knows that the two most common methods of assignment are the venerable (and sensible) left arrow `<-` and its lesser cousin `=`. `<-` has an evil sibling, `<<-`, which is used when you want/need to have R search through parent environments for an existing definition of the variable being assigned (up to the global environment; there’s a short illustration after the next example). Since the introduction of the “piping idiom” (`%>%`) made popular by `magrittr`, `dplyr`, `ggvis` and other packages, I have struggled with the use of `<-` in pipes. Since pipes flow data in a virtual forward motion, that LHS (left hand side) assignment reads awkwardly. Furthermore, many times you are piping from an object with the intent to replace the contents of said object. For example:

iris$Sepal.Length <- 
  iris$Sepal.Length %>%
  sqrt

(which is from the `magrittr` documentation).
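As the quick aside promised above, here’s the parent-environment behavior of `<<-` in action:

counter <- 0
bump <- function() counter <<- counter + 1  # <<- finds `counter` in the global env
bump(); bump()
counter  # 2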

To avoid the repetition of the left-hand side immediately after the assignment operator, Bache & Wickham came up with the `%<>%` operator, which shortens the above to:

iris$Sepal.Length %<>% sqrt

Try as I might (including in the CDC FluView blog post), that way of assigning variables still _feels_ awkward, and is definitely confusing to new R users. But what’s the alternative? I believe it’s R’s infrequently used `->` RHS assignment operator.

Let’s look at that in the context of the somewhat-long pipe in the CDC FluView example:

dat %>%
  mutate(REGION=factor(REGION,
                       levels=unique(REGION),
                       labels=c("Boston", "New York",
                                "Philadelphia", "Atlanta",
                                "Chicago", "Dallas",
                                "Kansas City", "Denver",
                                "San Francisco", "Seattle"),
                       ordered=TRUE)) %>%
  mutate(season_week=ifelse(WEEK>=40, WEEK-40, WEEK),
         season=ifelse(WEEK<40,
                       sprintf("%d-%d", YEAR-1, YEAR),
                       sprintf("%d-%d", YEAR, YEAR+1))) -> dat

That pipe flow says _”take `dat`, change up some columns, make some new columns and reassign into `dat`”_. It’s a very natural flow and reads well, too, since you’re following a process to its final destination. It’s even more natural in pipes that actually transform the data into something else. For example, to get a vector of the number of US male births since 1880, we’d do:

library(magrittr)
library(rvest)
 
births <- html("http://www.ssa.gov/oact/babynames/numberUSbirths.html")
 
births %>%
  html_nodes("table") %>%
  extract2(2) %>%
  html_table %>%
  use_series(Male) %>%
  gsub(",", "", .) %>%
  as.numeric -> males

That’s very readable (one of the benefits of pipes) and the flow, again, makes sense. Compare that to its base R counterpart:

males <- as.numeric(gsub(",", "", html_table(html_nodes(births, "table")[[2]])$Male))

The base R version is short and the LHS assignment fits well as the values “pop out” of the function calls. But it’s also only quickly readable to veteran R folks. Since code needs to be readable, maintainable and (often times) shared with folks on a team, I believe the pipes help increase overall productivity and aid in documenting what is trying to be achieved in that portion of an analysis (especially when combined with `dplyr` idioms).

Pipes are here to stay and they are definitely a part of my data analysis workflows. Moving forward, so will RHS (`->`) assignments from pipes.