NASA GISS’s Annual Global Temperature Anomaly Trends (dplyr/ggplot version)

D Kelly O’Day did a great post on charting NASA’s Goddard Institute for Space Studies (GISS) temperature anomaly data, but it sticks with base R for data munging & plotting. While there’s absolutely nothing wrong with base R operations, I thought a modern take on the chart using dplyr, magrittr & tidyr for data manipulation and ggplot2 for formatting would be helpful for the scores of new folk learning R this year (our little language is becoming all the rage, it seems). I also really enjoy working with weather data.

Before further exposition, here’s the result:


I made liberal use of the “piping” idiom encouraged magrittr, dplyr and other new R packages, including the forward assignment operator -> (which may put some folks off a bit). That also meant using magrittr‘s aliases for [ and [[, which are more readable in pipes.

I don’t use library(tidyr) since tidyr‘s extract conflicts with magrittr‘s, but you’ll see a tidyr::gather in the code for wide-to-long data shaping.

I chose to use the monthly temperature anomaly data as a base layer in the chart as a contrast to the monthly- and annual-anomaly means. I also marked the hottest annual- and annual-mean anomalies and framed the decades with vertical markers.

There are no hardcoded years or decades anywhere in the ggplot2 code, so this should be quite reusable as the data source gets updated.

As I come back to the chart, I think there may be a bit too much “chart junk” on it, but you can tweak it to your own aesthetic preferences (if you do, drop a note in the comments with a link to your creation).

The code is below and in this gist.

# data retrieval ----------------------------------------------------------
# the user agent string was necessary for me; YMMV
pg <- GET("",
          user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A"))
# extract monthly data ----------------------------------------------------
content(pg, as="text") %>%
  strsplit("\n") %>%
  extract2(1) %>%
  grep("^[[:digit:]]", ., value=TRUE) -> lines
# extract column names ----------------------------------------------------
content(pg, as="text") %>%
  strsplit("\n") %>%
  extract2(1) %>%
  extract(8) %>%
  strsplit("\ +") %>%
  extract2(1) -> lines_colnames
# make data frame ---------------------------------------------------------
data <- read.table(text=lines, stringsAsFactors=FALSE)
colnames(data) <- lines_colnames
# transform data frame ----------------------------------------------------
data %>%
  tidyr::gather(month, value, Jan, Feb, Mar, Apr, May, Jun,
                       Jul, Aug, Sep, Oct, Nov, Dec) %>%     # wide to long
  mutate(value=value/100) %>%                                # convert to degree Celcius change
  select(year=Year, month, value) %>%                        # only need these fields
  mutate(date=as.Date(sprintf("%d-%d-%d", year, month, 1)),  # make proper dates
         decade=year %/% 10,                                 # calc decade
         start=decade*10, end=decade*10+9) %>%               # calc decade start/end
  group_by(decade) %>%
    mutate(decade_mean=mean(value)) %>%                      # calc decade mean
  group_by(year) %>%
    mutate(annum_mean=mean(value)) %>%                       # calc annual mean
  ungroup -> data
# start plot --------------------------------------------------------------
gg <- ggplot()
# decade vertical markers -------------------------------------------------
gg <- gg + geom_vline(data=data %>% select(end),
                      aes(xintercept=as.numeric(as.Date(sprintf("%d-12-31", end)))),
                          size=0.5, color="#4575b4", linetype="dotted", alpha=0.5)
# monthly data ------------------------------------------------------------
gg <- gg + geom_line(data=data, aes(x=date, y=value, color="monthly anomaly"),
                     size=0.35, alpha=0.25)
gg <- gg + geom_point(data=data, aes(x=date, y=value, color"monthly anomaly"),
                      size=0.75, alpha=0.5)
# decade mean -------------------------------------------------------------
gg <- gg + geom_segment(data=data %>% distinct(decade, decade_mean, start, end),
                        aes(x=as.Date(sprintf("%d-01-01", start)),
                            xend=as.Date(sprintf("%d-12-31", end)),
                            y=decade_mean, yend=decade_mean,
                            color="decade mean anomaly"),
# annual data -------------------------------------------------------------
gg <- gg + geom_line(data=data %>% distinct(year, annum_mean),
                      aes(x=as.Date(sprintf("%d-06-15", year)), y=annum_mean,
                          color="annual mean anomaly"),
gg <- gg + geom_point(data=data %>% distinct(year, annum_mean),
                      aes(x=as.Date(sprintf("%d-06-15", year)), y=annum_mean,
                          color="annual mean anomaly"),
# additional annotations --------------------------------------------------
# max annual mean anomaly horizontal marker/text
gg <- gg + geom_hline(yintercept=max(data$annum_mean),  alpha=0.9,
                      color="#d73027", linetype="dashed", size=0.25)
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", mean(range(data$year)))),
                    color="#d73027", alpha=0.9,
                    hjust=0.25, vjust=-1, size=3,
                    label=sprintf("Max annual mean anomaly %2.1fºC", max(data$annum_mean)))
gg <- gg + geom_hline(yintercept=max(data$value),  alpha=0.9,
                      color="#7f7f7f", linetype="dashed", size=0.25)
# max annual anomaly horizontal marker/text
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", mean(range(data$year)))),
                    color="#7f7f7f",  alpha=0.9,
                    hjust=0.25, vjust=-1, size=3,
                    label=sprintf("Max annual anomaly %2.1fºC", max(data$value)))
gg <- gg + annotate("text",
                    x=as.Date(sprintf("%d-12-31", range(data$year)[2])),
                    y=min(data$value), size=3, hjust=1,
# set colors --------------------------------------------------------------
gg <- gg + scale_color_manual(name="", values=c("#d73027", "#4575b4", "#7f7f7f"))
# set x axis limits -------------------------------------------------------
gg <- gg + scale_x_date(expand=c(0, 1),
                        limits=c(as.Date(sprintf("%d-01-01", range(data$year)[1])),
                                 as.Date(sprintf("%d-12-31", range(data$year)[2]))))
# add labels --------------------------------------------------------------
gg <- gg + labs(x=NULL, y="GLOBAL Temp Anomalies in 1.0ºC",
                title=sprintf("GISS Land and Sea Temperature Annual Anomaly Trend (%d to %d)\n",
                              range(data$year)[1], range(data$year)[2]))
# theme/legend tweaks -----------------------------------------------------
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(legend.position=c(0.9, 0.2))
gg <- gg + theme(legend.key=element_blank())
gg <- gg + theme(legend.background=element_blank())

New R Package: cdcfluview — Retrieve Flu Data from CDC’s FluView Portal

NOTE If there’s a particular data set from that you want and that isn’t in the pacakge, please file it as an issue and be as specific as you can (screen shot if possible).

Towards the end of 2014 I had been tinkering with flu data from the CDC’s FluView portal since flu reports began to look like this season was going to go the way of 2009.

While you can track the flu over at The Washington Post, I like to work with data on my own. However the CDC’s portal is Flash-driven and there was no obvious way to get the data files programmatically. This is unfortunate, since there are weekly updates to the data set.

As an information security professional, one of the tools in my arsenal is Burp Proxy, which is an application that—amongst other things—lets you configure a local proxy server for your browser and inspect all web requests. By using this tool, I was able to discern that the Flash portal calls out to with custom POST form parameters (that I also mapped out) to make the data sets it delivers back to the user.

With that information in hand, I whipped together a small R package: cdcfluview to interface with the same server the FluView Portal does. It has a singular function – get_flu_data that lets you choose between different region/sub-region breakdowns and also whether you want data from WHO, ILINet (or both). It also lets you pick which years you want data for.

One reason I wanted to work with the data was to see just how this season differs from previous ones. The view I’ll leave on the blog this time—mostly as an example of how to use the package—is a faceted chart, by CDC region and CDC week showing this season (in red) as it relates to previous ones.

dat <- get_flu_data(region="hhs", 
dat %<>%
                       labels=c("Boston", "New York",
                                "Philadelphia", "Atlanta",
                                "Chicago", "Dallas",
                                "Kansas City", "Denver",
                                "San Francisco", "Seattle"),
                       ordered=TRUE)) %>%
  mutate(season_week=ifelse(WEEK>=40, WEEK-40, WEEK),
                       sprintf("%d-%d", YEAR-1, YEAR),
                       sprintf("%d-%d", YEAR, YEAR+1)))
prev_years <- dat %>% filter(season != "2014-2015")
curr_year <- dat %>% filter(season == "2014-2015")
curr_week <- tail(dat, 1)$season_week
gg <- ggplot()
gg <- gg + geom_point(data=prev_years,
                      aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                      color="#969696", size=1, alpa=0.25)
gg <- gg + geom_point(data=curr_year,
                      aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                      color="red", size=1.25, alpha=1)
gg <- gg + geom_line(data=curr_year, 
                     aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                     size=1.25, color="#d7301f")
gg <- gg + geom_vline(xintercept=curr_week, color="#d7301f", size=0.5, linetype="dashed", alpha=0.5)
gg <- gg + facet_wrap(~REGION, ncol=3)
gg <- gg + labs(x=NULL, y="Weighted ILI Index", 
                title="ILINet - 1999-2015 year weighted flu index history by CDC region\nWeek Ending Jan 3, 2015 (Red == current season)\n")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())


(You can see an SVG version of that plot here)

Even without looking at the statistics, it’s pretty easy to tell that this is fixing to be a pretty bad season in many regions.

State-level data

Soon after this post I found the state-level API for the CDC FluView interface and added a get_state_data function for it:

get_state_data() %>%
  filter(WEEKEND=="Jan-03-2015") %>%
  select(state=STATENAME, value=ACTIVITY.LEVEL) %>%
  filter(!(state %in% c("Puerto Rico", "New York City"))) %>% # need to add NYC & PR to statebins
  mutate(value=as.numeric(gsub("Level ", "", value))) %>%
  statebins(brewer_pal="RdPu", breaks=4,
            labels=c("Minimal", "Low", "Moderate", "High"),
            legend_position="bottom", legend_title="ILI Activity Level") +
  ggtitle("CDC State FluView (2014-01-03)")


As always, post bugs or feature requests on the github repo and drop a note here if you’ve found the package useful or have some other interesting views or analyses to share.