Visualizing Historical & Most-likely First Snowfall Dates for U.S. Regions

UPDATE: You can now run this as a local Shiny app by entering shiny::runGist("95ec24c1b0cb433a76a5", launch.browser=TRUE) at an R prompt (provided all the dependent libraries (below) are installed) or use it interactively over at Shiny Apps.

The impending arrival of the first real snowfall of the year in my part of Maine got me curious about what the most likely “first snow” dates are for my region. The U.S. Historical Climatology Network (USHCN) maintains [historical daily climate records](http://cdiac.ornl.gov/epubs/ndp/ushcn/daily_doc.html) for each station in each state and has data (for some stations) going back as far as the 1800’s. A quick look at their [data files](http://cdiac.ornl.gov/ftp/ushcn_daily/) indicated that they would definitely help satiate my curiosity (and make for a late night of cranking out some R code and ggplot visualizations).

To start, we’ll need a bit more than base R to get the job done:

library(pbapply)
library(data.table)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(stringi)

In all honesty, `pbapply`, `dplyr` and `stringi` are not necessary, but they definitely make life easier by (respectively) giving us:

– free progress bars for `*apply` operations,
– high efficacy data manipulation idioms, and
– a handy utility for converting strings to title case.

With setup out of the way, the first real task is to see which observer station is closest to my area. To figure that out we need to read in the station data file which is, sadly, in fixed-width format. Some stations have `#` characters in their titles, to we have to account for that when we call `read.fwf`. After reading in the station database we use a naive–but-usable distance calculation to find the closest station:

stations <- read.fwf("data/ushcn-stations.txt",
                     widths=c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                     col.names=c("coop_id", "latitude", "longitude", "elevation",
                                 "state", "name", "component_1", "component_2",
                                 "component_3", "utc_offset"),
                     colClasses=c("character", "numeric", "numeric", "numeric",
                                  "character", "character", "character", "character",
                                  "character", "character"),
                     comment.char="", strip.white=TRUE)
 
# not a great circle, but it gets the job done here
closestStation <- function(stations, lat, lon) {
  index <- which.min(sqrt((stations$latitude-lat)^2 +
                          (stations$longitude-lon)^2))
  stations[index,]
}
 
# what's the closest station?
closestStation(stations, 43.2672, -70.8617)
 
##     coop_id latitude longitude elevation state   name component_1 component_2 component_3 utc_offset
633  272174    43.15    -70.95      24.4    NH DURHAM      ------      ------      ------         +5

As a Mainer, I’m not thrilled that this is the actual, closest station, so we’ll also see what the closest one is in Maine:

closestStation(stations %>% filter(state=="ME"), 43.2672, -70.8617)
##    coop_id latitude longitude elevation state             name component_1 component_2 component_3 utc_offset
10  176905  43.6497  -70.3003      13.7    ME PORTLAND JETPORT      ------      ------      ------         +5

The analysis is easy enough to do for both, so we’ll first take a look at Durham, New Hampshire then do the exact same valuation for Portland, Maine.

Despite being fixed-width, the station database was not too difficult to wrangle. The state-level files that contain the readings are another matter:

Variable	Columns	Type
COOP ID	1-6	Character
YEAR	7-10	Integer
MONTH	11-12	Integer
ELEMENT	13-16	Character
VALUE1	17-21	Integer
MFLAG1	22	Character
QFLAG1	23	Character
SFLAG1	24	Character
VALUE2	25-29	Integer
MFLAG2	30	Character
QFLAG2	31	Character
SFLAG2	32	Character
…	…	…
VALUE31	257-261	Integer
MFLAG31	262	Character
QFLAG31	263	Character
SFLAG31	264	Character

We have fixed-width, wide-format records with 31 days for each month, which proves the existence of truly insidious people in the world. Rather than use `read.fwf` again, we’ll take a different approach (since we ultimately need the data in long format) and use `readLines` to read in all the records from the NH data file, then filter out everything but snowfall entries from the station we’re interested in.

Next, we setup nested `lapply` calls to build a long data frame from each month then combine them all together into a single data frame:

snow <- readLines("data/state27_NH.txt")
 
snow <- grep("SNOW", snow, value=TRUE)
snow <- grep("^272174", snow, value=TRUE)
 
snow_dat <- rbindlist(pblapply(snow, function(x) {
 
  rbindlist(lapply(1:31, function(i) {
 
    # record format described here:
    # http://cdiac.ornl.gov/ftp/ushcn_daily/data_format.txt
 
    start <- 17 + (i-1)*8
 
    list(coop_id=substr(x, 1, 6),
         date=sprintf("%s-%02d-%02d", substr(x, 7, 10), as.numeric(substr(x, 11, 12)), i),
         element=substr(x, 13, 16),
         value=as.numeric(substr(x, start, start+4)),
         mflag=substr(x, start+5, start+5),
         qflag=substr(x, start+6, start+6),
         sflag=substr(x, start+7, start+7))
 
  }))
 
}))

Now, we’ll clean up the records even further by removing invalid entries (those with a `value` == `-9999`) and convert record dates to actual `Date` objects and filter out invalid dates:

snow_dat <- snow_dat %>% filter(value != -9999)
 
# since the data file has 31 days for each records regardless of whether
# that's valid or not we do a shortcut to remove invalid dates by doing the
# a vectorized Date conversion, then removing records with NA dates
 
snow_dat$date <- as.Date(snow_dat$date)
snow_dat <- snow_dat %>% filter(!is.na(date))
 
# having the year extracted is handy for filtering
snow_dat$year <- format(snow_dat$date, "%Y")

Given that Winter in the U.S. spans across two calendar years, we need a way to keep dates in January-May associated with the previous year (yes, that adds an inherent assumption that no first snow is in June, which might not hold true for Alaska). To facilitate this, we’ll convert each date to its corresponding day of year value then add the number of total days in the start year to those values for all months <= May. We really do need to do this, too, since there are many cases where the first snowfall will be in January-March for many states.

snow_dat$doy <- as.numeric(format(snow_dat$date, "%j"))
snow_dat$doy <- ifelse(snow_dat$doy<=180,
                       snow_dat$doy + as.numeric(format(as.Date(sprintf("%s-12-31", snow_dat$year)), "%j")),
                       snow_dat$doy)

Now, the fun begins. We use (mostly) `dplyr` to extract the first snowfall day from each year, then make a dot-line plot from the data:

first <- snow_dat %>%
  filter(value>0) %>%                           # ignore 0 values
  filter(date>=as.Date("1950-01-01")) %>%       # start at 1950 (arbitrary)
  merge(stations, by="coop_id", all.x=TRUE) %>% # merge station details
  group_by(coop_id, year) %>%                   # group by station and year
  arrange(doy) %>%                              # sort by our munged day of year
  filter(row_number(doy) == 1) %>%              # grab the first entry by group
  select(name, state, date, value, doy)         # we only need some variables
 
title_1 <- sprintf("First observed snowfall (historical) at %s, %s", stri_trans_totitle(unique(first$name)), unique(first$state))
 
gg <- ggplot(first, aes(y=year, x=doy))
gg <- gg + geom_segment(aes(xend=min(first$doy)-20, yend=year), color="#9ecae1", size=0.25)
gg <- gg + geom_point(aes(color=coop_id), shape=8, size=3, color="#3182bd")
gg <- gg + geom_text(aes(label=format(date, "%b-%d")), size=3, hjust=-0.2)
gg <- gg + scale_x_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + labs(x=NULL, y=NULL, title=title_1)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(color="#08306b"))
by_year <- gg

While that will help us see the diversity across years, we have to do quite a bit of eye tracking to get the most likely date range for the first snowfall, so we’ll add a boxplot into the mix and use `summary` to figure out the quartiles (for labeling the chart) for the actual date values:

wx_range <- summary(as.Date(format(first$date, "2013-%m-%d")))
names(wx_range) <- NULL
min_wx <- gsub("2013-", "", wx_range[2])
max_wx <- gsub("2013-", "", wx_range[5])
 
title_2 <- sprintf("Most likely first snowfall will be between %s & %s", min_wx, max_wx)
 
gg <- ggplot(first %>% mutate(name="0000"), aes(name, doy))
gg <- gg + geom_boxplot(fill="#3182bd", color="#08306b", outlier.colour="#08306b")
gg <- gg + scale_y_continuous(expand=c(0, 0),
                              limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title=title_2)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_line(color="white"))
gg <- gg + theme(axis.text.y=element_text(color="white"))
gg <- gg + theme(plot.title=element_text(size=11))
box_wx <- gg

Finally, we’ll combine them together to get the finished product:

grid.arrange(by_year, box_wx, nrow=2, heights=unit(c(0.9, 0.1), "npc"))

And, do the same for Portland:

Click for larger

There are many more analyses and visualizations that can be performed on these data sets, but be wary when creating them as I’ve seen a few files with fixed-width formatting errors and have also noticed missing records for some observer stations.

You can find the complete, commented code up on [github](https://github.com/hrbrmstr/snowfirst).

4 Trackbacks/Pingbacks

By Security-Vision » Visualizing Historical & Most-likely First Snowfall Dates for U.S. Regions on 26 Nov 2014 at 3:34 am

[…] climate records for each station in each state and has data (for some stations) going […], Security Bloggers Network, hrbrmstr, […]
By R⁶ — Snow Day Facets | rud.is on 01 Apr 2017 at 8:26 pm

[…] in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017 and we’re slated to get 12-18″ of snow […]
By R⁴ — Snow Day Facets – Mubashir Qasim on 01 Apr 2017 at 10:18 pm

[…] in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017 and we’re slated to get 12-18″ of snow up […]
By R⁴ — Snow Day Facets | A bunch of data on 02 Apr 2017 at 5:54 am

[…] in 2014 I blogged about first snowfall dates for a given U.S. state. It’s April 1, 2017 and we’re slated to get 12-18″ of snow up here in […]

rud.is