Visualizing Historical & Most-likely First Snowfall Dates for U.S. Regions

UPDATE: You can now run this as a local Shiny app by entering shiny::runGist("95ec24c1b0cb433a76a5", launch.browser=TRUE) at an R prompt (provided all the dependent libraries (below) are installed) or use it interactively over at Shiny Apps.

The impending arrival of the first real snowfall of the year in my part of Maine got me curious about what the most likely “first snow” dates are for my region. The U.S. Historical Climatology Network (USHCN) maintains historical daily climate records for each station in each state and has data (for some stations) going back as far as the 1800’s. A quick look at their data files indicated that they would definitely help satiate my curiosity (and make for a late night of cranking out some R code and ggplot visualizations).

To start, we’ll need a bit more than base R to get the job done:


In all honesty, pbapply, dplyr and stringi are not necessary, but they definitely make life easier by (respectively) giving us:

  • free progress bars for *apply operations,
  • high efficacy data manipulation idioms, and
  • a handy utility for converting strings to title case.

With setup out of the way, the first real task is to see which observer station is closest to my area. To figure that out we need to read in the station data file which is, sadly, in fixed-width format. Some stations have # characters in their titles, to we have to account for that when we call read.fwf. After reading in the station database we use a naive–but-usable distance calculation to find the closest station:

stations <- read.fwf("data/ushcn-stations.txt",
                     widths=c(6, 9, 10, 7, 3, 31, 7, 7, 7, 3),
                     col.names=c("coop_id", "latitude", "longitude", "elevation",
                                 "state", "name", "component_1", "component_2",
                                 "component_3", "utc_offset"),
                     colClasses=c("character", "numeric", "numeric", "numeric",
                                  "character", "character", "character", "character",
                                  "character", "character"),
                     comment.char="", strip.white=TRUE)
# not a great circle, but it gets the job done here
closestStation <- function(stations, lat, lon) {
  index <- which.min(sqrt((stations$latitude-lat)^2 +
# what's the closest station?
closestStation(stations, 43.2672, -70.8617)
##     coop_id latitude longitude elevation state   name component_1 component_2 component_3 utc_offset
633  272174    43.15    -70.95      24.4    NH DURHAM      ------      ------      ------         +5

As a Mainer, I’m not thrilled that this is the actual, closest station, so we’ll also see what the closest one is in Maine:

closestStation(stations %>% filter(state=="ME"), 43.2672, -70.8617)
##    coop_id latitude longitude elevation state             name component_1 component_2 component_3 utc_offset
10  176905  43.6497  -70.3003      13.7    ME PORTLAND JETPORT      ------      ------      ------         +5

The analysis is easy enough to do for both, so we’ll first take a look at Durham, New Hampshire then do the exact same valuation for Portland, Maine.

Despite being fixed-width, the station database was not too difficult to wrangle. The state-level files that contain the readings are another matter:

Variable Columns Type
COOP ID 1-6 Character
YEAR 7-10 Integer
MONTH 11-12 Integer
ELEMENT 13-16 Character
VALUE1 17-21 Integer
MFLAG1 22 Character
QFLAG1 23 Character
SFLAG1 24 Character
VALUE2 25-29 Integer
MFLAG2 30 Character
QFLAG2 31 Character
SFLAG2 32 Character
VALUE31 257-261 Integer
MFLAG31 262 Character
QFLAG31 263 Character
SFLAG31 264 Character

We have fixed-width, wide-format records with 31 days for each month, which proves the existence of truly insidious people in the world. Rather than use read.fwf again, we’ll take a different approach (since we ultimately need the data in long format) and use readLines to read in all the records from the NH data file, then filter out everything but snowfall entries from the station we’re interested in.

Next, we setup nested lapply calls to build a long data frame from each month then combine them all together into a single data frame:

snow <- readLines("data/state27_NH.txt")
snow <- grep("SNOW", snow, value=TRUE)
snow <- grep("^272174", snow, value=TRUE)
snow_dat <- rbindlist(pblapply(snow, function(x) {
  rbindlist(lapply(1:31, function(i) {
    # record format described here:
    start <- 17 + (i-1)*8
    list(coop_id=substr(x, 1, 6),
         date=sprintf("%s-%02d-%02d", substr(x, 7, 10), as.numeric(substr(x, 11, 12)), i),
         element=substr(x, 13, 16),
         value=as.numeric(substr(x, start, start+4)),
         mflag=substr(x, start+5, start+5),
         qflag=substr(x, start+6, start+6),
         sflag=substr(x, start+7, start+7))

Now, we’ll clean up the records even further by removing invalid entries (those with a value == -9999) and convert record dates to actual Date objects and filter out invalid dates:

snow_dat <- snow_dat %>% filter(value != -9999)
# since the data file has 31 days for each records regardless of whether
# that's valid or not we do a shortcut to remove invalid dates by doing the
# a vectorized Date conversion, then removing records with NA dates
snow_dat$date <- as.Date(snow_dat$date)
snow_dat <- snow_dat %>% filter(!
# having the year extracted is handy for filtering
snow_dat$year <- format(snow_dat$date, "%Y")

Given that Winter in the U.S. spans across two calendar years, we need a way to keep dates in January-May associated with the previous year (yes, that adds an inherent assumption that no first snow is in June, which might not hold true for Alaska). To facilitate this, we’ll convert each date to its corresponding day of year value then add the number of total days in the start year to those values for all months <= May. We really do need to do this, too, since there are many cases where the first snowfall will be in January-March for many states.

snow_dat$doy <- as.numeric(format(snow_dat$date, "%j"))
snow_dat$doy <- ifelse(snow_dat$doy<=180,
                       snow_dat$doy + as.numeric(format(as.Date(sprintf("%s-12-31", snow_dat$year)), "%j")),

Now, the fun begins. We use (mostly) dplyr to extract the first snowfall day from each year, then make a dot-line plot from the data:

first <- snow_dat %>%
  filter(value>0) %>%                           # ignore 0 values
  filter(date>=as.Date("1950-01-01")) %>%       # start at 1950 (arbitrary)
  merge(stations, by="coop_id", all.x=TRUE) %>% # merge station details
  group_by(coop_id, year) %>%                   # group by station and year
  arrange(doy) %>%                              # sort by our munged day of year
  filter(row_number(doy) == 1) %>%              # grab the first entry by group
  select(name, state, date, value, doy)         # we only need some variables
title_1 <- sprintf("First observed snowfall (historical) at %s, %s", stri_trans_totitle(unique(first$name)), unique(first$state))
gg <- ggplot(first, aes(y=year, x=doy))
gg <- gg + geom_segment(aes(xend=min(first$doy)-20, yend=year), color="#9ecae1", size=0.25)
gg <- gg + geom_point(aes(color=coop_id), shape=8, size=3, color="#3182bd")
gg <- gg + geom_text(aes(label=format(date, "%b-%d")), size=3, hjust=-0.2)
gg <- gg + scale_x_continuous(expand=c(0, 0), limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + labs(x=NULL, y=NULL, title=title_1)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.text.y=element_text(color="#08306b"))
by_year <- gg

While that will help us see the diversity across years, we have to do quite a bit of eye tracking to get the most likely date range for the first snowfall, so we’ll add a boxplot into the mix and use summary to figure out the quartiles (for labeling the chart) for the actual date values:

wx_range <- summary(as.Date(format(first$date, "2013-%m-%d")))
names(wx_range) <- NULL
min_wx <- gsub("2013-", "", wx_range[2])
max_wx <- gsub("2013-", "", wx_range[5])
title_2 <- sprintf("Most likely first snowfall will be between %s & %s", min_wx, max_wx)
gg <- ggplot(first %>% mutate(name="0000"), aes(name, doy))
gg <- gg + geom_boxplot(fill="#3182bd", color="#08306b", outlier.colour="#08306b")
gg <- gg + scale_y_continuous(expand=c(0, 0),
                              limits=c(min(first$doy)-20, max(first$doy)+20))
gg <- gg + coord_flip()
gg <- gg + labs(x=NULL, y=NULL, title=title_2)
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="none")
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(axis.ticks.y=element_line(color="white"))
gg <- gg + theme(axis.text.y=element_text(color="white"))
gg <- gg + theme(plot.title=element_text(size=11))
box_wx <- gg

Finally, we’ll combine them together to get the finished product:

grid.arrange(by_year, box_wx, nrow=2, heights=unit(c(0.9, 0.1), "npc"))


And, do the same for Portland:

bothClick for larger

There are many more analyses and visualizations that can be performed on these data sets, but be wary when creating them as I’ve seen a few files with fixed-width formatting errors and have also noticed missing records for some observer stations.

You can find the complete, commented code up on github.

Buy on AmazonDDS Blog
DDS PodcastAmazon Author Page

1 Comment Visualizing Historical & Most-likely First Snowfall Dates for U.S. Regions

  1. Pingback: Security-Vision » Visualizing Historical & Most-likely First Snowfall Dates for U.S. Regions

Leave a Reply