New R Package: cdcfluview — Retrieve Flu Data from CDC’s FluView Portal

NOTE: If there’s a particular data set from http://www.cdc.gov/flu/weekly/fluviewinteractive.htm that you want and that isn’t in the package, please file an issue and be as specific as you can (include a screenshot if possible).


Towards the end of 2014 I started tinkering with flu data from the CDC’s FluView portal, since flu reports had begun to suggest that this season would go the way of 2009.

While you can track the flu over at The Washington Post, I like to work with data on my own. However, the CDC’s portal is Flash-driven, and there was no obvious way to get the data files programmatically. This is unfortunate, since the data set is updated weekly.

As an information security professional, one of the tools in my arsenal is Burp Proxy, an application that (amongst other things) lets you configure a local proxy server for your browser and inspect all web requests. Using this tool, I was able to discern that the Flash portal calls out to http://gis.cdc.gov/grasp/fluview/FluViewPhase2CustomDownload.ashx with custom POST form parameters (which I also mapped out) to build the data sets it delivers back to the user.

With that information in hand, I whipped together a small R package, cdcfluview, to interface with the same server the FluView portal does. It has a single function, get_flu_data, that lets you choose between different region/sub-region breakdowns and whether you want data from WHO, ILINet, or both. It also lets you pick which years you want data for.
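
To make the mechanics a little more concrete, here is a minimal sketch of the kind of form POST involved. Only the endpoint URL comes from the text above. The form field names (RegionID, SubRegionsList, DataSources, SeasonsList) and both helper functions are assumptions made up for illustration; they are not necessarily what the portal expects or what the package does internally.

```r
# Hypothetical helper: collapse the user's choices into form fields.
# The field names are illustrative assumptions, not confirmed portal parameters.
build_flu_params <- function(region, sub_region, data_source, years) {
  list(RegionID=region,
       SubRegionsList=paste(sub_region, collapse=","),
       DataSources=paste(toupper(data_source), collapse=","),
       SeasonsList=paste(years, collapse=","))
}

# Hypothetical helper: send the form to the endpoint and save the response to disk.
# Needs the httr package and network access, so it is defined but not called here.
fetch_flu_file <- function(params, out_file=tempfile()) {
  httr::POST("http://gis.cdc.gov/grasp/fluview/FluViewPhase2CustomDownload.ashx",
             body=params, encode="form",
             httr::write_disk(out_file, overwrite=TRUE))
  out_file
}

params <- build_flu_params("hhs", 1:10, "ilinet", 2000:2014)
str(params)
```

Keeping the parameter building separate from the request makes it easy to inspect exactly what would be sent before touching the server.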

One reason I wanted to work with the data was to see just how this season differs from previous ones. The view I’ll leave on the blog this time (mostly as an example of how to use the package) is a chart faceted by CDC region and plotted by CDC week, showing this season (in red) against previous ones.

# devtools::install_github("hrbrmstr/cdcfluview") # if necessary
library(cdcfluview)
library(magrittr)
library(dplyr)
library(ggplot2)
 
dat <- get_flu_data(region="hhs", 
                    sub_region=1:10, 
                    data_source="ilinet", 
                    years=2000:2014)
 
dat %<>%
  mutate(REGION=factor(REGION,
                       levels=unique(REGION),
                       labels=c("Boston", "New York",
                                "Philadelphia", "Atlanta",
                                "Chicago", "Dallas",
                                "Kansas City", "Denver",
                                "San Francisco", "Seattle"),
                       ordered=TRUE)) %>%
  mutate(season_week=ifelse(WEEK>=40, WEEK-40, WEEK+13), # weeks 1-39 continue after weeks 40-53 instead of overlapping them
         season=ifelse(WEEK<40,
                       sprintf("%d-%d", YEAR-1, YEAR),
                       sprintf("%d-%d", YEAR, YEAR+1)))
 
prev_years <- dat %>% filter(season != "2014-2015")
curr_year <- dat %>% filter(season == "2014-2015")
 
curr_week <- tail(dat, 1)$season_week
 
gg <- ggplot()
gg <- gg + geom_point(data=prev_years,
                      aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                      color="#969696", size=1, alpha=0.25)
gg <- gg + geom_point(data=curr_year,
                      aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                      color="red", size=1.25, alpha=1)
gg <- gg + geom_line(data=curr_year, 
                     aes(x=season_week, y=X..WEIGHTED.ILI, group=season),
                     size=1.25, color="#d7301f")
gg <- gg + geom_vline(xintercept=curr_week, color="#d7301f", size=0.5, linetype="dashed", alpha=0.5)
gg <- gg + facet_wrap(~REGION, ncol=3)
gg <- gg + labs(x=NULL, y="Weighted ILI Index", 
                title="ILINet - 1999-2015 year weighted flu index history by CDC region\nWeek Ending Jan 3, 2015 (Red == current season)\n")
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(axis.ticks.x=element_blank())
gg <- gg + theme(axis.text.x=element_blank())
gg
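
As a quick check on the season labelling used in the code above, here is a minimal sketch on three made-up rows. The toy data frame is an assumption for illustration only and is not CDC data; it just shows how week 40 acts as the season boundary.

```r
# Toy rows standing in for the YEAR and WEEK columns (made-up values, not CDC data)
toy <- data.frame(YEAR=c(2014L, 2014L, 2015L),
                  WEEK=c(39L, 40L, 1L))

# Weeks 40 and later open a new season; earlier weeks belong to the season
# that started in the previous calendar year
toy$season <- ifelse(toy$WEEK<40,
                     sprintf("%d-%d", toy$YEAR-1L, toy$YEAR),
                     sprintf("%d-%d", toy$YEAR, toy$YEAR+1L))

print(toy)
```

Week 39 of 2014 lands in the 2013-2014 season, while week 40 of 2014 and week 1 of 2015 both land in 2014-2015.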

[Figure: faceted chart of the weighted ILI index by CDC region and week, current season in red]

(You can see an SVG version of that plot here)

Even without looking at the statistics, it’s pretty easy to tell that this is fixing to be a pretty bad season in many regions.

State-level data

Soon after this post I found the state-level API for the CDC FluView interface and added a get_state_data function for it:

library(statebins)
 
get_state_data() %>%
  filter(WEEKEND=="Jan-03-2015") %>%
  select(state=STATENAME, value=ACTIVITY.LEVEL) %>%
  filter(!(state %in% c("Puerto Rico", "New York City"))) %>% # need to add NYC & PR to statebins
  mutate(value=as.numeric(gsub("Level ", "", value))) %>%
  statebins(brewer_pal="RdPu", breaks=4,
            labels=c("Minimal", "Low", "Moderate", "High"),
            legend_position="bottom", legend_title="ILI Activity Level") +
  ggtitle("CDC State FluView (2015-01-03)")

[Figure: statebins map of ILI activity level by state]

As always, post bugs or feature requests on the GitHub repo and drop a note here if you’ve found the package useful or have some other interesting views or analyses to share.


9 Comments

  1. Pingback: New R Package: cdcfluview — Retrieve Flu Data from CDC’s FluView Portal | infopunk.org

  2. Pep

    Excellent post! I took the same approach for another institutional Flash website that queried GeoServer. I also made a small library that is not public yet, because one thing that bothers me is that it was somehow “reverse engineering”, and I’m not sure they would give permission for it, or whether they might change the server settings in the meantime. I believe web services are the future (if not the present) for delivering data to the public, but some institutions might not be happy with it. What are your thoughts about that?

  3. Nicholas Horton

    This is extremely cool: thanks for putting this together. Some minor notes:

    You might remind folks that they can install the package from github

    devtools::install_github("hrbrmstr/cdcfluview")

    There’s a missing “library(ggplot2)” in your code.

    Kudos for sharing such a timely and interesting post.

    Nick

  4. Pingback: cdcfluview – On The Way to “CRAN 7K” | rud.is

      1. Thiago Smith

        Awesome! Thank you for getting back to me. If you have any ideas on where to track city-level data (Seattle, for example), I would definitely appreciate it. I am just learning R and I had no idea R code could do what your code did. That’s really cool.

  5. MJ

    I am a high school student who is looking into doing some research on flu occurrences. I am not familiar with R packages. I was wondering whether, if I get the package installed and working, I will be able to access data on flu occurrences at the state level.
