hrbrmstr, Author at rud.is

Author Archives: hrbrmstr

Don't look at me…I do what he does — just slower. #rstats avuncular • ?Resistance Fighter • Cook • Christian • [Master] Chef des Données de Sécurité @ @rapid7

Workout Wednesday Redux (2017 Week 3)

I had started a “52 Vis” initiative back in 2016 to encourage folks to get practice making visualizations since that’s the only way to get better at virtually anything. Life got crazy, 52 Vis fell to the wayside and now there are more visible alternatives such as Makeover Monday and Workout Wednesday. They’re geared towards the “T” crowd (I’m not giving a closed source and locked-in-data product any more marketing than two links) but that doesn’t mean R, Python or other open-tool/open-data communities can’t join in for the ride and learning experience.

This week’s workout is a challenge to reproduce or improve upon a chart by Matt Stiles. You should go to both (give them the clicks and eyeballs they both deserve since they did great work). They both chose a line chart, but the whole point of these exercises is to try out new things to help you learn how to communicate better. I chose to use geom_segment() to make mini-column charts since that:

eliminates the giant rose-coloured rectangles that end up everywhere
helps show the differences a bit better (IMO), and
also helps highlight some of the states that have had more difficulties than others

Click/tap to “embiggen”. I kept the same dimensions that Andy did but unlike Matt’s creation this is a plain ol’ PNG as I didn’t want to deal with web fonts (I’m on a Museo Sans Condensed kick at the moment but don’t have it in my TypeKit config yet). I went with official annual unemployment numbers as they may be calculated/adjusted differently (I didn’t check, but I knew that data source existed, so I used it).

One reason I’m doing this is a quote on the Workout Wednesday post:

This will be a very tedious exercise. To provide some context, this took me 2-3 hours to create. Don’t get discouraged and don’t feel like you have to do it all in one sitting. Basically, try to make yours look identical to mine.

This took me 10 minutes to create in R:

#' ---
#' output:
#'  html_document:
#'    keep_md: true
#' ---
#+ message=FALSE
library(ggplot2)
library(hrbrmisc)
library(readxl)
library(tidyverse)

# Use official BLS annual unemployment data vs manually calculating the average
# Source: https://data.bls.gov/timeseries/LNU04000000?years_option=all_years&periods_option=specific_periods&periods=Annual+Data
read_excel("~/Data/annual.xlsx", skip=10) %>%
  mutate(Year=as.character(as.integer(Year)), Annual=Annual/100) -> annual_rate


# The data source Andy Kriebel curated for you/us: https://1drv.ms/x/s!AhZVJtXF2-tD1UVEK7gYn2vN5Hxn #ty Andy!
read_excel("~/Data/staadata.xlsx") %>%
  left_join(annual_rate) %>%
  filter(State != "District of Columbia") %>%
  mutate(
    year = as.Date(sprintf("%s-01-01", Year)),
    pct = (Unemployed / `Civilian Labor Force Population`),
    us_diff = -(Annual-pct),
    col = ifelse(us_diff<0,
               "Better than U.S. National Average",
               "Worse than U.S. National Average")
  ) -> df

credits <- "Notes: Excludes the District of Columbia. 2016 figure represents October rate.\nData: U.S. Bureau of Labor Statistics <https://www.bls.gov/lau/staadata.txt>\nCredit: Matt Stiles/The Daily Viz <thedailyviz.com>"

#+ state_of_us, fig.height=21.5, fig.width=8.75, fig.retina=2
ggplot(df, aes(year, us_diff, group=State)) +
  geom_segment(aes(xend=year, yend=0, color=col), size=0.5) +
  scale_x_date(expand=c(0,0), date_labels="'%y") +
  scale_y_continuous(expand=c(0,0), label=scales::percent, limit=c(-0.09, 0.09)) +
  scale_color_manual(name=NULL, expand=c(0,0),
                     values=c(`Better than U.S. National Average`="#4575b4",
                              `Worse than U.S. National Average`="#d73027")) +
  facet_wrap(~State, ncol=5, scales="free_x") +
  labs(x=NULL, y=NULL, title="The State of U.S. Jobs: 1976-2016",
       subtitle="Percentage points below or above the national unemployment rate, by state. Negative values represent unemployment rates\nthat were lower — or better, from a jobs perspective — than the national rate.",
       caption=credits) +
  theme_hrbrmstr_msc(grid="Y", strip_text_size=9) +
  theme(panel.background=element_rect(color="#00000000", fill="#f0f0f055")) +
  theme(panel.spacing=unit(0.5, "lines")) +
  theme(plot.subtitle=element_text(family="MuseoSansCond-300")) +
  theme(legend.position="top")

Swap out ~/Data for where you stored the files.

The “weird” looking comments enable me to spin the script and is pretty much just the inverse markup for knitr R Markdown documents. As the comments say, you should really thank Andy for curating the BLS data for you/us.

If I really didn’t pine over aesthetics it would have taken me 5 minutes (most of that was waiting for re-rendering). Formatting the blog post took much longer. Plus, I can update the data source and re-run this in the future without clicking anything. This re-emphasizes a caution I tell my students: beware of dragon droppings (“drag-and-drop data science/visualization tools”).

Hopefully you presently follow or will start following Workout Wednesday and Makeover Monday and dedicate some time to hone your skills with those visualization katas.

The Devil’s in the [Davos] Details — A quick look at this year’s WEF Global Risks Report

It’s Davos time again. Each year the World Economic Forum (WEF) gathers the global elite together to discuss how they’re going to shape our collective future. WEF also releases their annual Global Risks Report at the same time. I read it every year and have, in the past, borrowed some risk communication visualization idioms from it since — in theory — senior executives are supposed to be able to understand them (the report is designed for consumption by senior leaders across the globe).

I also read it to gauge what the general public groks about risks associated with cybersecurity (since that’s a core part of my day job). One way to do that is to see how many “cyber”-related things appear in the report and get rated in their top 30 risks section. They intuit these risks from conducting surveys across samples of many populations (this is their visualization for the composition):

This post is less about “cyber” and more about bringing three “details” out of the report.

Detail #1 : Methodology Matters

Page 65 of the PDF report indicates they modified both the underlying foundations for the initial Impact and Likelihood scales and mentions that:

It is worth noting that, as a consequence of the scale modification, the impact results cannot be compared with those of previous years.

One more time: you can’t compare this year over year. They’ve done this re-jiggering before, so unless you’re just looking at relative positions of named big risk buckets you really can’t glean anything by looking at previous reports next to latter reports or previous risks to current risks in any detailed way. Remember: Always read the “methodology” section of any report. If it doesn’t have one, consider using it for kindling.

Detail #2 : CYA (Check Your Axes)

I generally end up giving the production team behind the Global Risks Report minor-to-major kudos each year as they usually produce a slick PDF (I’m sure the printed version looks spiffy as well) with occasional usefully-innovative visualization choices. One of their seminal charts is the global risks likelihood/impact scatterplot that I’m sure most folks familiar with risk-oriented visualizations have seen:

There’s a visible area “mini-map” with both the online and print versions:

I believe there’s going to be a very real tendency to overlook the minimap and to read the risk chart without looking at the axis values and interpret it as the risks in the lower left hand corder as being “low” and those in the upper right hand corner as being “high” (it’s how risk scatterplots are designed).

The zoomed in view will also likely cause readers to think that some risks are very widely separated from others. They’re not. They are pretty much in 3 buckets and are pseudo-squishy-medium-yellow-ish risks (which can mean that the risk estimators hedged their guesses). I realize they zoomed in to enable seeing the labels for the risks and possibly compare the diamond sizes; while not as pristine, we can show them with their names on the full risk board (tap/click it to focus only on the chart):

Remember, these are the top risks and they are squishy. They could be squishier and it’s worth noting that they also look much closer together in the macro view. Perhaps the world isn’t so risky after all.

Detail #3 : Group Responses

The methodology section of the report provides equations that document how they aggregated the risk estimates across their sample groups. The team behind the report also made interactive versions. You can & should go there to see how each group assessed the pool of risks. Thankfully, they didn’t use a closed and expensive system like Tableau to make those interactive graphics, which means we can peek behind the scenes and grab the data ourselves (github link for the cleaned & combined data at the end of the post). Here are the zoomed-in/zoomed-out breakdowns between the groups:

WEF brought “opposite” groups to the table to make risk estimates, so we can also use this data to compare their overall risk scores (with a lazy impact x likelihood valuation) between opposite groups:

Environmental risks bubbled to the top, possibly as a result of holding the survey in a time when drought and breaking ice-shelfs are top-of-mind. Keen observers will note that those are zoomed in views. Here are the same slopegraphs on the full range of risk values:

That paints an even better picture of the squishiness of said risks.

FIN

One could deep-dive into many of those charts to pull out some interesting things, especially the how various opposite groups rated various risks (though there were only a few semi-stark contrasts). Actually, you can deep dive into them as the full data from the risk rating interactive visualizations and the R code that generated the above charts are all at https://github.com/hrbrmstr/devils_in_the_davos_details.

Even if there is little efficacy to this annual event, you now have some ++gd data to practice with in ggplot2, dplyr and tidyr.

Knit directly to jupyter notebooks from RStudio

Did you know that you can completely replace the “knitting” engine in R Markdown documents? Well, you can!

Why would you want to do this? Well, in the case of this post, to commit the unpardonable sin of creating a clunky jupyter notebook from a pristine Rmd file.

I’m definitely not “a fan” of “notebook-style” interactive data science workflows (apologies to RStudio, but I don’t even like their take on the interactive notebook). However, if you work with folks who are more productive in jupyter-style environments, it can be handy to be able to move back and forth between the ipynb and Rmd formats.

The notedown module and command-line tool does just that. I came across that after seeing this notedown example. There’s a script there to do the conversion but it’s very Windows-specific and it’s a pretty manual endeavour if all you want to do is quickly generate both an ipynb file and a notebook preview html file from an Rmd you’re working on.

We can exploit the fact that you can specify a knit: parameter in the Rmd YAML header. Said parameter can be inline code or be a reference to a function in a package. When you use the “Knit” command from RStudio (button or key-cmd-shortcut) this parameter will cause the Rmd file to be passed to that function and bypass all pandoc processing. Your function has to do all the heavy lifting.

To that end, I modified my (github only for now) markdowntemplates package and added a to_jupyter() function. Provided you have jupyter setup correctly (despite what the python folk say said task is not always as easy as they’d like you to believe) and notedown installed properly, adding knit: markdowntemplates::to_jupyter to the YAML header of (in theory) any Rmd document and knitting it via RStudio will result in

an ipynb file being generated
an html file generated via nbconverting the notebook file, and
said HTML file being rendered in your system’s default browser

You can take this test Rmd:

---
knit: markdowntemplates::to_jupyter
---
## Notedown Test

Let's try a python block

```{r engine="python"}
def test(x):
  return x * x
test(2)
```

And a ggplot test

```{r}
suppressPackageStartupMessages(library(ggplot2))
```

We'll use an old friend

```{r}
head(mtcars)
```

and plot it:

```{r}
ggplot(mtcars, aes(wt, mpg)) + geom_point() + ggthemes::theme_fivethirtyeight()
```

and, after doing devtools::install_github("hrbrmstr/markdowntemplates") and ensuring you have notedown working, knit it in RStudio to generate the ipynb file and render an HTML file:

Note the python block is a fully functioning notebook cell. I haven’t tried other magic language cells, but they should work according to the notedown docs.

I’ve only performed light testing (on a MacBook Pro with jupyter running under python 3.x) and I’m sure there will be issues (it’s python, it’s jupyter and this is dark alchemy bridging those two universes), so when you run into errors, please file an issue. Also drop any feature requests to the same location.

2017-01 Authored Package Updates

The rest of the month is going to be super-hectic and it’s unlikely I’ll be able to do any more to help the push to CRAN 10K, so here’s a breakdown of CRAN and GitHub new packages & package updates that I felt were worth raising awareness on:

epidata

I mentioned this one last week but it wasn’t really a package announcement post. epidata is now on CRAN and is a package to pull data from the Economic Policy Institute (U.S. gov economic data, mostly). Their “hidden” API is well thought out and the data has been nicely curated (and seems to update monthly). It makes it super easy to do things like the following:

library(epidata)
library(tidyverse)
library(stringi)
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")

us_unemp <- get_unemployment("e")

glimpse(us_unemp)
## Observations: 456
## Variables: 7
## $ date            <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-0...
## $ all             <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.05...
## $ less_than_hs    <dbl> 0.100, 0.100, 0.099, 0.099, 0.099, 0.099, 0.09...
## $ high_school     <dbl> 0.055, 0.055, 0.054, 0.054, 0.054, 0.053, 0.05...
## $ some_college    <dbl> 0.050, 0.050, 0.050, 0.049, 0.049, 0.049, 0.04...
## $ college         <dbl> 0.032, 0.031, 0.031, 0.030, 0.030, 0.029, 0.03...
## $ advanced_degree <dbl> 0.021, 0.020, 0.020, 0.020, 0.020, 0.020, 0.02...

us_unemp %>%
  gather(level, rate, -date) %>%
  mutate(level=stri_replace_all_fixed(level, "_", " ") %>%
           stri_trans_totitle() %>%
           stri_replace_all_regex(c("Hs$"), c("High School")),
         level=factor(level, levels=unique(level))) -> unemp_by_edu

col <- ggthemes::tableau_color_pal()(10)

ggplot(unemp_by_edu, aes(date, rate, group=level)) +
  geom_line(color=col[1]) +
  scale_y_continuous(labels=scales::percent, limits =c(0, 0.2)) +
  facet_wrap(~level, scales="free") +
  labs(x=NULL, y="Unemployment rate",
       title=sprintf("U.S. Monthly Unemployment Rate by Education Level (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="XY")

us_unemp %>%
  select(date, high_school, college) %>%
  mutate(date_num=as.numeric(date)) %>%
  ggplot(aes(x=high_school, xend=college, y=date_num, yend=date_num)) +
  geom_segment(size=0.125, color=col[1]) +
  scale_x_continuous(expand=c(0,0), label=scales::percent, breaks=seq(0, 0.12, 0.02), limits=c(0, 0.125)) +
  scale_y_reverse(expand=c(0,100), label=function(x) format(as_date(x), "%Y")) +
  labs(x="Unemployment rate", y="Year ↓",
       title=sprintf("U.S. monthly unemployment rate gap (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       subtitle="Segment width shows the gap between those with a high school\ndegree and those with a college degree",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="X") +
  theme(panel.ontop=FALSE) +
  theme(panel.grid.major.x=element_line(size=0.2, color="#2b2b2b25")) +
  theme(axis.title.x=element_text(family="Arial", face="bold")) +
  theme(axis.title.y=element_text(family="Arial", face="bold", angle=0, hjust=1, margin=margin(r=-14)))

(right edge is high school, left edge is college…I’ll annotate it better next time)

censys

Censys is a search engine by one of the cybersecurity research partners we publish data to at work (free for use by all). The API is moderately decent (it’s mostly a thin shim authentication layer to pass on Google BigQuery query strings to the back-end) and the R package to interface to it censys is now on CRAN.

waffle

The seminal square pie chart package waffle has been updated on CRAN to work better with recent ggplot2 2.x changes and has some additional parameters you may want to check out.

cdcfluview

The viral package cdcfluview has had some updates on the GitHub version to add saner behaviour when specifying dates and had to be updated as the CDC hidden API switched to all https URLs (major push in .gov-land to do that to get better scores on their cyber report cards). I’ll be adding some features before the next CRAN push to enable retrieval of additional mortality data.

sergeant

If you work with Apache Drill (if you don’t, you should), the sergeant package (GitHub) will help you whip it into shape. I’ve mentioned it before on the blog but it has a nigh-complete dplyr interface now that works pretty well. It also has a direct REST API interface and RJDBC interface plus many helper utilities that help you avoid typing SQL strings to get cluster status info. Once I add the ability to create parquet files with it I’ll push it up to CRAN.

The one thing I’d like to do with this package is support any user-defined functions (UDFs in Drill-speak) folks have written. So, if you have a UDF you’ve written or use and you want it wrapped in the package, just drop an issue and I’ll layer it in. I’ll be releasing some open source cybersecurity-related UDFs via the work github in a few weeks.

zkcmd

Drill (in non-standalone mode) relies on Apache Zookeeper to keep everything in sync and it’s sometimes necessary to peek at what’s happening inside the zookeeper cluster, so sergeant has a sister package zkcmd that provides an R interface to zookeeper instances.

ggalt

Some helpful folks tweaked ggalt for better ggplot2 2.x compatibility (#ty!) and I added a new geom_cartogram() (before you ask if it makes warped shapefiles: it doesn’t) that restores the old (and what I believe to be the correct/sane/proper) behaviour of geom_map(). I need to get this on CRAN soon as it has both fixes and many new geoms folks will want to play with in a non-GitHub context.

FIN

There have been some awesome packages released by others in the past month+ and you should add R Weekly to your RSS feeds if you aren’t following it already (there are other things you should have there for R updates as well, but that’s for another blog). I’m definitely looking forward to new packages, visualizations, services and utilities that will be coming this year to the R community.

The Most Important Commodity in 2017 is Data

2017-01-04 – 17:41
Posted in Data Analysis, Data Visualization, data wrangling, ggplot, R
Tagged post
Comments (5)

Despite being in cybersecurity nigh forever (a career that quickly turns one into a determined skeptic if you’re doing your job correctly) I have often trusted various (not to be named) news sources, reports and data sources to provide honest and as-unbiased-as-possible information. The debacle in the U.S. in late 2016 has proven (to me) that we’re all on our own when it comes to validating posited truth/fact. I’m teaching two data science college courses this Spring and one motivation for teaching is helping others on the path to data literacy so they don’t have to just believe what’s tossed at them.

It’s also one of the reasons I write R packages and blog/speak (when I can).

Today, I saw a chart in a WSJ article on the impending minimum wage changes (paywalled, so do the Google hack to read it) set to take effect in 19 states.

It’s a great chart and they cite the data source as coming from the Economic Policy Institute. I found some minimum wage data there and manually went through a bit of it to get enough of a feel that the WSJ wasn’t “mis-truthing” or truth-spinning. (As an aside, the minimum wage should definitely be higher, indexed and adjusted for inflation, but that’s a discussion for another time.)

While on EPI’s site, I did notice they had a “State of Working America Data Library” @ http://www.epi.org/data/. The data there is based on U.S. government published data and I validated that EPI isn’t fabricating anything (but you should not just take my word for it and do your own math from U.S. gov sources). I also noticed that the filtering interactions were delayed a bit and posited said condition was due to the site making AJAX/XHR calls to retrieve data. So, I peeked under the covers and discovered a robust, consistent, hidden API that is now wrapped in an R package dubbed epidata.

You can get the details of what’s available via the EPI site or through package docs.

What can you do with this data?

They seem to update the data (pretty much) monthly and it’s based on U.S. gov publications, so you can very easily validate “news” reports, potentially even easier than via packages that wrap the Bureau of Labor Statistics (BLS) API. On a lark, I wanted to compare the unemployment rate vs median wage over time (mostly to test the API and make a connected scatterplot). If you didn’t add the level of detail to the aesthetics I did, the following code is pretty small to do just that:

library(tidyverse)
library(epidata)
library(ggrepel)

unemployment <- get_unemployment()
wages <- get_median_and_mean_wages()

glimpse(wages)
## Observations: 43
## Variables: 3
## $ date    <int> 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, ...
## $ median  <dbl> 16.53, 16.17, 16.05, 16.15, 16.07, 16.36, 16.15, 16.07...
## $ average <dbl> 19.05, 18.67, 18.64, 18.87, 18.77, 18.83, 19.06, 18.66...

glimpse(unemployment)
## Observations: 456
## Variables: 2
## $ date <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-...
## $ all  <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.059, 0.058, 0...

group_by(unemployment, date=as.integer(lubridate::year(date))) %>%
  summarise(rate=mean(all)) %>%
  left_join(select(wages, date, median), by="date") %>%
  filter(!is.na(median)) %>%
  arrange(date) -> df

cols <- ggthemes::tableau_color_pal()(3)

ggplot(df, aes(rate, median)) +
  geom_path(color=cols[1], arrow=arrow(type="closed", length=unit(10, "points"))) +
  geom_point() +
  geom_label_repel(aes(label=date),
                   alpha=c(1, rep((4/5), (nrow(df)-2)), 1),
                   size=c(5, rep(3, (nrow(df)-2)), 5),
                   color=c(cols[2],
                           rep("#2b2b2b", (nrow(df)-2)),
                           cols[3]),
                   family="Hind Medium") +
  scale_x_continuous(name="Unemployment Rate", expand=c(0,0.001), label=scales::percent) +
  scale_y_continuous(name="Median Wage", expand=c(0,0.25), label=scales::dollar) +
  labs(title="U.S. Unemployment Rate vs Median Wage Since 1978",
       subtitle="Wage data is in 2015 USD",
       caption="Source: EPI analysis of Current Population Survey Outgoing Rotation Group microdata") +
  hrbrmisc::theme_hrbrmstr(grid="XY")

You can build your own “State of Working America Data Library” dashboard with this data (with flexdashboard and/or shiny) and be a bit of a citizen journalist, keeping the actual media in check and staying more informed an engaged.

As usual, drop issues & feature requests in the github repo. If you make some fun things with the data, drop a comment in the blog. Unless something major comes up the package should be on CRAN by the weekend.

Removing Personal Bias From Flu Severity Estimation (a.k.a. Misery Loves Data)

The family got hit pretty hard with the flu right as the Christmas festivities started and we were all pretty much bed-ridden zombies up until today (2017-01-02). When in the throes of a very bad ILI it’s easy to imagine that you’re a victim of a severe outbreak, especially with ancillary data from others that they, too, either just had/have the flu or know others who do. Thankfully, I didn’t have to accept this emotional opine and could turn to the cdcfluview package to see just how this year is measuring up.

Influenza cases are cyclical, and that’s super-easy to see with a longer-view of the CDC national data:

library(cdcfluview)
library(tidyverse)
library(stringi)

flu <- get_flu_data("national", years=2010:2016)

mutate(flu, week=as.Date(sprintf("%s %s 1", YEAR, WEEK), format="%Y %U %u")) %>% 
  select(-`AGE 25-64`) %>% 
  gather(age_group, count, starts_with("AGE")) %>% 
  mutate(age_group=stri_replace_all_fixed(age_group, "AGE", "Ages")) %>% 
  mutate(age_group=factor(age_group, levels=c("Ages 0-4", "Ages 5-24", "Ages 25-49", "Ages 50-64", "Ages 65"))) %>%
  ggplot(aes(week, count, group=age_group)) +
  geom_line(aes(color=age_group)) +
  scale_y_continuous(label=scales::comma, limits=c(0,20000)) +
  facet_wrap(~age_group, scales="free") +
  labs(x=NULL, y="Count of reported ILI cases", 
       title="U.S. National ILI Case Counts by Age Group (2010:2011 flu season through 2016:2017)",
       caption="Source: CDC ILInet via CRAN cdcfluview pacakge") +
  ggthemes::scale_color_tableau(name=NULL) +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="none")

We can use the same data to zoom in on this season:

mutate(flu, week=as.Date(sprintf("%s %s 1", YEAR, WEEK), format="%Y %U %u")) %>% 
  select(-`AGE 25-64`) %>% 
  gather(age_group, count, starts_with("AGE")) %>% 
  mutate(age_group=stri_replace_all_fixed(age_group, "AGE", "Ages")) %>% 
  mutate(age_group=factor(age_group, levels=c("Ages 0-4", "Ages 5-24", "Ages 25-49", "Ages 50-64", "Ages 65"))) %>%
  filter(week >= as.Date("2016-07-01")) %>% 
  ggplot(aes(week, count, group=age_group)) +
  geom_line(aes(color=age_group)) +
  scale_y_continuous(label=scales::comma, limits=c(0,20000)) +
  facet_wrap(~age_group, scales="free") +
  labs(x=NULL, y="Count of reported ILI cases", 
       title="U.S. National ILI Case Counts by Age Group (2016:2017 flu season)",
       caption="Source: CDC ILInet via CRAN cdcfluview pacakge") +
  ggthemes::scale_color_tableau(name=NULL) +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="none")

So, things are trending up, but how severe is this year compared to others? While looking at the number/percentage of ILI cases is one way to understand severity, another is to look at the mortality rate. The cdcfluview package has a get_mortality_surveillance_data() function, but it’s region-based and I’m really only looking at national data in this post. A helpful individual pointed out a new CSV file at https://www.cdc.gov/flu/weekly/index.htm#MS which we can reproducibly programmatically target (so we don’t have to track filename changes by hand) with:

library(rvest)

pg <- read_html("https://www.cdc.gov/flu/weekly/index.htm#MS")
html_nodes(pg, xpath=".//a[contains(@href, 'csv') and contains(@href, 'NCHS')]") %>% 
  html_attr("href") -> mort_ref
mort_url <- sprintf("https://www.cdc.gov%s", mort_ref)

df <- readr::read_csv(mort_url)

We can, then, take a look at the current “outbreak” status (when real-world mortality events exceed the model threshold):

mutate(df, week=as.Date(sprintf("%s %s 1", Year, Week), format="%Y %U %u")) %>% 
  select(week, Expected, Threshold, `Percent of Deaths Due to Pneumonia and Influenza`) %>% 
  gather(category, percent, -week) %>% 
  mutate(percent=percent/100) %>% 
  ggplot() +
  geom_line(aes(week, percent, group=category, color=category)) +
  scale_x_date(date_labels="%Y-%U") +
  scale_y_continuous(label=scales::percent) +
  ggthemes::scale_color_tableau(name=NULL) +
  labs(x=NULL, y=NULL, title="U.S. Pneumonia & Influenza Mortality",
       subtitle="Data through week ending December 10, 2016 as of December 28, 2016",
       caption="Source: National Center for Health Statistics Mortality Surveillance System") +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="bottom")

That view is for all mortality events from both influenza and pneumonia. We can look at the counts for just influenza as well:

mutate(df, week=as.Date(sprintf("%s %s 1", Year, Week), format="%Y %U %u")) %>% 
  select(week, `Influenza Deaths`) %>% 
  ggplot() +
  geom_line(aes(week, `Influenza Deaths`), color=ggthemes::tableau_color_pal()(1)) +
  scale_x_date(date_labels="%Y-%U") +
  scale_y_continuous(label=scales::comma) +
  ggthemes::scale_color_tableau(name=NULL) +
  labs(x=NULL, y=NULL, title="U.S. Influenza Mortality (count of mortality events)",
       subtitle="Data through week ending December 10, 2016 as of December 28, 2016",
       caption="Source: National Center for Health Statistics Mortality Surveillance System") +
  hrbrmisc::theme_hrbrmstr(grid="XY") +
  theme(legend.position="bottom")

It’s encouraging that the overall combined mortality rate is trending downwards and that the mortality rate for influenza is very low. Go. Science.

I’ll be adding a function to cdcfluview to retrieve this new data set a bit later in the year.

Hopefully you’ll avoid the flu and enjoy a healthy and prosperous 2017.

North Carolina’s Neighborhood

UPDATE: I’m glad I’m not the only one who was skeptical of this project: http://andrewgelman.com/2017/01/02/constructing-expert-indices-measuring-electoral-integrity-reply-pippa-norris/

When I saw the bombastic headline “North Carolina is no longer classified as a democracy” pop up in my RSS feeds today (article link: http://www.newsobserver.com/opinion/op-ed/article122593759.html) I knew it’d help feed polarization bear that’s been getting fat on ‘Murica for the past decade. Sure enough, others picked it up and ran with it. I can’t wait to see how the opposite extreme reacts (everybody’s gotta feed the bear).

As of this post, neither site linked to the actual data, so here’s an early Christmas present: The Electoral Integrity Project Data. I’m very happy this is public data since this is the new reality for “news” intake:

Read shocking headline
See no data, bad data, cherry-picked data or poorly-analyzed data
Look for the actual data
Validate data & findings
Possibly learn even more from the data that was deliberately left out or ignored

Data literacy is even more important than it has been.

Back to the title of the post: where exactly does North Carolina fall on the newly assessed electoral integrity spectrum in the U.S.? Right here (click to zoom in):

Focusing solely on North Carolina is pretty convenient (I know there’s quite a bit of political turmoil going on down there at the moment, but that’s no excuse for cherry picking) since — frankly — there isn’t much to be proud of on that entire chart. Here’s where the ‘States fit on the global rankings (we’re in the gray box):

You can page through the table to see where our ‘States fall (we’re between Guana & Latvia…srsly). We don’t always have the nicest neighbors:

This post isn’t a commentary on North Carolina, it’s a cautionary note to be very wary of scary headlines that talk about data but don’t really show it. It’s worth pointing out that I’m taking the PEI data as it stands. I haven’t validated the efficacy of their process or checked on how “activist-y” the researchers are outside the report. It’s somewhat sad that this is a necessary next step since there’s going to be quite a bit of lying with data and even more lying about-and/or-without data over the next 4+ years on both sides (more than in the past eight combined, probably).

The PEI folks provide methodology information and data. Read/study it. They provide raw and imputed confidence intervals (note how large some of those are in the two graphs) – do the same for your research. If their practices are sound, the ‘States chart is pretty damning. I would hope that all the U.S. states would be well above 75 on the rating scale and the fact that we aren’t is a suggestion that we all have work to do right “here” at home, beginning with ceasing to feed the polarization bear.

If you do download the data, here’s the R code that generated the charts:

library(tidyverse)

# u.s. ------------------------------------------------------------------------------

eip_state <- read_tsv("~/Data/eip_dataverse_files/PEI US 2016 state-level (PEI_US_1.0) 16-12-2018.tab")

arrange(eip_state, PEIIndexi) %>%
  mutate(state=factor(state, levels=state)) -> eip_state

ggplot() +
  geom_linerange(data=eip_state, aes(state, ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b00") +
  geom_segment(data=eip_state, aes(x="North Carolina", xend="North Carolina", y=-Inf, yend=Inf), size=5, color="#cccccc", alpha=1/10) +
  geom_linerange(data=eip_state, aes(state, ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b") +
  geom_point(data=eip_state, aes(state, PEIIndexi, fill=responserate), size=2, shape=21, color="#2b2b2b", stroke=0.5) +
  scale_y_continuous(expand=c(0,0.1), limits=c(0,100)) +
  viridis::scale_fill_viridis(name="Response rate\n", label=scales::percent) +
  labs(x="Vertical lines show upper & lower bounds of the 95% confidence interval\nSource: PEI Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YXUV3W)\nNorris, Pippa; Nai, Alessandro; Grömping, Max, 2016, 'Perceptions of Electoral Integrity US 2016 (PEI_US_1.0)'\ndoi:10.7910/DVN/YXUV3W, Harvard Dataverse, V1, UNF:6:1cMrtJfvUs9uBoNewfUKqA==",
       y="PEI Index (imputed)",
       title="Perceptions of Electoral Integrity: U.S. 2016 POTUS State Ratings",
       subtitle="The PEI index is designed to provide an overall summary evaluation of expert perceptions that an election\nmeets international standards and global norms. It is generated at the individual level. Unlike the individual\nindex (PEIIndex) PEIIndexi is imputed and thus fully observed for all experts and states.") +
  hrbrmisc::theme_hrbrmstr(grid="Y", subtitle_family="Hind Light", subtitle_size=11) +
  theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
  theme(axis.title.x=element_text(margin=margin(t=15))) +
  theme(legend.position=c(0.8, 0.1)) +
  theme(legend.title.align=1) +
  theme(legend.title=element_text(size=8)) +
  theme(legend.key.size=unit(0.5, "lines")) +
  theme(legend.direction="horizontal") +
  theme(legend.key.width=unit(3, "lines"))

# global ----------------------------------------------------------------------------

eip_world <- read_csv("~/Data/eip_dataverse_files/PEI country-level data (PEI_4.5) 19-08-2016.csv")

arrange(eip_world, PEIIndexi) %>%
  mutate(country=factor(country, levels=country)) -> eip_world

ggplot() +
  geom_linerange(data=eip_world, aes(factor(country), ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b00") +
  geom_linerange(data=eip_world, aes(factor(country), ymin=PEIIndexi_lci, ymax=PEIIndexi_hci), size=0.25, color="#2b2b2b") +
  geom_point(data=eip_world, aes(country, PEIIndexi), size=2, shape=21, fill="steelblue", color="#2b2b2b", stroke=0.5) +
  scale_y_continuous(expand=c(0,0.1), limits=c(0,100)) +
  labs(x="Vertical lines show upper & lower bounds of the 95% confidence interval\nSource: PEI Dataverse (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LYO57K)\nNorris, Pippa; Nai, Alessandro; Grömping, Max, 2016, 'Perceptions of Electoral Integrity (PEI-4.5)\ndoi:10.7910/DVN/LYO57K, Harvard Dataverse, V2",
       y="PEI Index (imputed)",
       title="Perceptions of Electoral Integrity: 2016 Global Ratings",
       subtitle="The PEI index is designed to provide an overall summary evaluation of expert perceptions that an election\nmeets international standards and global norms. It is generated at the individual level. Unlike the individual\nindex (PEIIndex) PEIIndexi is imputed and thus fully observed for all experts and countries") +
  hrbrmisc::theme_hrbrmstr(grid="Y", subtitle_family="Hind Light", subtitle_size=11) +
  theme(axis.text.x=element_blank()) +
  theme(axis.title.x=element_text(margin=margin(t=15)))

Pipes (%>%) Everywhere

2016-12-22 – 15:09
Posted in Alfred, Apple, Chrome, macOS, R, RStudio
Tagged post
Comments (5)

An R user asked a question regarding whether it’s possible to have the RStudio pipe (%>%) shortcut (Cmd-Shift-M) available in other macOS applications. If you’re using Alfred then you can use this workflow for said task (IIRC this requires an Alfred license which is reasonably cheap).

When you add it to Alfred you must edit it to make Cmd-Shift-M the hotkey since Alfred strips the keys on import (for good reasons). I limited the workflow to a few apps (Safari, Chrome, Sublime Text, iTerm) and I think it makes sense to limit the apps it applies to (you don’t need the operator everywhere, at least IMO).

I can’t believe I didn’t do this earlier. I use R in the terminal a bit and mis-hit Cmd-Shift-M all the time when I do (since RStudio is my primary editor for R code and muscle memory is scarily powerful). I also have to use (ugh) Jupyter notebooks on occasion and this will help there, too. If you end up modifying or using the workflow, drop a note in the comments.