Despite being in cybersecurity nigh forever (a career that quickly turns one into a determined skeptic if you’re doing your job correctly) I have often trusted various (not to be named) news sources, reports and data sources to provide honest and as-unbiased-as-possible information. The debacle in the U.S. in late 2016 has proven (to me) that we’re all on our own when it comes to validating posited truth/fact. I’m teaching two data science college courses this Spring and one motivation for teaching is helping others on the path to data literacy so they don’t have to just believe what’s tossed at them.
It’s also one of the reasons I write R packages and blog/speak (when I can).
Today, I saw a chart in a WSJ article on the impending minimum wage changes (paywalled, so do the Google hack to read it) set to take effect in 19 states.
It’s a great chart and they cite the data source as coming from the Economic Policy Institute. I found some minimum wage data there and manually went through a bit of it to get enough of a feel that the WSJ wasn’t “mis-truthing” or truth-spinning. (As an aside, the minimum wage should definitely be higher, indexed and adjusted for inflation, but that’s a discussion for another time.)
While on EPI’s site, I did notice they had a “State of Working America Data Library” @ http://www.epi.org/data/. The data there is based on U.S. government published data and I validated that EPI isn’t fabricating anything (but you should not just take my word for it and do your own math from U.S. gov sources). I also noticed that the filtering interactions were delayed a bit and posited said condition was due to the site making AJAX/XHR calls to retrieve data. So, I peeked under the covers and discovered a robust, consistent, hidden API that is now wrapped in an R package dubbed epidata.
You can get the details of what’s available via the EPI site or through package docs.
What can you do with this data?
They seem to update the data (pretty much) monthly and it’s based on U.S. gov publications, so you can very easily validate “news” reports, potentially even easier than via packages that wrap the Bureau of Labor Statistics (BLS) API. On a lark, I wanted to compare the unemployment rate vs median wage over time (mostly to test the API and make a connected scatterplot). If you didn’t add the level of detail to the aesthetics I did, the following code is pretty small to do just that:
library(tidyverse)
library(epidata)
library(ggrepel)
unemployment <- get_unemployment()
wages <- get_median_and_mean_wages()
glimpse(wages)
## Observations: 43
## Variables: 3
## $ date <int> 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, ...
## $ median <dbl> 16.53, 16.17, 16.05, 16.15, 16.07, 16.36, 16.15, 16.07...
## $ average <dbl> 19.05, 18.67, 18.64, 18.87, 18.77, 18.83, 19.06, 18.66...
glimpse(unemployment)
## Observations: 456
## Variables: 2
## $ date <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-01, 1979-04-...
## $ all <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.059, 0.058, 0...
group_by(unemployment, date=as.integer(lubridate::year(date))) %>%
summarise(rate=mean(all)) %>%
left_join(select(wages, date, median), by="date") %>%
filter(!is.na(median)) %>%
arrange(date) -> df
cols <- ggthemes::tableau_color_pal()(3)
ggplot(df, aes(rate, median)) +
geom_path(color=cols[1], arrow=arrow(type="closed", length=unit(10, "points"))) +
geom_point() +
geom_label_repel(aes(label=date),
alpha=c(1, rep((4/5), (nrow(df)-2)), 1),
size=c(5, rep(3, (nrow(df)-2)), 5),
color=c(cols[2],
rep("#2b2b2b", (nrow(df)-2)),
cols[3]),
family="Hind Medium") +
scale_x_continuous(name="Unemployment Rate", expand=c(0,0.001), label=scales::percent) +
scale_y_continuous(name="Median Wage", expand=c(0,0.25), label=scales::dollar) +
labs(title="U.S. Unemployment Rate vs Median Wage Since 1978",
subtitle="Wage data is in 2015 USD",
caption="Source: EPI analysis of Current Population Survey Outgoing Rotation Group microdata") +
hrbrmisc::theme_hrbrmstr(grid="XY")
You can build your own “State of Working America Data Library” dashboard with this data (with flexdashboard and/or shiny) and be a bit of a citizen journalist, keeping the actual media in check and staying more informed an engaged.
As usual, drop issues & feature requests in the github repo. If you make some fun things with the data, drop a comment in the blog. Unless something major comes up the package should be on CRAN by the weekend.






North Carolina’s Neighborhood
When I saw the bombastic headline “North Carolina is no longer classified as a democracy” pop up in my RSS feeds today (article link: http://www.newsobserver.com/opinion/op-ed/article122593759.html) I knew it’d help feed polarization bear that’s been getting fat on ‘Murica for the past decade. Sure enough, others picked it up and ran with it. I can’t wait to see how the opposite extreme reacts (everybody’s gotta feed the bear).
As of this post, neither site linked to the actual data, so here’s an early Christmas present: The Electoral Integrity Project Data. I’m very happy this is public data since this is the new reality for “news” intake:
Data literacy is even more important than it has been.
Back to the title of the post: where exactly does North Carolina fall on the newly assessed electoral integrity spectrum in the U.S.? Right here (click to zoom in):
Focusing solely on North Carolina is pretty convenient (I know there’s quite a bit of political turmoil going on down there at the moment, but that’s no excuse for cherry picking) since — frankly — there isn’t much to be proud of on that entire chart. Here’s where the ‘States fit on the global rankings (we’re in the gray box):
You can page through the table to see where our ‘States fall (we’re between Guana & Latvia…srsly). We don’t always have the nicest neighbors:
This post isn’t a commentary on North Carolina, it’s a cautionary note to be very wary of scary headlines that talk about data but don’t really show it. It’s worth pointing out that I’m taking the PEI data as it stands. I haven’t validated the efficacy of their process or checked on how “activist-y” the researchers are outside the report. It’s somewhat sad that this is a necessary next step since there’s going to be quite a bit of lying with data and even more lying about-and/or-without data over the next 4+ years on both sides (more than in the past eight combined, probably).
The PEI folks provide methodology information and data. Read/study it. They provide raw and imputed confidence intervals (note how large some of those are in the two graphs) – do the same for your research. If their practices are sound, the ‘States chart is pretty damning. I would hope that all the U.S. states would be well above 75 on the rating scale and the fact that we aren’t is a suggestion that we all have work to do right “here” at home, beginning with ceasing to feed the polarization bear.
If you do download the data, here’s the R code that generated the charts: