The rest of the month is going to be super-hectic and it’s unlikely I’ll be able to do any more to help the push to CRAN 10K, so here’s a breakdown of CRAN and GitHub new packages & package updates that I felt were worth raising awareness on:
I mentioned this one last week but it wasn’t really a package announcement post. epidata is now on CRAN; it pulls data from the Economic Policy Institute (mostly U.S. government economic data). Their “hidden” API is well thought out and the data has been nicely curated (and seems to update monthly). It makes it super easy to do things like the following:
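library(epidata)   # get_unemployment()
library(tidyverse) # %>%, gather(), mutate(), glimpse(), ggplot2
library(stringi)   # stri_*() string helpers
library(hrbrmisc)  # devtools::install_github("hrbrmstr/hrbrmisc")

# unemployment rate broken out by education level ("e")
us_unemp <- get_unemployment("e")

glimpse(us_unemp)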
## Observations: 456
## Variables: 7
## $ date <date> 1978-12-01, 1979-01-01, 1979-02-01, 1979-03-0...
## $ all <dbl> 0.061, 0.061, 0.060, 0.060, 0.059, 0.059, 0.05...
## $ less_than_hs <dbl> 0.100, 0.100, 0.099, 0.099, 0.099, 0.099, 0.09...
## $ high_school <dbl> 0.055, 0.055, 0.054, 0.054, 0.054, 0.053, 0.05...
## $ some_college <dbl> 0.050, 0.050, 0.050, 0.049, 0.049, 0.049, 0.04...
## $ college <dbl> 0.032, 0.031, 0.031, 0.030, 0.030, 0.029, 0.03...
## $ advanced_degree <dbl> 0.021, 0.020, 0.020, 0.020, 0.020, 0.020, 0.02...

# reshape to long format and turn column names into readable facet labels
us_unemp %>%
  gather(level, rate, -date) %>%
  mutate(level=stri_replace_all_fixed(level, "_", " ") %>%
           stri_trans_totitle() %>%
           stri_replace_all_regex(c("Hs$"), c("High School")),
         level=factor(level, levels=unique(level))) -> unemp_by_edu

col <- ggthemes::tableau_color_pal()(10)

# one small multiple per education level
ggplot(unemp_by_edu, aes(date, rate, group=level)) +
  geom_line(color=col[1]) +
  scale_y_continuous(labels=scales::percent, limits=c(0, 0.2)) +
  facet_wrap(~level, scales="free") +
  labs(x=NULL, y="Unemployment rate",
       title=sprintf("U.S. Monthly Unemployment Rate by Education Level (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.")

# one horizontal segment per month, spanning the high school and college rates
us_unemp %>%
  select(date, high_school, college) %>%
  mutate(date_num=as.numeric(date)) %>%
  ggplot(aes(x=high_school, xend=college, y=date_num, yend=date_num)) +
  geom_segment(size=0.125, color=col[1]) +
  scale_x_continuous(expand=c(0,0), labels=scales::percent, breaks=seq(0, 0.12, 0.02), limits=c(0, 0.125)) +
  # dates were converted to numbers for the reversed axis; convert back for the labels
  scale_y_reverse(expand=c(0,100), labels=function(x) format(lubridate::as_date(x), "%Y")) +
  labs(x="Unemployment rate", y="Year ↓",
       title=sprintf("U.S. monthly unemployment rate gap (%s)", paste0(range(format(us_unemp$date, "%Y")), collapse=":")),
       subtitle="Segment width shows the gap between those with a high school\ndegree and those with a college degree",
       caption="Source: EPI analysis of basic monthly Current Population Survey microdata.") +
  theme_hrbrmstr(grid="X") +
  theme(panel.ontop=FALSE) +
  theme(panel.grid.major.x=element_line(size=0.2, color="#2b2b2b25")) +
  theme(axis.title.x=element_text(family="Arial", face="bold")) +
  theme(axis.title.y=element_text(family="Arial", face="bold", angle=0, hjust=1, margin=margin(r=-14)))

(right edge is high school, left edge is college…I’ll annotate it better next time)
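If you'd rather see the gap as a number than eyeball segment widths, here's a minimal base R sketch. It assumes nothing beyond the first three high_school and college values shown in the printed output above; the `unemp_sample` and `gap_pp` names are mine, not part of epidata.

```r
# First three monthly values copied from the printed output earlier in the post;
# the data frame and column names here are illustrative only.
unemp_sample <- data.frame(
  date        = as.Date(c("1978-12-01", "1979-01-01", "1979-02-01")),
  high_school = c(0.055, 0.055, 0.054),
  college     = c(0.032, 0.031, 0.031)
)

# Gap in percentage points: this is the width of each segment in the chart
unemp_sample$gap_pp <- round((unemp_sample$high_school - unemp_sample$college) * 100, 1)

print(unemp_sample)
```

The same subtraction on the full us_unemp data frame would give one gap value per month, which is what the chart encodes as segment width.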
Censys is a search engine by one of the cybersecurity research partners we publish data to at work (free for use by all). The API is moderately decent (it’s mostly a thin shim authentication layer to pass on Google BigQuery query strings to the back-end) and the R package that interfaces with it, censys, is now on CRAN.
The seminal square pie chart package waffle has been updated on CRAN to work better with recent ggplot2 2.x changes and has some additional parameters you may want to check out.
The viral package cdcfluview has had some updates on the GitHub version to add saner behaviour when specifying dates, and it had to be updated as the CDC hidden API switched to all-https URLs (there’s a major push in .gov-land to do that to get better scores on their cyber report cards). I’ll be adding some features before the next CRAN push to enable retrieval of additional mortality data.
If you work with Apache Drill (if you don’t, you should), the sergeant package (GitHub) will help you whip it into shape. I’ve mentioned it before on the blog but it has a nigh-complete dplyr interface now that works pretty well. It also has a direct REST API interface and RJDBC interface plus many helper utilities that help you avoid typing SQL strings to get cluster status info. Once I add the ability to create parquet files with it I’ll push it up to CRAN.
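To give a feel for the dplyr interface, here's a rough sketch rather than a definitive recipe. It assumes a Drill instance is reachable on localhost with its default REST port, that the sergeant and dplyr packages are installed, and that Drill's bundled cp.`employee.json` sample is available. The wrapper function name `drill_employee_counts` is hypothetical and exists only so nothing runs until you call it.

```r
# Sketch only: needs the sergeant and dplyr packages plus a running Drill instance.
# drill_employee_counts() is a hypothetical helper name, not part of sergeant.
drill_employee_counts <- function(host = "localhost") {
  stopifnot(
    requireNamespace("sergeant", quietly = TRUE),
    requireNamespace("dplyr", quietly = TRUE)
  )

  # dplyr-style connection to Drill
  db <- sergeant::src_drill(host)

  # sample data set that ships on Drill's classpath
  emp <- dplyr::tbl(db, "cp.`employee.json`")

  # the verbs are translated to Drill SQL; collect() pulls the result into R
  dplyr::collect(dplyr::count(emp, gender, marital_status))
}
```

Calling `drill_employee_counts()` against a live Drill should hand back a small tibble of counts without any hand-written SQL strings.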
The one thing I’d like to do with this package is support any user-defined functions (UDFs in Drill-speak) folks have written. So, if you have a UDF you’ve written or use and you want it wrapped in the package, just drop an issue and I’ll layer it in. I’ll be releasing some open source cybersecurity-related UDFs via the work github in a few weeks.
Drill (in non-standalone mode) relies on Apache Zookeeper to keep everything in sync and it’s sometimes necessary to peek at what’s happening inside the zookeeper cluster, so sergeant has a sister package zkcmd that provides an R interface to zookeeper instances.
Some helpful folks tweaked ggalt for better ggplot2 2.x compatibility (#ty!) and I added a new geom_cartogram() (before you ask if it makes warped shapefiles: it doesn’t) that restores the old (and what I believe to be the correct/sane/proper) behaviour of geom_map(). I need to get this on CRAN soon as it has both fixes and many new geoms folks will want to play with in a non-GitHub context.
There have been some awesome packages released by others in the past month+ and you should add R Weekly to your RSS feeds if you aren’t following it already (there are other things you should have there for R updates as well, but that’s for another blog). I’m definitely looking forward to new packages, visualizations, services and utilities that will be coming this year to the R community.
North Carolina’s Neighborhood
When I saw the bombastic headline “North Carolina is no longer classified as a democracy” pop up in my RSS feeds today, I knew it’d help feed the polarization bear that’s been getting fat on ‘Murica for the past decade. Sure enough, others picked it up and ran with it. I can’t wait to see how the opposite extreme reacts (everybody’s gotta feed the bear).
As of this post, neither site linked to the actual data, so here’s an early Christmas present: The Electoral Integrity Project Data. I’m very happy this is public data since this is the new reality for “news” intake:
Data literacy is even more important than it has been.
Back to the title of the post: where exactly does North Carolina fall on the newly assessed electoral integrity spectrum in the U.S.? Right here:
Focusing solely on North Carolina is pretty convenient (I know there’s quite a bit of political turmoil going on down there at the moment, but that’s no excuse for cherry picking) since — frankly — there isn’t much to be proud of on that entire chart. Here’s where the ‘States fit on the global rankings (we’re in the gray box):
You can page through the table to see where our ‘States fall (we’re between Guyana & Latvia…srsly). We don’t always have the nicest neighbors:
This post isn’t a commentary on North Carolina, it’s a cautionary note to be very wary of scary headlines that talk about data but don’t really show it. It’s worth pointing out that I’m taking the PEI data as it stands. I haven’t validated the efficacy of their process or checked on how “activist-y” the researchers are outside the report. It’s somewhat sad that this is a necessary next step since there’s going to be quite a bit of lying with data and even more lying about-and/or-without data over the next 4+ years on both sides (more than in the past eight combined, probably).
The PEI folks provide methodology information and data. Read/study it. They provide raw and imputed confidence intervals (note how large some of those are in the two graphs) – do the same for your research. If their practices are sound, the ‘States chart is pretty damning. I would hope that all the U.S. states would be well above 75 on the rating scale and the fact that we aren’t is a suggestion that we all have work to do right “here” at home, beginning with ceasing to feed the polarization bear.
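On the "do the same for your research" point, here's a minimal sketch of putting a 95% confidence interval around a mean score in base R. To be loud about it: the scores are simulated with rnorm() and are NOT PEI data, and this is a plain t-based interval, not the PEI project's imputation method.

```r
set.seed(42)  # reproducible, simulated scores; NOT PEI data
scores <- rnorm(30, mean = 60, sd = 10)

n          <- length(scores)
avg        <- mean(scores)
se         <- sd(scores) / sqrt(n)          # standard error of the mean
half_width <- qt(0.975, df = n - 1) * se    # 95% t-based margin

ci <- c(lower = avg - half_width, estimate = avg, upper = avg + half_width)
print(round(ci, 1))
```

This matches what `t.test(scores)$conf.int` reports. The wider the interval, the less a single point on a ranking chart should be trusted on its own.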
If you do download the data, here’s the R code that generated the charts: