
>_The parent proposed a committee made up of parents and teachers of different cultural backgrounds to come up with a list of books that are inclusive for all students._

Where’s the [ACLU](https://www.aclu.org/issues/free-speech/artistic-expression/banned-books)? They’d be right there if this were an alt-right’er asking to ban books with salacious content they deem (rightly or wrongly) “inappropriate” or “harmful” to teens. I hope they step up and fight the good fight here.

We’re rapidly losing — or may have already lost (as a society) — the concept of building resilience & strength through adversity. I’m hopeful that the diversity immigrants and refugees bring into this country (if the borders don’t slam shut or we scare them away), folks who know what true adversity is, will eventually counteract this downward spiral.

Refs:



(mp3 backup in the event they take down the audio recording)

When something pops up in the news, on Twitter, Facebook or even (ugh) Gab claiming or teasing findings based on data, I believe it’s more important than ever to reply with some polite text and a `#showmethedata` hashtag. We desperately needed it this year and we’re absolutely going to need it in 2017 and beyond.

The catalyst for this is a recent [New York magazine story](http://nymag.com/daily/intelligencer/2016/11/activists-urge-hillary-clinton-to-challenge-election-results.html) about “computer scientists” playing off of heated election emotions and making claims in a non-public, partisan meeting, with no release of data to the public or to independent, non-partisan groups with credible, ethical data analysis teams.

I believe agents of parties in power and agents of the parties who want to be in power are going to be misusing, abusing and fabricating data at a velocity and volume we’ve not seen before. If you care about the truth (the real truth, not a “necessary truth” based on an agenda) and are a capable data-worker, it’s nigh your civic duty to keep in check those who are wont to deceive.

**UPDATE** 2016-11-23 10:30:00 EST

Halderman has an (ugh) [Medium post](https://medium.com/@jhalderm/want-to-know-if-the-election-was-hacked-look-at-the-ballots-c61a6113b0ba#.vvcd9bguw) (the “ugh” is for the use of Medium, not the post content) and, as usual, the media has caused more controversy than necessary. He has the best intentions, and the future confidentiality, integrity and availability of our electoral infrastructure, at heart.

The [first public informational video](https://www.greatagain.gov/news/message-president-elect-donald-j-trump.html) from the PEOTUS came with no full transcript on the web site, and (at least as of 0700 EST on 2016-11-22) no team-provided text annotations/captions on the video.

Google’s (YouTube’s) auto-captioning (for the most part) worked, and it’s most likely “just good enough” to enable the PEOTUS’s team to avoid an all-out A.D.A. violation (which is probably the least of their upcoming legal worries). This is one forgotten small detail in an ever-growing list of forgotten small and large details. I’m also surprised that no progressive web site bothered to put up a transcription for those who need it.

Since “the press” did a terrible job holding the outgoing POTUS accountable during his two terms and also woefully botched their coverage of the 2016 election (NOTE: I didn’t vote for either major-party candidate, but did write in folks for POTUS & veep), I guess it’s up to all of us to help document history.

Here’s a moderately cleaned up version from the auto-generated Google SRT stream, presented without any commentary:

A Message from President-Elect Donald J. Trump

Today I would like to provide the American people with an update on the White House transition and our policy plans for the first 100 days.

Our transition team is working very smoothly, efficiently and effectively. Truly great and talented men and women, patriots indeed, are being brought in, and many will soon be a part of our government, helping us to make America great again.

My agenda will be based on a simple core principle: putting America first, whether it’s producing steel, building cars or curing disease. I want the next generation of production and innovation to happen right here in our great homeland, America, creating wealth and jobs for American workers.

As part of this plan, I’ve asked my transition team to develop a list of executive actions we can take on day one to restore our laws and bring back our jobs (about time). These include the following. On trade: I’m going to issue our notification of intent to withdraw from the Trans-Pacific Partnership — a potential disaster for our country. Instead, we will negotiate fair bilateral trade deals that bring jobs and industry back onto American shores.

On energy: I will cancel job-killing restrictions on the production of American energy, including shale energy and clean coal, creating many millions of high-paying jobs. That’s what we want, that’s what we’ve been waiting for.

On regulation: I will formulate a rule which says that for every one new regulation, two old regulations must be eliminated (so important).

On national security: I will ask the Department of Defense and the chairman of the Joint Chiefs of Staff to develop a comprehensive plan to protect America’s vital infrastructure from cyberattacks and all other forms of attack.

On immigration: I will direct the Department of Labor to investigate all abuses of visa programs that undercut the American worker.

On ethics reform: as part of our plan to “drain the swamp” we will impose a five-year ban on executive officials becoming lobbyists after they leave the administration, and a lifetime ban on executive officials lobbying on behalf of a foreign government.

These are just a few of the steps we will take to reform Washington and rebuild our middle class. I will provide more updates in the coming days as we work together to make America great again for everyone, and I mean everyone.

For those technically inclined, you can grab that caption feed using the following R code. You’ll first need to open Developer Tools in your browser and tick the “CC” button on the video player before starting playback. Look for a network request that begins with `https://www.youtube.com/api/timedtext`, use the “Copy as cURL” feature (all three major, cross-platform browsers support it), then run the code immediately afterward (the `straighten()` function will take the request data from the clipboard).

library(curlconverter)
library(httr)
library(xml2)

# convert the "Copy as cURL" request (on the clipboard) into a
# ready-to-call httr request function
straighten() %>% 
  make_req() -> req

# execute the request to fetch the timed-text (caption) XML
res <- req[[1]]()

# pull the caption text out of the XML and write it to a file
content(res) %>% 
  xml_find_all(".//body/p") %>% 
  xml_text() %>% 
  paste0(collapse="") %>% 
  writeLines("speech.txt")

In this emergent “post-truth” world, we’re all going to need to be citizen data journalists, and it’s going to be more important than ever to ensure any data-driven pieces we write are honest, well-sourced, open to review & civil commentary, and fully reproducible. We will also all need to hold those in power — including the media — accountable. Whatever your leanings, cherry-picking data, slicing a bit off the edges of the truth and injecting bias because you passionately believe you’re right will only exacerbate the existing problems.

First it was OpenDNS selling their souls (and [y]our data) to Cisco (whom I don’t trust at all with my data).

Now, it’s Dyn, selling to Oracle, doing something even worse (purely my own opinion).

I’m currently evaluating offerings by [FoolDNS](http://www.fooldns.com/fooldns-community/english-version/) & [GreenTeam](http://members.greentm.co.uk/) as alternatives and I’ll post updates as I review & test them.

I’m also in search of an open source, RPi-able DNS server with regularly updated Squid-like categorical lists and the ability to whitelist domains (suggestions welcome in the comments).

I’m a cybersecurity data scientist who knows just what can be done with this type of data when it’s handed to `$BIGCORP`. I’m far more concerned with Oracle than Cisco, but I’d rather work with a smaller company that has more reason not to sell me out.

[Bulbs](https://www.youtube.com/watch?v=ROEIKn8OsGU).

If those were real, functional bulbs that were destroyed…spreading real, irreclaimable refuse…all to shill a far less than revolutionary “professional” laptop…then, just how “enlightened” is Apple, really?

But, I guess it’s fine for the intelligentsia class to violate their own prescribed norms if it furthers their own causes.

>_Stop making people suicidal. Stop telling people they’re going to be killed. Stop terrifying children. Stop giving racism free advertising. Stop trying to convince Americans that all the other Americans hate them. Stop. Stop. Stop._

Just. [Stop](http://slatestarcodex.com/2016/11/16/you-are-still-crying-wolf/).

2016-08-13 UPDATE: Fortune has a story on this, and it does seem to be tax-related rather than ideology-driven. @thosjleeper suggested something similar about a week ago.

If you’re even remotely following the super-insane 2016 U.S. POTUS election circus, you’ve no doubt seen a resurgence of _”if X gets elected, I’m moving to Y”_ claims by folks who are “anti” one candidate or another. The [Washington Examiner](http://www.washingtonexaminer.com/americans-renouncing-citizenship-near-record-highs/article/2598074) did a story on last quarter’s U.S. expatriation numbers. I didn’t realize we had a department in charge of tracking and posting that data, but we do, thanks to inane bureaucratic compliance laws.

I should have put _”posting that data”_ in quotes as it’s collected quarterly and posted ~2 months later in non-uniform HTML and PDF form across individual posts in a unique/custom Federal Register publishing system. How’s that hope and change in “open government data” working out for y’all?

The data is organized enough that we can take a look at the history of expatriation with some help from R. Along the way we’ll:

– see how to make parameterized web requests a bit cleaner with `httr`
– get even _more_ practice using the `purrr` package
– perhaps learn a new “trick” when using the `stringi` package
– show how we can “make do” living in a non-XPath 2 world (it’s actually pretty much at XPath 3 now, too #sigh)

A manual hunt on that system will eventually reveal a search URL you can feed to `read.csv()` (to grab a list of URLs pointing at the data, not the data itself #ugh). Those URLs are _gnarly_ (you’ll see what I mean if you do the hunt), but we can supply the standardized query parameters from those egregiously long URLs in a far more readable fashion by using `httr::GET()` directly, especially since `httr::content()` will auto-convert the resultant CSV to a `tibble` for us (the site sets the response MIME type appropriately).

Unfortunately, when using the `6039G` search parameter (the expatriate tracking form ID) we do need to filter out non-quarterly report documents since the bureaucrats must have their ancillary TPS reports.

library(dplyr)
library(httr)
library(rvest)
library(purrr)
library(lubridate)
library(ggplot2) # devtools::install_github("hadley/ggplot2")
library(hrbrmisc) # devtools::install_github("hrbrmstr/hrbrmisc")
library(ggalt)
library(grid)
library(scales)
library(magrittr)
library(stringi)

GET("https://www.federalregister.gov/articles/search.csv",
    query=list(`conditions[agency_ids][]`=254,
               `conditions[publication_date][gte]`="01/01/2006",
               `conditions[publication_date][lte]`="7/29/2016",
               `conditions[term]`="6039G",
               `conditions[type][]`="NOTICE")) %>%
  content("parsed") %>%
  filter(grepl("^Quarterly", title)) -> register

glimpse(register)
## Observations: 44
## Variables: 9
## $ citation         <chr> "81 FR 50058", "81 FR 27198", "81 FR 65...
## $ document_number  <chr> "2016-18029", "2016-10578", "2016-02312...
## $ title            <chr> "Quarterly Publication of Individuals, ...
## $ publication_date <chr> "07/29/2016", "05/05/2016", "02/08/2016...
## $ type             <chr> "Notice", "Notice", "Notice", "Notice",...
## $ agency_names     <chr> "Treasury Department; Internal Revenue ...
## $ html_url         <chr> "https://www.federalregister.gov/articl...
## $ page_length      <int> 9, 17, 16, 20, 8, 20, 16, 12, 9, 15, 8,...
## $ qtr              <date> 2016-06-30, 2016-03-31, 2015-12-31, 20...

Now, we grab the content at each of the `html_url`s and save them off to be kind to bandwidth and/or folks with slow connections (so you don’t have to re-grab the HTML):

# fetch & parse each notice, then cache the parsed docs locally
docs <- map(register$html_url, read_html)
saveRDS(docs, file="deserters.rds")

That generates a list of parsed HTML documents.

The reporting dates aren’t 100% consistent (i.e. not always “n” weeks from the collection date), but the data collection dates _embedded textually in the report_ are (mostly…some vary in the use of upper/lower case). So, we use the fact that these are boring legal documents that use the same language for various phrases and extract the “quarter ending” dates so we know what year/quarter the data is relevant for:

# pull the "quarter ending <date>" phrase out of each report body and
# convert it to a proper Date
register %<>%
  mutate(qtr=map_chr(docs, ~stri_match_all_regex(html_text(.),
                                                 "quarter ending ([[:alnum:], ]+)\\.",
                                                 opts_regex=stri_opts_regex(case_insensitive=TRUE))[[1]][,2]),
         qtr=mdy(qtr))

I don’t often use that particular `magrittr` pipe, but it “feels right” in this case and is handy in a pinch.
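
For the unfamiliar, `%<>%` is `magrittr`’s compound-assignment pipe: it pipes the left-hand side through the right-hand side and assigns the result back to the left-hand side. A tiny, self-contained illustration:

library(magrittr)
library(dplyr)

df <- data_frame(x=1:3)

# these two statements do exactly the same thing
df %<>% mutate(y=x*2)
df <- df %>% mutate(y=x*2)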

If you visit some of the URLs directly, you’ll see that there are tables and/or lists of the expats’ names. However, there are woefully inconsistent naming & formatting conventions for these lists *and* (as I noted earlier) there’s no XPath 2 support in R. Therefore, we have to write a slightly more verbose XPath query to target the necessary table for scraping, since we need to account for the vastly different column-name structures across the tables.

NOTE: Older HTML pages may not have HTML tables at all and some only reference PDFs, so don’t rely on this code working beyond these particular dates (at least consistently); see the defensive sketch just after the extraction code below.

We’ll also tidy up the data into a neat `tibble` for plotting.

# grab the expat-name table from each doc, accounting for the three
# different column-header conventions used over the years
map(docs, ~html_nodes(., xpath=".//table[contains(., 'First name') or
                                         contains(., 'FIRST NAME') or
                                         contains(., 'FNAME')]")) %>%
  map(~html_table(.)[[1]]) -> tabs

# one row per individual, so the row count is the quarterly total
data_frame(date=register$qtr, count=map_int(tabs, nrow)) %>%
  filter(format(as.Date(date), "%Y") >= 2006) -> left
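
If you do push the date range further back (per the note above), one option is to wrap the extraction with `purrr::safely()` so documents without a parseable table yield `NULL` instead of killing the whole pipeline. A sketch (untested against the older, PDF-only notices):

# defensively extract the expat table; docs without a parseable
# table produce NULL instead of an error
safe_extract <- safely(function(doc) {
  html_nodes(doc, xpath=".//table[contains(., 'First name') or
                                  contains(., 'FIRST NAME') or
                                  contains(., 'FNAME')]") %>%
    html_table() %>%
    .[[1]]
})

results <- map(docs, safe_extract)
# NOTE: if any documents fail, drop the matching rows from `register`
# before pairing counts with quarters
tabs_ok <- map(results, "result") %>% discard(is.null)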

With the data wrangling work out of the way, we can tally up the throngs of folks desperate for greener pastures. First, by quarter:

gg <- ggplot(left, aes(date, count))
gg <- gg + geom_lollipop()
gg <- gg + geom_label(data=data.frame(),
                      aes(x=min(left$date), y=1500, label="# individuals"),
                      family="Arial Narrow", fontface="italic", size=3, label.size=0, hjust=0)
gg <- gg + scale_x_date(expand=c(0,14), limits=range(left$date))
gg <- gg + scale_y_continuous(expand=c(0,0), label=comma, limits=c(0,1520))
gg <- gg + labs(x=NULL, y=NULL,
                title="A Decade of Desertion",
                subtitle="Quarterly counts of U.S. individuals who have chosen to expatriate (2006-2016)",
                caption="Source: https://www.federalregister.gov/")
gg <- gg + theme_hrbrmstr_an(grid="Y")
gg

(Figure: “A Decade of Desertion” — quarterly expatriation counts, 2006-2016)

and, then annually:

left %>%
  mutate(year=format(date, "%Y")) %>%
  count(year, wt=count) %>%
  ggplot(aes(year, n)) -> gg

gg <- gg + geom_bar(stat="identity", width=0.6)
gg <- gg + geom_label(data=data.frame(), aes(x=0, y=5000, label="# individuals"),
                      family="Arial Narrow", fontface="italic", size=3, label.size=0, hjust=0)
gg <- gg + scale_y_continuous(expand=c(0,0), label=comma, limits=c(0,5100))
gg <- gg + labs(x=NULL, y=NULL,
                title="A Decade of Desertion",
                subtitle="Annual counts of U.S. individuals who have chosen to expatriate (2006-2016)",
                caption="Source: https://www.federalregister.gov/")
gg <- gg + theme_hrbrmstr_an(grid="Y")
gg

(Figure: “A Decade of Desertion” — annual expatriation counts, 2006-2016)

The exodus isn’t _massive_ but it’s actually more than I expected. It’d be interesting to track various U.S. tax code laws, the enactment of other compliance regulations and general news events to see if there are underlying reasons for the overall annual increases as well as the dips in some quarters (which could just be data collection hiccups by the Feds…after all, this is government work). If you want to do all the math for correcting survey errors, it’d also be interesting to normalize this by population, track all the data back to 1996 (when HIPAA mandated the creation & publication of this quarterly list) and then see if you can predict where we’ll be at the end of this year (though I suspect political events are a motivator for at least a decent fraction of some quarters).
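
A minimal sketch of that per-capita normalization, using rough, hand-entered U.S. Census Bureau population estimates (approximate figures, included here purely for illustration):

# approximate U.S. population by year, in millions (rough Census
# Bureau estimates, hand-entered for illustration only)
us_pop <- data_frame(
  year=as.character(2006:2016),
  pop_mm=c(298.4, 301.2, 304.1, 306.8, 309.3, 311.6, 313.9,
           316.1, 318.6, 320.9, 323.1)
)

left %>%
  mutate(year=format(date, "%Y")) %>%
  count(year, wt=count) %>%
  left_join(us_pop, by="year") %>%
  mutate(per_million=n/pop_mm) -> per_capita # expatriations per million residents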

I had tried to convert my data-saving workflows to [`feather`](https://github.com/wesm/feather/tree/master/R) but there have been [issues](https://github.com/wesm/feather/issues/155) with it supporting large files (that seem to be near resolution), so I’ve been continuing to use R Data files for local saving of processed/cleaned data.

I make _many_ of these files and sometimes I do it as a one-off effort, thinking that I’ll come back to it quickly. Inevitably, I don’t, and I also end up naming those one-offs badly. I made a small [R helper package](https://github.com/hrbrmstr/rdatainfo) to make it easier to check out these files at the command line (via a `bash` function), but it hit me that it’d be even easier if there were a way to use the macOS Quick Look feature (hitting `<space>` on a file icon) to see the previews.

Thus, [`QuickLookR`](https://github.com/hrbrmstr/QuickLookR) was born.
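
The R side of the preview boils down to dumping `str()` output for whatever the file contains. A minimal sketch of the idea (an approximation, not the actual `rdatainfo` source):

# approximate the core of an R data file "info" helper:
# show a str() summary of whatever the file holds
rdata_info <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext == "rds") {
    str(readRDS(path))
  } else if (ext %in% c("rda", "rdata")) {
    e <- new.env()
    load(path, envir=e) # .rda/.rdata files can hold multiple objects
    for (nm in ls(e)) str(get(nm, envir=e))
  } else {
    stop("not an R data file: ", path)
  }
  invisible(NULL)
}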

You need to [download the ZIP file](https://github.com/hrbrmstr/QuickLookR/releases/tag/v0.1.0), unzip it and save the `QuickLookR.qlgenerator` component into `~/Library/QuickLook`. Then `devtools::install_github('hrbrmstr/rdatainfo')` in an R session. If you’ve got R/Rscript in the standard `/usr/local/bin` location, then you should be able to hit `<space>` on any `.rdata`, `.rda` or `.rds` file and see a `str()` preview like this:

(Screenshot: a Quick Look `str()` preview of an R data file)

I haven’t cracked open Xcode in a while and my Objective-C is super-rusty, but this works on my El Capitan MacBook Pro (though I’m still trying to work out why some `.rds` files embedded in packages on my system get no previews).

If you have suggestions or issues, please use [github](https://github.com/hrbrmstr/QuickLookR/issues) to file them. For issues, it’d be really helpful if you included a copy of or link to files that don’t work well.

For the next revision, I plan on generating prettier HTML-based previews and linking against `R.framework` to avoid a call out to the system.

If Wes/Hadley really have fixed `feather`, I’ll be making a QuickLook plugin for that file format as well in the very near future.