
Dear Leader has made good on his campaign promise to “crack down” on immigration from “dangerous” countries. I wanted to both see one side of the impact of that decree (how many potential immigrants per year might this be impacting) and toss up some code that shows how to free data from PDF documents using the @rOpenSci tabulizer package, authored by @thosjleeper (since knowing how to find, free and validate the veracity of U.S. gov data is kinda ++paramount now).

This is just one view and I encourage others to find, grab and blog other visa-related data and other government data in general.

So, the data is locked up in this PDF document:

As PDF documents go, it’s not horribad since the tables are fairly regular. But I’m not transcribing that and traditional PDF text extracting tools on the command-line or in R would also require writing more code than I have time for right now.

Enter: tabulizer — an R package that wraps tabula Java functions and makes them simple to use. I’m only showing one aspect of it here and you should check out the aforelinked tutorial to see all the features.

First, we need to set up our environment, download the PDF and extract the tables with tabulizer:

library(tabulizer)
library(hrbrmisc)
library(ggalt)
library(stringi)
library(tidyverse)

URL <- "https://travel.state.gov/content/dam/visas/Statistics/AnnualReports/FY2016AnnualReport/FY16AnnualReport-TableIII.pdf"
fil <- basename(URL)
# mode="wb" keeps the PDF from being corrupted when downloading on Windows
if (!file.exists(fil)) download.file(URL, fil, mode="wb")

tabs <- tabulizer::extract_tables(fil)

You should str(tabs) in your R session. It found all our data, but put it into a list with 7 elements. You actually need to peruse this list to see where it mis-aligned columns. In the “old days”, reading this in and cleaning it up would have taken the form of splitting & replacing elements in character vectors. Now, after our inspection, we can exclude rows we don’t want, move columns around and get a nice tidy data frame with very little effort:
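
As a rough sketch of what that inspection can look like, the snippet below builds a small made-up list of character matrices (`toy_tabs`, a hypothetical stand-in for `tabs`, not real output from the PDF) and reports each element's dimensions so that extra columns stand out:

```r
# `toy_tabs` is a hypothetical stand-in for the list extract_tables() returns;
# the values are invented purely for illustration.
toy_tabs <- list(
  matrix(c("Foreign State", "Total",
           "Country A",     "1,234",
           "Country B",     "56"),
         ncol = 2, byrow = TRUE),
  matrix(c("Country C", "7",
           "Country D", "89"),
         ncol = 2, byrow = TRUE),
  matrix(c("Country E", "", "10",
           "Totals",    "", "1,396"),
         ncol = 3, byrow = TRUE)
)

# One row per list element: how many rows and columns each matrix has
dims <- t(sapply(toy_tabs, dim))
colnames(dims) <- c("rows", "cols")
dims

# Elements whose column count differs from the most common one are the
# first ones to eyeball (and probably trim)
expected_cols <- as.integer(names(which.max(table(dims[, "cols"]))))
suspects <- which(dims[, "cols"] != expected_cols)
suspects

# Peek at the top of each suspect element
lapply(toy_tabs[suspects], head, n = 3)
```

With the real `tabs`, the same idea applies: compare column counts across elements, then look at the odd ones by hand before deciding which rows and columns to drop.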

bind_rows(
  tbl_df(tabs[[1]][-1,]),
  tbl_df(tabs[[2]][-c(12,13),]),
  tbl_df(tabs[[3]][-c(7, 10:11), -2]),
  tbl_df(tabs[[4]][-21,]),
  tbl_df(tabs[[5]]),
  tbl_df(tabs[[6]][-c(6:7, 30:32),]),
  tbl_df(tabs[[7]][-c(11:12, 25:27),])
) %>%
  setNames(c("foreign_state", "immediate_relatives",  "special_immigrants",
             "family_preference", "employment_preference", "diversity_immigrants","total")) %>% 
  mutate_each(funs(make_numeric), -foreign_state) %>%
  mutate(foreign_state=trimws(foreign_state)) -> total_visas_2016
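
`make_numeric()` comes from hrbrmisc, which is not on CRAN, and `tbl_df()` / `mutate_each()` are deprecated in newer dplyr releases. Here is a minimal sketch of the same cleanup step, assuming dplyr 1.0 or later. The helper `to_number()` is hypothetical (a guess at the behaviour needed here, not hrbrmisc's actual implementation) and the values are made up:

```r
library(dplyr)

# Hypothetical stand-in for hrbrmisc::make_numeric(): drop thousands
# separators and whitespace, then convert; blank strings become NA
to_number <- function(x) {
  x <- gsub("[,[:space:]]", "", x)
  suppressWarnings(as.numeric(x))
}

# Made-up rows shaped like the extracted table (every column is character)
toy <- tibble::tibble(
  foreign_state       = c(" Country A", "Country B "),
  immediate_relatives = c("1,234", "56"),
  total               = c("2,000", "")
)

cleaned <- toy %>%
  mutate(across(-foreign_state, to_number)) %>%      # convert all count columns
  mutate(foreign_state = trimws(foreign_state))      # tidy up the country names

cleaned
```

If something like this were swapped into the pipeline above, `across(-foreign_state, to_number)` would take the place of the `mutate_each(funs(make_numeric), -foreign_state)` call.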

I’ve cleaned up PDFs before and that code was a joy to write compared to previous efforts. No use of purrr since I was referencing the list structure in the console as I entered in the various matrix coordinates to edit out.

Finally, we can extract the target “bad” countries and see how many human beings could be impacted this year by referencing immigration stats for last year:

filter(total_visas_2016, foreign_state %in% c("Iran", "Iraq", "Libya", "Somalia", "Sudan", "Syria", "Yemen")) %>%
  gather(preference, value, -foreign_state) %>%
  mutate(preference=stri_replace_all_fixed(preference, "_", " " )) %>%
  mutate(preference=stri_trans_totitle(preference)) -> banned_visas

ggplot(banned_visas, aes(foreign_state, value)) +
  geom_col(width=0.65) +
  scale_y_continuous(expand=c(0,5), labels=scales::comma) +
  facet_wrap(~preference, scales="free_y") +
  labs(x=NULL, y="# Visas", title="Immigrant Visas Issued (2016)",
       subtitle="By Foreign State of Chargeability or Place of Birth; Fiscal Year 2016; [Total n=31,804] — Note free Y scales",
       caption="Visa types explanation: https://travel.state.gov/content/visas/en/general/all-visa-categories.html\nSource: https://travel.state.gov/content/visas/en/law-and-policy/statistics/annual-reports/report-of-the-visa-office-2016.html") +
  theme_hrbrmstr_msc(grid="Y") +
  theme(axis.text=element_text(size=12))

~32,000 human beings potentially impacted, many of whom will remain separated from family (“family preference”); plus, the business impact of losing access to skilled labor (“employment preference”).
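
Since validating the data matters as much as freeing it, one cheap sanity check is that the category columns add up to the reported total on every row. Here is a sketch with invented numbers (`toy_visas`, its column names and its values are illustrative, not taken from the PDF):

```r
library(dplyr)

toy_visas <- tibble::tibble(
  foreign_state         = c("Country A", "Country B"),
  immediate_relatives   = c(100, 10),
  special_immigrants    = c(5, 0),
  family_preference     = c(40, 2),
  employment_preference = c(20, 1),
  diversity_immigrants  = c(35, 7),
  total                 = c(200, 25)   # second row is deliberately off by 5
)

# Recompute each row's total from its parts and keep only the rows that disagree
mismatches <- toy_visas %>%
  mutate(recomputed = immediate_relatives + special_immigrants +
           family_preference + employment_preference + diversity_immigrants) %>%
  filter(recomputed != total) %>%
  select(foreign_state, total, recomputed)

mismatches
```

Any row that shows up in `mismatches` is worth re-checking against the PDF, since a shifted column or a dropped digit during extraction would surface here.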

Go forth and find more US gov data to free (before it disappears)!

10 Comments

  1. That’s good work, thank you very much!
    Apropos ‘freeing before it disappears’: Wouldn’t it make sense to make the complete but cleaned/tidied data available from “here” (the blog) in a better way than the pdf?

    • Aye. Prbly shld do that but this way lots of folks can jump in. Sadly, we’re going to need to develop vetting mechanisms for post-gov repositories to ensure the veracity of the data. Brave new world ahead.

  2. You probably should divide all the numbers by four, since the country ban is in effect for 90 days, to be replaced by a vetting system that is more extensive than the one put in place by the previous President, when he and Congress decided to single out those seven countries. Also instructive would be a comparison of the numbers from four of those countries (Iran, Iraq, Sudan and Syria), before and after December 2015 (when the Visa Waiver … and Terrorist Travel Prevention Act was signed into law) and before and after February 2016, when the State Department added Libya, Somalia and Yemen to the restricted list. Thanks for the instructive post!

    • Thanks for opposing partisan rhetoric (you’ve got that memorized pretty well, too, so it’s likely cut/paste from one of the “blog comment army” members). Both parties really need better operatives.

      Dividing by four may not be sufficient since the “shock and awe” factor of the flawed executive order (which the naive administration ended up rolling parts of back, though not all TSA stormtroopers seem to have gotten the message and even detain folks from outside that list based on malicious whim) probably accomplished a secondary goal of curtailing desire to come here (frankly, I don’t blame those who would feel that way now, either).

      We’ll see how the next 90 days goes.

      • Hi Bob,

        Inspired by your “Resistance” post (and as a regular reader of both of y’all’s blogs), I pointed my Dad (Allan Haley, comment above) here in the spirit of engaging and encouraging folks who think critically about the same topics to be more aware of each other – I’ve never known him to cut and paste (usually quite the opposite ;), but I will always look forward to seeing where respectful dialogue leads between two smart people.

        • Definitely my bad and a reply with apology is in-comment-stream. I am not defending my reply, just noting that your dad’s detraction came after 8 seethingly bad comments with similar tone and text (but far worse language & sentiment) were deleted from the moderation queue. I should have waited longer to approve & reply.

  3. Well, I don’t know about any “blog comment army”. I’m just a lawyer who likes to stick to the text, since that defines what the courts will examine. Your analysis ignores the 90-day provision in the text, and reads in subjective factors, like the reactions of TSA employees to (admittedly imperfect) initial instructions, which were clarified as soon as the problems with them became evident. (“Storm troopers”, really? Aren’t you validating Godwin’s Law on your own blog?)

    You are certainly entitled to take subjective factors into account; don’t get me wrong. Please just don’t ascribe your prejudgments to me. I learned a lot by coming here and studying your coding and techniques; I simply disagreed (respectfully) with your assumptions, and tried to explain why.

    Again, you may disagree, as you have indicated. But I would like to feel free to comment here again without having to anticipate your reaction in advance.

    You don’t have to post this reply; it’s enough that you read it. Grace and peace to you, and may God bless all your endeavors.

    • It just really looked like you were supporting the extremely bad executive order and were the victim (by me — with apologies) of being the only articulate detracting comment I could approve & show (much of your text aligned with 8 comments that had similar statements but also highly offensive material that I could not in any good conscience allow on the blog).

      Only since you noted the reading of the text, there are (were? not sure with the limbo state since the stay) both 90 and 120 day provisions in the order. Can we at least agree on dividing by 3 (trying a recovery with a bit of levity)? I’m not sure I can do the proper math to project the possible lack of desire to immigrate here and how that will impact the overall view. The following is more subjectivity as I’m ascribing unstated motive to the POTUS, but it’s likely scaring ppl away from trying to immigrate was a big part of the shock-and-awe exec order.

      At least the slow — but still there — checks and balances of our system seem to have prevailed.

      Many thanks for the kind words about the blog and for raising a great daughter (I have yet to meet her but she’s wicked smart).

  4. I can’t even imagine the vitriol you both must deal with behind the moderation scenes…thank you for fighting the good fight through the muck! It’s not that I’ve explicitly dispensed with comments on my own blog, as much as it is I just haven’t figured out how to enable them in Jekyll at all. Like you said, wicked smart ;)

  5. Hi

    The banned_visas didn’t work for me :(

    I got up to total_visas_2016 without problems

