I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.
When I finally jumped on Twitter today, I saw this:
@hrbrmstr Do you have an R rig to convert this large PDF to csv? I tried xpdf and @TabulaPDF but I don't trust the results.
— Matt Stiles (@stiles) July 2, 2018
and sprang into action, figuring I should blog the results since one can never have too many "convert this PDF to usable data" examples.
The Problem
The U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all of our nation's service members. Matt is working with data from Korea (the "All US Unaccounted-For" PDF direct link is in the code below) and needed to get the PDF into a usable form. As you can see if you read through the Twitter thread, both Tabulizer and other tools were introducing enough errors that the extracted data was either incomplete or not trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).
The PDF in question was pretty uniform, save for the first and last pages.
We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.
The Solution
We’ll need some packages and the file itself, so let’s get that bit out of the way first:
library(stringi)
library(pdftools)
library(hrbrthemes)
library(ggpomological)
library(tidyverse)
# grab the PDF text
mia_url <- "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"
mia_fil <- "~/Data/pmkor_una_all.pdf"
if (!file.exists(mia_fil)) download.file(mia_url, mia_fil)
# read it in
doc <- pdf_text(mia_fil)
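pdf_text() hands back a plain character vector with one element per page, so a quick length check (just a sketch; the page count may change if DPAA updates the file) confirms we got the whole report:
# one element per page; the report had 119 pages when this was written
length(doc)
## [1] 119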
Let's look at three example pages (the first, the second, and the last):
cat(doc[[1]])
## Defense POW/MIA Accounting Agency
## Personnel Missing - Korea (PMKOR)
## (Reported for ALL Unaccounted For)
## Total Unaccounted: 7,699
## Name Rank/Rate Branch Year State/Territory
## ABBOTT, RICHARD FRANK M/Sgt UNITED STATES ARMY 1950 VERMONT
## ABEL, DONALD RAYMOND Pvt UNITED STATES ARMY 1950 PENNSYLVANIA
## ...
## AKERS, HERBERT DALE Cpl UNITED STATES ARMY 1950 INDIANA
## AKERS, JAMES FRANCIS Cpl UNITED STATES MARINE CORPS 1950 VIRGINIA
cat(doc[[2]])
## Name Rank/Rate Branch Year State/Territory
## AKERS, RICHARD ALLEN 1st Lt UNITED STATES ARMY 1951 PENNSYLVANIA
## AKI, CLARENCE HALONA Sgt UNITED STATES ARMY 1950 HAWAII
## ...
## AMIDON, DONALD PRENTICE PFC UNITED STATES MARINE CORPS 1950 TEXAS
## AMOS, CHARLES GEARL Cpl UNITED STATES ARMY 1951 NORTH CAROLINA
cat(doc[[length(doc)]])
## Name Rank/Rate Branch Year State/Territory
## ZAVALA, FREDDIE Cpl UNITED STATES ARMY 1951 CALIFORNIA
## ZAWACKI, FRANK JOHN Sgt UNITED STATES ARMY 1950 OHIO
## ...
## ZUVER, ROBERT LEONARD Pfc UNITED STATES ARMY 1950 CALIFORNIA
## ZWILLING, LOUIS JOSEPH Cpl UNITED STATES ARMY 1951 ILLINOIS
## This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA).
## Please visit our web site at http://www.dpaa.mil/Our-Missing/Korean-War-POW-MIA-List/ for updates to this list and other official missing personnel data lists.
## Report Prepared: 06/19/2018 11:25
The poppler library's "layout" mode (which pdftools uses brilliantly), combined with the author of the PDF not being evil, will help us make short work of this since:
- there's a uniform header on each page
- the "layout" mode returned uniform per-page, fixed-width columns
- there are none of the "special column tweaks" some folks use to make PDFs more readable by humans (see the quick sketch after this list)
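Here's a quick illustration of the core idiom before we use it for real — a sketch against page 2 only, using the same stringi/purrr calls as the pipeline below (pg_lines is just a throwaway variable for this demo):
# illustrative sketch (page 2 only): find the header line and locate the
# character offset where each column label begins
pg_lines <- flatten_chr(stri_split_lines(doc[[2]]))
keep(pg_lines, stri_detect_regex, "^Name") %>%
  stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>%
  map(`[`, 1) %>%
  flatten_int()
Those offsets shift a bit from page to page, which is exactly why the real pipeline recomputes them on every page rather than hard-coding them.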
There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:
- convert the text to lines
- find the header line
- find the column start/end positions from the header on the page (since they are different for each page)
- read it in with readr::read_fwf()
- remove headers, preamble and epilogue cruft
- turn it all into one data frame
# we're going to process each page and read_fwf will complain violently
# when it hits header/footer rows vs data rows and we rly don't need to
# see all those warnings
read_fwf_q <- quietly(read_fwf)

# go through each page
map_df(doc, ~{

  stri_split_lines(.x) %>%
    flatten_chr() -> lines # want the lines from each page

  # find the header on the page and get the starting locations for each column
  keep(lines, stri_detect_regex, "^Name") %>%
    stri_locate_all_fixed(c("Name", "Rank", "Branch", "Year", "State")) %>%
    map(`[`, 1) %>%
    flatten_int() -> starts

  # now get the ending locations; cheating and using `NA` for the last column
  ends <- c(starts[-1] - 1, NA)

  # since each page has a lovely header and poppler's "layout" mode creates
  # a surprisingly usable fixed-width table, the core idiom is to find the
  # start/end of each column using the header as a canary
  cols <- fwf_positions(starts, ends, col_names = c("name", "rank", "branch", "year", "state"))

  paste0(lines, collapse="\n") %>%       # turn it into something read_fwf() can read
    read_fwf_q(col_positions = cols) %>% # read it!
    .$result %>%                         # need to do this b/c of `quietly()`
    filter(!is.na(name)) %>%             # non-data lines
    filter(name != "Name") %>%           # remove headers from each page
    filter(!stri_detect_regex(name, "^(This|Please|Report)")) # non-data lines (the last pg footer, rly)

}) -> xdf
xdf
## # A tibble: 7,699 x 5
## name rank branch year state
## <chr> <chr> <chr> <chr> <chr>
## 1 ABBOTT, RICHARD FRANK M/Sgt UNITED STATES ARMY 1950 VERMONT
## 2 ABEL, DONALD RAYMOND Pvt UNITED STATES ARMY 1950 PENNSYLVANIA
## 3 ABELE, FRANCIS HOWARD Sfc UNITED STATES ARMY 1950 CONNECTICUT
## 4 ABELES, GEORGE ELLIS Pvt UNITED STATES ARMY 1950 CALIFORNIA
## 5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950 ALABAMA
## 6 ABREU, MANUEL Jr. Pfc UNITED STATES ARMY 1950 MASSACHUSETTS
## 7 ACEVEDO, ISAAC Sgt UNITED STATES ARMY 1952 PUERTO RICO
## 8 ACINELLI, BILL JOSEPH Pfc UNITED STATES ARMY 1951 MISSOURI
## 9 ACKLEY, EDWIN FRANCIS Pfc UNITED STATES ARMY 1950 NEW YORK
## 10 ACKLEY, PHILIP WARREN Pfc UNITED STATES ARMY 1950 NEW HAMPSHIRE
## # ... with 7,689 more rows
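Matt's original ask was a CSV, so the last mile is a one-liner via readr (the output path here is just an example):
# save the extracted records for downstream use (path is illustrative)
write_csv(xdf, "~/Data/pmkor_una_all.csv")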
Now the data is both usable and sobering:
title <- "Defense POW/MIA Accounting Agency Personnel Missing - Korea"
subtitle <- "Reported for ALL Unaccounted For"
caption <- "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf"
mutate(xdf, year = factor(year)) %>%
  mutate(branch = stri_trans_totitle(branch)) -> xdf

ordr <- count(xdf, branch, sort=TRUE)

mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>%
  ggplot(aes(year)) +
  geom_bar(aes(fill = branch), width = 0.65) +
  scale_y_comma(name = "# POW/MIA") +
  scale_fill_pomological(name = NULL) +
  labs(x = NULL, title = title, subtitle = subtitle, caption = caption) +
  theme_ipsum_rc(grid = "Y") +
  theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) +
  theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec"))
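If you want to keep the chart around, ggplot2's ggsave() does the trick (the filename and dimensions below are just a guess that suits a wide bar chart):
# save the finished chart (filename & dimensions are illustrative)
ggsave("pmkor-una-by-year.png", width = 10, height = 6)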