{"id":10966,"date":"2018-07-02T06:39:59","date_gmt":"2018-07-02T11:39:59","guid":{"rendered":"https:\/\/rud.is\/b\/?p=10966"},"modified":"2018-07-02T21:07:17","modified_gmt":"2018-07-03T02:07:17","slug":"freeing-pdf-data-to-account-for-the-unaccounted","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/","title":{"rendered":"Freeing PDF Data to Account for the Unaccounted"},"content":{"rendered":"<p>I&#8217;ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I&#8217;m truly glad his eyes, pen and R console are covering the important events there.<\/p>\n<p>When I finally jumped on Twitter today, I saw this:<\/p>\n<blockquote class=\"twitter-tweet\" data-lang=\"en\">\n<p lang=\"en\" dir=\"ltr\"><a href=\"https:\/\/twitter.com\/hrbrmstr?ref_src=twsrc%5Etfw\">@hrbrmstr<\/a> Do you have an R rig to convert this large PDF to csv? I tried xpdf and <a href=\"https:\/\/twitter.com\/TabulaPDF?ref_src=twsrc%5Etfw\">@TabulaPDF<\/a> but I don&#39;t trust the results.<\/p>\n<p>&mdash; Matt Stiles (@stiles) <a href=\"https:\/\/twitter.com\/stiles\/status\/1013587088908808192?ref_src=twsrc%5Etfw\">July 2, 2018<\/a><\/p><\/blockquote>\n<p><script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<p>and went into action and figured I should blog the results as one can never have too many &#8220;convert this PDF to usable data&#8221; examples.<\/p>\n<h3>The Problem<\/h3>\n<p>The U.S. Defense POW\/MIA Accounting Agency maintains POW\/MIA data for all our nation&#8217;s service members. 
Matt is working with data <a href=\"http:\/\/www.dpaa.mil\/Our-Missing\/Korean-War\/Korean-War-POW-MIA-List\/\">from Korea<\/a> (the &#8220;All US Unaccounted-For&#8221; PDF direct link is in the code below) and needed to get the PDF into a usable form and (as you can see if you read through the Twitter thread) both Tabulizer and other tools were introducing sufficient errors that the resultant extracted data was either not complete or trustworthy enough to rely on (hand-checking nearly 8,000 records is not fun).<\/p>\n<p>The PDF in question was pretty uniform, save for the first and last pages. Here&#8217;s a sample:<\/p>\n<p class=\"jetpack-slideshow-noscript robots-nocontent\">This slideshow requires JavaScript.<\/p><div id=\"gallery-10966-1-slideshow\" class=\"jetpack-slideshow-window jetpack-slideshow jetpack-slideshow-black\" data-trans=\"fade\" data-autostart=\"1\" data-gallery=\"[{&quot;src&quot;:&quot;https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2018\\\/07\\\/pg1.png?fit=510%2C659\\u0026ssl=1&quot;,&quot;id&quot;:&quot;10967&quot;,&quot;title&quot;:&quot;POW\\\/MIA Reference Pages&quot;,&quot;alt&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;itemprop&quot;:&quot;image&quot;},{&quot;src&quot;:&quot;https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2018\\\/07\\\/pg2.png?fit=510%2C659\\u0026ssl=1&quot;,&quot;id&quot;:&quot;10968&quot;,&quot;title&quot;:&quot;pg2&quot;,&quot;alt&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;itemprop&quot;:&quot;image&quot;},{&quot;src&quot;:&quot;https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2018\\\/07\\\/pg3.png?fit=510%2C659\\u0026ssl=1&quot;,&quot;id&quot;:&quot;10969&quot;,&quot;title&quot;:&quot;pg3&quot;,&quot;alt&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;itemprop&quot;:&quot;image&quot;}]\" itemscope itemtype=\"https:\/\/schema.org\/ImageGallery\"><\/div>\n<p>We just need a reproducible way to extract the data with 
sufficient veracity to ensure we can use it faithfully.<\/p>\n<h2>The Solution<\/h2>\n<p>We&#8217;ll need some packages and the file itself, so let&#8217;s get that bit out of the way first:<\/p>\n<pre><code class=\"language-r\">library(stringi)\nlibrary(pdftools)\nlibrary(hrbrthemes)\nlibrary(ggpomological)\nlibrary(tidyverse)\n\n# grab the PDF text\nmia_url <- \"http:\/\/www.dpaa.mil\/portals\/85\/Documents\/KoreaAccounting\/pmkor_una_all.pdf\"\nmia_fil <- \"~\/Data\/pmkor_una_all.pdf\"\nif (!file.exists(mia_fil)) download.file(mia_url, mia_fil)\n\n# read it in\ndoc <- pdf_text(mia_fil) \n<\/code><\/pre>\n<p>Let's look at those three example pages:<\/p>\n<pre><code class=\"language-r\">cat(doc[[1]])\n##                                   Defense POW\/MIA Accounting Agency\n##                                       Personnel Missing - Korea (PMKOR)\n##                                        (Reported for ALL Unaccounted For)\n##                                                                                                Total Unaccounted: 7,699\n## Name                       Rank\/Rate     Branch                           Year State\/Territory\n## ABBOTT, RICHARD FRANK      M\/Sgt         UNITED STATES ARMY               1950 VERMONT\n## ABEL, DONALD RAYMOND       Pvt           UNITED STATES ARMY               1950 PENNSYLVANIA\n## ...\n## AKERS, HERBERT DALE        Cpl           UNITED STATES ARMY               1950 INDIANA\n## AKERS, JAMES FRANCIS       Cpl           UNITED STATES MARINE CORPS       1950 VIRGINIA\n\ncat(doc[[2]])\n## Name                          Rank\/Rate Branch                     Year State\/Territory\n## AKERS, RICHARD ALLEN          1st Lt    UNITED STATES ARMY         1951 PENNSYLVANIA\n## AKI, CLARENCE HALONA          Sgt       UNITED STATES ARMY         1950 HAWAII\n...\n## AMIDON, DONALD PRENTICE       PFC       UNITED STATES MARINE CORPS 1950 TEXAS\n## AMOS, CHARLES GEARL           Cpl       UNITED STATES ARMY         1951 NORTH 
CAROLINA\n\ncat(doc[[length(doc)]])\n## Name                                                Rank\/Rate           Branch                                              Year         State\/Territory\n## ZAVALA, FREDDIE                                     Cpl                 UNITED STATES ARMY                                  1951         CALIFORNIA\n## ZAWACKI, FRANK JOHN                                 Sgt                 UNITED STATES ARMY                                  1950         OHIO\n## ...\n## ZUVER, ROBERT LEONARD                               Pfc                 UNITED STATES ARMY                                  1950         CALIFORNIA\n## ZWILLING, LOUIS JOSEPH                              Cpl                 UNITED STATES ARMY                                  1951         ILLINOIS\n##                                       This list of Korean War missing personnel was prepared by the Defense POW\/MIA Accounting Agency (DPAA).\n##                Please visit our web site at http:\/\/www.dpaa.mil\/Our-Missing\/Korean-War-POW-MIA-List\/ for updates to this list and other official missing personnel data lists.\n## Report Prepared: 06\/19\/2018 11:25\n<\/code><\/pre>\n<p>The <code>poppler<\/code> library's \"layout\" mode (which <code>pdftools<\/code> uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:<\/p>\n<ul>\n<li>there's a uniform header on each page<\/li>\n<li>the \"layout\" mode returned uniform per-page, fixed-width columns<\/li>\n<li>there's no \"special column tweaks\" that some folks use to make PDFs more readable by humans<\/li>\n<\/ul>\n<p>There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:<\/p>\n<ul>\n<li>convert the text to lines<\/li>\n<li>find the header line<\/li>\n<li>find the column start\/end positions from the header on the page (since they are different for each 
page)<\/li>\n<li>read it in with <code>readr::read_fwf()<\/code><\/li>\n<li>remove headers, preamble and epilogue cruft <\/li>\n<li>turn it all into one data frame<\/li>\n<\/ul>\n<pre><code class=\"language-r\"># we're going to process each page and read_fwf will complain violently\n# when it hits header\/footer rows vs data rows and we rly don't need to\n# see all those warnings\nread_fwf_q <- quietly(read_fwf)\n\n# go through each page\nmap_df(doc, ~{\n  \n  stri_split_lines(.x) %>% \n    flatten_chr() -> lines # want the lines from each page\n  \n  # find the header on the page and get the starting locations for each column\n  keep(lines, stri_detect_regex, \"^Name\") %>% \n    stri_locate_all_fixed(c(\"Name\", \"Rank\", \"Branch\", \"Year\", \"State\")) %>% \n    map(`[`, 1) %>% \n    flatten_int() -> starts\n  \n  # now get the ending locations; cheating and using `NA` for the last column  \n  ends <- c(starts[-1] - 1, NA)\n\n  # since each page has a lovely header and poppler's \"layout\" mode creates \n  # a surprisingly usable fixed-width table, the core idiom is to find the start\/end\n  # of each column using the header as a canary\n  cols <- fwf_positions(starts, ends, col_names = c(\"name\", \"rank\", \"branch\", \"year\", \"state\"))\n\n  paste0(lines, collapse=\"\\n\") %>%        # turn it into something read_fwf() can read \n    read_fwf_q(col_positions = cols) %>%  # read it!\n    .$result %>%                          # need to do this b\/c of `quietly()`\n    filter(!is.na(name)) %>%              # non-data lines\n    filter(name != \"Name\") %>%            # remove headers from each page\n    filter(!stri_detect_regex(name, \"^(This|Please|Report)\")) # non-data lines (the last pg footer, rly)\n  \n}) -> xdf\n\nxdf\n## # A tibble: 7,699 x 5\n##    name                       rank   branch                  year  state        \n##    <chr>                      <chr>  <chr>                   <chr> <chr>        \n##  1 ABBOTT, RICHARD FRANK      
M\/Sgt  UNITED STATES ARMY      1950  VERMONT      \n##  2 ABEL, DONALD RAYMOND       Pvt    UNITED STATES ARMY      1950  PENNSYLVANIA \n##  3 ABELE, FRANCIS HOWARD      Sfc    UNITED STATES ARMY      1950  CONNECTICUT  \n##  4 ABELES, GEORGE ELLIS       Pvt    UNITED STATES ARMY      1950  CALIFORNIA   \n##  5 ABERCROMBIE, AARON RICHARD 1st Lt UNITED STATES AIR FORCE 1950  ALABAMA      \n##  6 ABREU, MANUEL Jr.          Pfc    UNITED STATES ARMY      1950  MASSACHUSETTS\n##  7 ACEVEDO, ISAAC             Sgt    UNITED STATES ARMY      1952  PUERTO RICO  \n##  8 ACINELLI, BILL JOSEPH      Pfc    UNITED STATES ARMY      1951  MISSOURI     \n##  9 ACKLEY, EDWIN FRANCIS      Pfc    UNITED STATES ARMY      1950  NEW YORK     \n## 10 ACKLEY, PHILIP WARREN      Pfc    UNITED STATES ARMY      1950  NEW HAMPSHIRE\n## # ... with 7,689 more rows\n<\/code><\/pre>\n<p>Now the data is both usable and sobering:<\/p>\n<pre><code class=\"language-r\">title <- \"Defense POW\/MIA Accounting Agency Personnel Missing - Korea\"\nsubtitle <- \"Reported for ALL Unaccounted For\"\ncaption <- \"Source: http:\/\/www.dpaa.mil\/portals\/85\/Documents\/KoreaAccounting\/pmkor_una_all.pdf\"\n\nmutate(xdf, year = factor(year)) %>% \n  mutate(branch = stri_trans_totitle(branch)) -> xdf\n\nordr <- count(xdf, branch, sort=TRUE)\n\nmutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% \n  ggplot(aes(year)) +\n  geom_bar(aes(fill = branch), width=0.65) +\n  scale_y_comma(name = \"# POW\/MIA\") +\n  scale_fill_pomological(name=NULL) +\n  labs(x = NULL, title = title, subtitle = subtitle) +\n  theme_ipsum_rc(grid=\"Y\") +\n  theme(plot.background = element_rect(fill = \"#fffeec\", color = \"#fffeec\")) +\n  theme(panel.background = element_rect(fill = \"#fffeec\", color = \"#fffeec\"))\n<\/code><\/pre>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"510\" height=\"289\" data-attachment-id=\"10979\" 
data-permalink=\"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/plot_zoom_png-12\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&amp;ssl=1\" data-orig-size=\"1636,928\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"plot_zoom_png\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=510%2C289&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?resize=510%2C289&#038;ssl=1\" alt=\"\" class=\"aligncenter size-full wp-image-10979\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. 
I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I&#8217;m truly glad his eyes, pen and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":10979,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":"","jetpack_post_was_ever_published":false},"categories":[91],"tags":[],"class_list":["post-10966","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-r"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v28.0 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Freeing PDF Data to Account for the Unaccounted - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Freeing PDF Data to Account for the Unaccounted - rud.is\" \/>\n<meta property=\"og:description\" content=\"I&#8217;ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. 
I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I&#8217;m truly glad his eyes, pen and [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2018-07-02T11:39:59+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-07-03T02:07:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&ssl=1\" \/>\n\t<meta property=\"og:image:width\" content=\"1636\" \/>\n\t<meta property=\"og:image:height\" content=\"928\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Freeing PDF Data to Account for the Unaccounted\",\"datePublished\":\"2018-07-02T11:39:59+00:00\",\"dateModified\":\"2018-07-03T02:07:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/\"},\"wordCount\":450,\"commentCount\":4,\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"image\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2018\\\/07\\\/plot_zoom_png.png?fit=1636%2C928&ssl=1\",\"articleSection\":[\"R\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/\",\"name\":\"Freeing PDF Data to Account for the Unaccounted - 
rud.is\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2018\\\/07\\\/plot_zoom_png.png?fit=1636%2C928&ssl=1\",\"datePublished\":\"2018-07-02T11:39:59+00:00\",\"dateModified\":\"2018-07-03T02:07:17+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#primaryimage\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2018\\\/07\\\/plot_zoom_png.png?fit=1636%2C928&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2018\\\/07\\\/plot_zoom_png.png?fit=1636%2C928&ssl=1\",\"width\":\"1636\",\"height\":\"928\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2018\\\/07\\\/02\\\/freeing-pdf-data-to-account-for-the-unaccounted\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/rud.is\\\/b\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Freeing PDF Data to Account for the Unaccounted\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. 
All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/rud.is\\\/b\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\\\/\\\/rud.is\"],\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/author\\\/hrbrmstr\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Freeing PDF Data to Account for the Unaccounted - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/","og_locale":"en_US","og_type":"article","og_title":"Freeing PDF Data to Account for the Unaccounted - rud.is","og_description":"I&#8217;ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a top-notch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I&#8217;m truly glad his eyes, pen and [&hellip;]","og_url":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/","og_site_name":"rud.is","article_published_time":"2018-07-02T11:39:59+00:00","article_modified_time":"2018-07-03T02:07:17+00:00","og_image":[{"width":1636,"height":928,"url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&ssl=1","type":"image\/png"}],"author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Freeing PDF Data to Account for the Unaccounted","datePublished":"2018-07-02T11:39:59+00:00","dateModified":"2018-07-03T02:07:17+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/"},"wordCount":450,"commentCount":4,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"image":{"@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&ssl=1","articleSection":["R"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/","url":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/","name":"Freeing PDF Data to Account for the Unaccounted - 
rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"primaryImageOfPage":{"@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#primaryimage"},"image":{"@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#primaryimage"},"thumbnailUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&ssl=1","datePublished":"2018-07-02T11:39:59+00:00","dateModified":"2018-07-03T02:07:17+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#primaryimage","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&ssl=1","width":"1636","height":"928"},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2018\/07\/02\/freeing-pdf-data-to-account-for-the-unaccounted\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Freeing PDF Data to Account for the Unaccounted"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2018\/07\/plot_zoom_png.png?fit=1636%2C928&ssl=1","jetpack_shortlink":"https:\/\/wp.me\/p23idr-2QS","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":6115,"url":"https:\/\/rud.is\/b\/2017\/07\/25\/r%e2%81%b6-general-attys-distributions\/","url_meta":{"origin":10966,"position":0},"title":"R\u2076 \u2014 General (Attys) Distributions","author":"hrbrmstr","date":"2017-07-25","format":false,"excerpt":"Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. 
Attorneys General longevity (given that the current US AG is on thin ice): Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A\u2026","rel":"","context":"In &quot;Data Visualization&quot;","block_context":{"text":"Data Visualization","link":"https:\/\/rud.is\/b\/category\/data-visualization\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/07\/plot_zoom_png-2.png?fit=1200%2C1076&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":4916,"url":"https:\/\/rud.is\/b\/2017\/01\/18\/workout-wednesday-redux-2017-week-3\/","url_meta":{"origin":10966,"position":1},"title":"Workout Wednesday Redux (2017 Week 3)","author":"hrbrmstr","date":"2017-01-18","format":false,"excerpt":"I had started a \"52 Vis\" initiative back in 2016 to encourage folks to get practice making visualizations since that's the only way to get better at virtually anything. 
Life got crazy, 52 Vis fell to the wayside and now there are more visible alternatives such as Makeover Monday and\u2026","rel":"","context":"In &quot;ggplot&quot;","block_context":{"text":"ggplot","link":"https:\/\/rud.is\/b\/category\/ggplot\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/state_of_us_2.png?fit=1200%2C548&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/state_of_us_2.png?fit=1200%2C548&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/state_of_us_2.png?fit=1200%2C548&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/state_of_us_2.png?fit=1200%2C548&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/state_of_us_2.png?fit=1200%2C548&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":12626,"url":"https:\/\/rud.is\/b\/2020\/01\/13\/convert-apple-card-pdf-statements-to-tidy-data-i-e-for-csv-excel-database-export\/","url_meta":{"origin":10966,"position":2},"title":"Convert Apple Card PDF Statements to Tidy Data (i.e. for CSV\/Excel\/database export)","author":"hrbrmstr","date":"2020-01-13","format":false,"excerpt":"UPDATE 2020-02-11 Apple now supports downloading transactions as CSV or OFX! (via MacObserver). I saw this CNBC article on an in-theory browser client-side-only conversion utility for taking Apple Card PDF statements and turning them into CSV files. 
Since I (a) never trust any browser or site and (b) the article\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9579,"url":"https:\/\/rud.is\/b\/2018\/04\/12\/convert-epub-to-text-for-processing-in-r\/","url_meta":{"origin":10966,"position":3},"title":"Convert epub to Text for Processing in R","author":"hrbrmstr","date":"2018-04-12","format":false,"excerpt":"@RMHoge asked the following on Twitter: Hello #rstats hyve mind! Is there a package that reads epub into R? I can not find any, I now convert to text and parse the text but you sort of lose the structure of the text. Pinging @dataandme @hrbrmstr\u2014 Roel (@RMHoge) April 12,\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":4942,"url":"https:\/\/rud.is\/b\/2017\/01\/26\/one-view-of-the-impact-of-the-new-immigration-ban-freeing-pdf-data-with-tabulizer\/","url_meta":{"origin":10966,"position":4},"title":"One View of the Impact of the New Immigration Ban (+ freeing PDF data with tabulizer)","author":"hrbrmstr","date":"2017-01-26","format":false,"excerpt":"Dear Leader has made good on his campaign promise to \"crack down\" on immigration from \"dangerous\" countries. 
I wanted to both see one side of the impact of that decree \u2014 how many potential immigrants per year might this be impacting \u2014 and show toss up some code that shows\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/RStudio.png?fit=1200%2C530&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/RStudio.png?fit=1200%2C530&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/RStudio.png?fit=1200%2C530&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/RStudio.png?fit=1200%2C530&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/01\/RStudio.png?fit=1200%2C530&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":6215,"url":"https:\/\/rud.is\/b\/2017\/09\/04\/readability-redux\/","url_meta":{"origin":10966,"position":5},"title":"Readability Redux","author":"hrbrmstr","date":"2017-09-04","format":false,"excerpt":"I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text. 
I gave it a spin so folks could compare some basic output, but\u2026","rel":"","context":"In &quot;data wrangling&quot;","block_context":{"text":"data wrangling","link":"https:\/\/rud.is\/b\/category\/data-wrangling\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/10966","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=10966"}],"version-history":[{"count":11,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/10966\/revisions"}],"predecessor-version":[{"id":10983,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/10966\/revisions\/10983"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media\/10979"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=10966"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=10966"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=10966"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}