New Pacakge “docxtractr” – Easily Extract Tables From Microsoft Word Docs

UPDATE: docxtractr is now on CRAN


This is more of a follow-up from yesterday’s post. The hack and function in said post was fine, but it was limited to uniform tables and made you do more work than you had to. So, there’s now a devtools-installable package on github that makes it way easier to get information about the tables in a Word document and extract them—uniform or not.

There are plenty of examples in the GitHub README and also in the package examples. But, I will show the basic functionality here.

The package ships with four example Word documents, but we’ll work with the last one: complex.doc. It has five tables and the last two have varying columns and rows and look like:

complex

Let’s read those two in:

complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))
 
docx_tbl_count(complx)
#> [1] 5
 
docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/docxtractr/examples/complex.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]
#> 
#> Table 4
#>   total cells: 11
#>   row count  : 4
#>   uniform    : unlikely => found differing cell counts (3, 2) across some rows 
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 5
#>   total cells: 21
#>   row count  : 7
#>   uniform    : likely!
#>   has header : unlikely
 
 
docx_extract_tbl(complx, 4, header=TRUE)
#> Source: local data frame [3 x 3]
#> 
#>   Foo  Bar Baz
#> 1  Aa BbCc  NA
#> 2  Dd   Ee  Ff
#> 3  Gg   Hh  ii
 
docx_extract_tbl(complx, 5, header=TRUE)
#> Source: local data frame [6 x 3]
#> 
#>    Foo Bar Baz
#> 1   Aa  Bb  Cc
#> 2   Dd  Ee  Ff
#> 3   Gg  Hh  Ii
#> 4 Jj88  Kk  Ll
#> 5       Uu  Ii
#> 6   Hh  Ii   h

It reads in “uniform” tables properly and will warn you if there is a header marked in Word but not asked for in the extraction.

Next steps are to both allow specifying column types and try to guess column types (readr has some nice functions for this) and perhaps return more metadata (if possible).

Feature requests & bug reports are most welcome on GitHub.

Data-Driven Security Podcast
Episode 29: With Great Power Law…
R World News Podcast — Episode 2
Buy on AmazonDDS Blog
DDS PodcastAmazon Author Page

3 Comments New Pacakge “docxtractr” – Easily Extract Tables From Microsoft Word Docs

  1. Pingback: Using R To Get Data *Out Of* Word Docs | rud.is

  2. Pingback: Distilled News | Data Analytics & R

Leave a Reply