New Package “docxtractr” – Easily Extract Tables From Microsoft Word Docs

UPDATE: `docxtractr` is now [on CRAN](https://cran.rstudio.com/web/packages/docxtractr/index.html)

---

This is a follow-up to [yesterday’s post](http://rud.is/b/2015/08/23/using-r-to-get-data-out-of-word-docs/). The hack and function in that post were fine, but they were limited to uniform tables and made you do more work than necessary. So there’s now a `devtools`-installable package [on GitHub](https://github.com/hrbrmstr/docxtractr) that makes it much easier to get information about the tables in a Word document and extract them, uniform or not.
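
For reference, installation might look like the following. This is a sketch that assumes the standard `install.packages()` and `devtools::install_github()` routes; pick whichever source you prefer:

```r
# Released version from CRAN
install.packages("docxtractr")

# Or the development version from GitHub (requires the devtools package)
# install.packages("devtools")
devtools::install_github("hrbrmstr/docxtractr")
```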

There are plenty of examples in the GitHub README and in the package examples, but I’ll show the basic functionality here.

The package ships with four example Word documents, but we’ll work with the last one: `complex.docx`. It has five tables, and the last two have varying columns and rows. They look like this:

[Image: the last two tables in `complex.docx`]

Let’s read those two in:

library(docxtractr)

complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

docx_tbl_count(complx)
#> [1] 5

docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/docxtractr/examples/complex.docx]
#> 
#> Table 1
#>   total cells: 16
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [This, Is, A, Column]
#> 
#> Table 2
#>   total cells: 12
#>   row count  : 4
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 3
#>   total cells: 14
#>   row count  : 7
#>   uniform    : likely!
#>   has header : likely! => possibly [Foo, Bar]
#> 
#> Table 4
#>   total cells: 11
#>   row count  : 4
#>   uniform    : unlikely => found differing cell counts (3, 2) across some rows 
#>   has header : likely! => possibly [Foo, Bar, Baz]
#> 
#> Table 5
#>   total cells: 21
#>   row count  : 7
#>   uniform    : likely!
#>   has header : unlikely


docx_extract_tbl(complx, 4, header=TRUE)
#> Source: local data frame [3 x 3]
#> 
#>   Foo  Bar Baz
#> 1  Aa BbCc  NA
#> 2  Dd   Ee  Ff
#> 3  Gg   Hh  ii

docx_extract_tbl(complx, 5, header=TRUE)
#> Source: local data frame [6 x 3]
#> 
#>    Foo Bar Baz
#> 1   Aa  Bb  Cc
#> 2   Dd  Ee  Ff
#> 3   Gg  Hh  Ii
#> 4 Jj88  Kk  Ll
#> 5       Uu  Ii
#> 6   Hh  Ii   h
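
The same calls can be looped to pull every table in one pass. The following is a minimal sketch that uses only the functions shown above. The `all_tbls` list and the `table_N` names are illustrative, and `header=TRUE` is assumed for every table even though `docx_describe_tbls()` flagged Table 5 as unlikely to have a header:

```r
library(docxtractr)

complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

# Extract each table by index; header=TRUE is an assumption applied to all tables.
all_tbls <- lapply(seq_len(docx_tbl_count(complx)), function(i) {
  docx_extract_tbl(complx, i, header=TRUE)
})

# Illustrative names so individual tables can be picked out later.
names(all_tbls) <- paste0("table_", seq_along(all_tbls))

# Column count of each extracted table
sapply(all_tbls, ncol)
```

Individual tables are then available as, for example, `all_tbls$table_4`.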

It reads in “uniform” tables properly and will warn you if there is a header marked in Word but not asked for in the extraction.
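
To make the "differing cell counts (3, 2)" note for Table 4 concrete, here is a rough base R sketch of how such a check and the `NA` padding could work. It is not the package's actual implementation, and the `rows` list is a hypothetical stand-in reconstructed from the printed output above:

```r
# Hypothetical raw rows of Table 4: one character vector per row.
rows <- list(
  c("Foo", "Bar", "Baz"),
  c("Aa", "BbCc"),          # this row has only two cells
  c("Dd", "Ee", "Ff"),
  c("Gg", "Hh", "ii")
)

# A table is uniform when every row has the same number of cells.
cell_counts <- lengths(rows)
is_uniform  <- length(unique(cell_counts)) == 1

# Pad short rows with NA up to the widest row so they can be stacked.
width  <- max(cell_counts)
padded <- lapply(rows, function(r) c(r, rep(NA_character_, width - length(r))))

# Treat the first row as the header and the rest as data.
tbl <- as.data.frame(do.call(rbind, padded[-1]), stringsAsFactors = FALSE)
names(tbl) <- padded[[1]]

tbl
```

Here `sum(cell_counts)` matches the 11 total cells reported for Table 4, and the short row ends up with `NA` in the `Baz` column, as in the extracted result.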

Next steps are to both allow specifying column types and try to guess column types (`readr` has some nice functions for this) and perhaps return more metadata (if possible).
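
Until then, both ideas can be approximated after extraction. The sketch below uses base R's `type.convert()` for guessing rather than `readr`, and a plain list of converter functions for explicit types. The `tbl` data frame and the `col_types` list are hypothetical and not part of the package:

```r
# Hypothetical extracted table: every column arrives as character.
tbl <- data.frame(
  id    = c("1", "2", "3"),
  score = c("1.5", "2.25", "3"),
  label = c("Aa", "Bb", "Cc"),
  stringsAsFactors = FALSE
)

# Guessing: type.convert() picks logical, integer, double, or character
# per column; as.is = TRUE keeps text as character instead of factor.
guessed <- tbl
guessed[] <- lapply(tbl, type.convert, as.is = TRUE)

# Specifying: apply one converter function per column, matched by name.
col_types <- list(id = as.integer, score = as.numeric, label = as.character)
typed <- tbl
typed[] <- Map(function(col, f) f(col), tbl, col_types[names(tbl)])

sapply(guessed, class)
sapply(typed, class)
```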

Feature requests & bug reports are most welcome [on GitHub](https://github.com/hrbrmstr/docxtractr/issues).
