UPDATE: `docxtractr` is now [on CRAN](https://cran.rstudio.com/web/packages/docxtractr/index.html)
———————
This is a follow-up to [yesterday’s post](http://rud.is/b/2015/08/23/using-r-to-get-data-out-of-word-docs/). The hack and function in that post were fine, but they were limited to uniform tables and made you do more work than you had to. So, there’s now a `devtools`-installable package [on GitHub](https://github.com/hrbrmstr/docxtractr) that makes it way easier to get information about the tables in a Word document and extract them, uniform or not.
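For reference, installing it usually looks something like the following. This is a minimal sketch that assumes `devtools` is already installed; the repository path comes from the GitHub link above, and the CRAN line follows from the update note at the top of this post. Pick one of the two.

```r
# Development version from GitHub (assumes devtools is already installed)
devtools::install_github("hrbrmstr/docxtractr")

# Or the released version from CRAN
install.packages("docxtractr")
```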
There are plenty of examples in the GitHub README and also in the package examples. But, I will show the basic functionality here.
The package ships with four example Word documents, but we’ll work with the last one: `complex.docx`. It has five tables, and the last two have varying columns and rows.
Let’s read those two in:
library(docxtractr)
complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))
docx_tbl_count(complx)
#> [1] 5
docx_describe_tbls(complx)
#> Word document [/Library/Frameworks/R.framework/Versions/3.2/Resources/library/docxtractr/examples/complex.docx]
#>
#> Table 1
#> total cells: 16
#> row count : 4
#> uniform : likely!
#> has header : likely! => possibly [This, Is, A, Column]
#>
#> Table 2
#> total cells: 12
#> row count : 4
#> uniform : likely!
#> has header : likely! => possibly [Foo, Bar, Baz]
#>
#> Table 3
#> total cells: 14
#> row count : 7
#> uniform : likely!
#> has header : likely! => possibly [Foo, Bar]
#>
#> Table 4
#> total cells: 11
#> row count : 4
#> uniform : unlikely => found differing cell counts (3, 2) across some rows
#> has header : likely! => possibly [Foo, Bar, Baz]
#>
#> Table 5
#> total cells: 21
#> row count : 7
#> uniform : likely!
#> has header : unlikely
docx_extract_tbl(complx, 4, header=TRUE)
#> Source: local data frame [3 x 3]
#>
#> Foo Bar Baz
#> 1 Aa BbCc NA
#> 2 Dd Ee Ff
#> 3 Gg Hh ii
docx_extract_tbl(complx, 5, header=TRUE)
#> Source: local data frame [6 x 3]
#>
#> Foo Bar Baz
#> 1 Aa Bb Cc
#> 2 Dd Ee Ff
#> 3 Gg Hh Ii
#> 4 Jj88 Kk Ll
#> 5 Uu Ii
#> 6 Hh Ii h
It reads in “uniform” tables properly and will warn you if there is a header marked in Word but not asked for in the extraction.
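Here is a rough sketch of the header case, using the same example document. It assumes the first row of table 1 is marked as a header in Word, which is what the `has header : likely!` line in the description above suggests; I have not shown the exact wording of the notice because it may differ between package versions.

```r
library(docxtractr)

complx <- read_docx(system.file("examples/complex.docx", package="docxtractr"))

# Table 1 is reported above as likely having a header row.
# Asking for header=FALSE keeps that row as data; if the row is marked as a
# header in Word, this is the case that should trigger the notice.
tbl1_raw <- docx_extract_tbl(complx, 1, header=FALSE)

# Same table with the first row used for the column names
tbl1 <- docx_extract_tbl(complx, 1, header=TRUE)

# The header row accounts for the difference in row counts
nrow(tbl1_raw) - nrow(tbl1)
```

Comparing the two results is a quick way to check whether the first row really belongs in the data or in the column names.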
Next steps are to allow specifying column types, to try guessing them (`readr` has some nice functions for this), and perhaps to return more metadata (if possible).
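Until then, column types can be handled after extraction. The sketch below is not how the package does it (it does not do this yet); it just shows `readr::type_convert()` on a made-up data frame standing in for an extracted table, where every column arrives as character. The `tbl` data and its column names are hypothetical.

```r
library(readr)

# Hypothetical stand-in for an extracted table: every column is character,
# which is how cell text comes out of a Word document.
tbl <- data.frame(
  name  = c("Aa", "Dd", "Gg"),
  count = c("1", "22", "333"),
  ratio = c("0.5", "1.25", "3"),
  stringsAsFactors = FALSE
)

# Guess a type for each character column
# (suppressMessages() hides the column specification readr prints)
typed <- suppressMessages(type_convert(tbl))
str(typed)

# Or state the column types explicitly instead of guessing
typed_explicit <- type_convert(
  tbl,
  col_types = cols(
    name  = col_character(),
    count = col_integer(),
    ratio = col_double()
  )
)
str(typed_explicit)
```

Guessing is convenient for quick looks, while the explicit form is safer when a column such as an ID should stay character even though it looks numeric.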
Feature requests & bug reports are most welcome [on GitHub](https://github.com/hrbrmstr/docxtractr/issues).