

{"id":3633,"date":"2015-08-23T09:14:04","date_gmt":"2015-08-23T14:14:04","guid":{"rendered":"http:\/\/rud.is\/b\/?p=3633"},"modified":"2018-03-10T07:54:36","modified_gmt":"2018-03-10T12:54:36","slug":"using-r-to-get-data-out-of-word-docs","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2015\/08\/23\/using-r-to-get-data-out-of-word-docs\/","title":{"rendered":"Using R To Get Data *Out Of* Word Docs"},"content":{"rendered":"<p>NOTE: after reading this post head on over to <a href=\"https:\/\/rud.is\/b\/2015\/08\/24\/new-pacakge-docxtractr-easily-extract-tables-from-microsoft-word-docs\/\">this new one<\/a> as it has wrapped this functionality (and more!) into a package.<\/p>\n<p>Also: <code>docxtractr<\/code> is now <a href=\"https:\/\/cran.rstudio.com\/web\/packages\/docxtractr\/index.html\">on CRAN<\/a><\/p>\n<hr \/>\n<p>This was asked on twitter recently:<\/p>\n<blockquote class=\"twitter-tweet\" lang=\"en\">\n<p lang=\"en\" dir=\"ltr\">Is it possible to import data entered in MS Word into R &#8211; I have multiple tables in 235 files that need importing <a href=\"https:\/\/mobile.twitter.com\/hashtag\/rstats?src=hash\">#rstats<\/a><\/p>\n<p>&mdash; Richard Telford (@richardjtelford) <a href=\"https:\/\/mobile.twitter.com\/richardjtelford\/status\/635362926904373248\">August 23, 2015<\/a><\/p><\/blockquote>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<p>The answer is a <em>very cautious<\/em> &#8220;yes&#8221;. 
Much depends on how well-formed and un-formatted the table is.<\/p>\n<p>Take this really simple <code>docx<\/code> file: <a href=\"https:\/\/www.dropbox.com\/s\/or2yjy4mje03t5b\/data.docx?dl=0\">data.docx<\/a>.<\/p>\n<p>It has a single table in it:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"3634\" data-permalink=\"https:\/\/rud.is\/b\/2015\/08\/23\/using-r-to-get-data-out-of-word-docs\/data_docx\/\" data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2015\/08\/data_docx.png?fit=512%2C163&amp;ssl=1\" data-orig-size=\"512,163\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"data_docx\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2015\/08\/data_docx.png?fit=510%2C162&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2015\/08\/data_docx.png?resize=510%2C162&#038;ssl=1\" alt=\"data_docx\" width=\"510\" height=\"162\" class=\"aligncenter size-full wp-image-3634\" \/><\/p>\n<p>Now, <code>.docx<\/code> files are just zipped directories, so rename that to <code>data.zip<\/code>, unzip it and navigate to <code>data\/word\/document.xml<\/code> and you&#8217;ll see something like this (though it&#8217;ll be more compressed):<\/p>\n<pre id=\"xml-doc\"><code class=\"language-xml\">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?&gt;\r\n&lt;w:document xmlns:wpc=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2010\/wordprocessingCanvas&quot; 
xmlns:mo=&quot;http:\/\/schemas.microsoft.com\/office\/mac\/office\/2008\/main&quot; xmlns:mc=&quot;http:\/\/schemas.openxmlformats.org\/markup-compatibility\/2006&quot; xmlns:mv=&quot;urn:schemas-microsoft-com:mac:vml&quot; xmlns:o=&quot;urn:schemas-microsoft-com:office:office&quot; xmlns:r=&quot;http:\/\/schemas.openxmlformats.org\/officeDocument\/2006\/relationships&quot; xmlns:m=&quot;http:\/\/schemas.openxmlformats.org\/officeDocument\/2006\/math&quot; xmlns:v=&quot;urn:schemas-microsoft-com:vml&quot; xmlns:wp14=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2010\/wordprocessingDrawing&quot; xmlns:wp=&quot;http:\/\/schemas.openxmlformats.org\/drawingml\/2006\/wordprocessingDrawing&quot; xmlns:w10=&quot;urn:schemas-microsoft-com:office:word&quot; xmlns:w=&quot;http:\/\/schemas.openxmlformats.org\/wordprocessingml\/2006\/main&quot; xmlns:w14=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2010\/wordml&quot; xmlns:w15=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2012\/wordml&quot; xmlns:wpg=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2010\/wordprocessingGroup&quot; xmlns:wpi=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2010\/wordprocessingInk&quot; xmlns:wne=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2006\/wordml&quot; xmlns:wps=&quot;http:\/\/schemas.microsoft.com\/office\/word\/2010\/wordprocessingShape&quot; mc:Ignorable=&quot;w14 w15 wp14&quot;&gt;\r\n&lt;w:body&gt;\r\n    &lt;w:tbl&gt;\r\n        &lt;w:tblPr&gt;\r\n            &lt;w:tblStyle w:val=&quot;TableGrid&quot;\/&gt;\r\n            &lt;w:tblW w:w=&quot;0&quot; w:type=&quot;auto&quot;\/&gt;\r\n            &lt;w:tblLook w:val=&quot;04A0&quot; w:firstRow=&quot;1&quot; w:lastRow=&quot;0&quot; w:firstColumn=&quot;1&quot; w:lastColumn=&quot;0&quot; w:noHBand=&quot;0&quot; w:noVBand=&quot;1&quot;\/&gt;\r\n        &lt;\/w:tblPr&gt;\r\n        &lt;w:tblGrid&gt;\r\n            &lt;w:gridCol w:w=&quot;2337&quot;\/&gt;\r\n            &lt;w:gridCol 
w:w=&quot;2337&quot;\/&gt;\r\n            &lt;w:gridCol w:w=&quot;2338&quot;\/&gt;\r\n            &lt;w:gridCol w:w=&quot;2338&quot;\/&gt;\r\n        &lt;\/w:tblGrid&gt;\r\n        &lt;w:tr w:rsidR=&quot;00244D8A&quot; w14:paraId=&quot;6808A6FE&quot; w14:textId=&quot;77777777&quot; w:rsidTr=&quot;00244D8A&quot;&gt;\r\n            &lt;w:tc&gt;\r\n                &lt;w:tcPr&gt;\r\n                    &lt;w:tcW w:w=&quot;2337&quot; w:type=&quot;dxa&quot;\/&gt;\r\n                &lt;\/w:tcPr&gt;\r\n                &lt;w:p w14:paraId=&quot;7D006905&quot; w14:textId=&quot;77777777&quot; w:rsidR=&quot;00244D8A&quot; w:rsidRDefault=&quot;00244D8A&quot;&gt;\r\n                    &lt;w:r&gt;\r\n                        &lt;w:t&gt;This&lt;\/w:t&gt;\r\n                    &lt;\/w:r&gt;\r\n                &lt;\/w:p&gt;\r\n            &lt;\/w:tc&gt;\r\n            &lt;w:tc&gt;\r\n                &lt;w:tcPr&gt;\r\n                    &lt;w:tcW w:w=&quot;2337&quot; w:type=&quot;dxa&quot;\/&gt;\r\n                &lt;\/w:tcPr&gt;\r\n                &lt;w:p w14:paraId=&quot;13C9E52C&quot; w14:textId=&quot;77777777&quot; w:rsidR=&quot;00244D8A&quot; w:rsidRDefault=&quot;00244D8A&quot;&gt;\r\n                    &lt;w:r&gt;\r\n                        &lt;w:t&gt;Is&lt;\/w:t&gt;\r\n                    &lt;\/w:r&gt;\r\n                &lt;\/w:p&gt;\r\n            &lt;\/w:tc&gt;\r\n...<\/code><\/pre>\n<p>We can easily make out a table structure with rows and columns. 
In the simplest cases (which is all I&#8217;ll cover in this post), where the rows and columns are uniform, it&#8217;s pretty easy to grab the data:<\/p>\n<pre id=\"core-r\"><code class=\"language-r\">library(xml2)\r\n\r\n# read in the XML file\r\ndoc &lt;- read_xml(&quot;data\/word\/document.xml&quot;)\r\n\r\n# there is an egregious use of namespaces in these files\r\nns &lt;- xml_ns(doc)\r\n\r\n# extract all the table cells (this is assuming one table in the document)\r\ncells &lt;- xml_find_all(doc, &quot;.\/\/w:tbl\/w:tr\/w:tc&quot;, ns=ns)\r\n\r\n# convert the cells to a matrix (this table has 4 columns), then to a data.frame\r\ndat &lt;- data.frame(matrix(xml_text(cells), ncol=4, byrow=TRUE), \r\n                  stringsAsFactors=FALSE)\r\n\r\n# if there are column headers, make them the column names and remove that row\r\ncolnames(dat) &lt;- dat[1,]\r\ndat &lt;- dat[-1,]\r\nrownames(dat) &lt;- NULL\r\n\r\ndat\r\n\r\n##   This      Is     A   Column\r\n## 1    1     Cat   3.4      Dog\r\n## 2    3    Fish 100.3     Bird\r\n## 3    5 Pelican   -99 Kangaroo<\/code><\/pre>\n<p>You&#8217;ll need to clean up the column types, but you have at least freed the data from the evil file format it was in.<\/p>\n<p>If there is more than one table you can use XML node targeting to process each one separately or collect them into a list. 
I&#8217;ve wrapped that functionality into a rudimentary function that will:<\/p>\n<ul>\n<li>auto-copy a Word doc to a temporary location<\/li>\n<li>rename it to a zip<\/li>\n<li>unzip it to a temporary location<\/li>\n<li>read in the <code>document.xml<\/code><\/li>\n<li>auto-determine the number of tables in the document<\/li>\n<li>auto-calculate the number of rows &amp; columns per table<\/li>\n<li>convert each table<\/li>\n<li>return all the tables in a list<\/li>\n<li>clean up the temporarily created items<\/li>\n<\/ul>\n<pre id=\"final-r\"><code class=\"language-r\">library(xml2)\r\n\r\nget_tbls &lt;- function(word_doc) {\r\n  \r\n  # copy the Word doc to a temporary .zip file and unzip it\r\n  tmpd &lt;- tempdir()\r\n  tmpf &lt;- tempfile(tmpdir=tmpd, fileext=&quot;.zip&quot;)\r\n  \r\n  file.copy(word_doc, tmpf)\r\n  unzip(tmpf, exdir=sprintf(&quot;%s\/docdata&quot;, tmpd))\r\n  \r\n  # read in the main document XML\r\n  doc &lt;- read_xml(sprintf(&quot;%s\/docdata\/word\/document.xml&quot;, tmpd))\r\n  \r\n  # clean up the temporarily created items\r\n  unlink(tmpf)\r\n  unlink(sprintf(&quot;%s\/docdata&quot;, tmpd), recursive=TRUE)\r\n\r\n  ns &lt;- xml_ns(doc)\r\n  \r\n  # find every table in the document\r\n  tbls &lt;- xml_find_all(doc, &quot;.\/\/w:tbl&quot;, ns=ns)\r\n  \r\n  # convert each table to a data.frame and return them all in a list\r\n  lapply(tbls, function(tbl) {\r\n    \r\n    cells &lt;- xml_find_all(tbl, &quot;.\/w:tr\/w:tc&quot;, ns=ns)\r\n    rows &lt;- xml_find_all(tbl, &quot;.\/w:tr&quot;, ns=ns)\r\n    # columns = total cells \/ total rows (assumes a uniform table)\r\n    dat &lt;- data.frame(matrix(xml_text(cells), \r\n                             ncol=(length(cells)\/length(rows)), \r\n                             byrow=TRUE), \r\n                      stringsAsFactors=FALSE)\r\n    # treat the first row as the column headers\r\n    colnames(dat) &lt;- dat[1,]\r\n    dat &lt;- dat[-1,]\r\n    rownames(dat) &lt;- NULL\r\n    dat\r\n    \r\n  })\r\n  \r\n}<\/code><\/pre>\n<p>Using this multi-table Word doc &#8211; <a href=\"https:\/\/www.dropbox.com\/s\/h4onduw8wbrb24e\/data3.docx?dl=0\">data3.docx<\/a>:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"3639\" data-permalink=\"https:\/\/rud.is\/b\/2015\/08\/23\/using-r-to-get-data-out-of-word-docs\/data3\/\" 
data-orig-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2015\/08\/data3.png?fit=503%2C341&amp;ssl=1\" data-orig-size=\"503,341\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"data3\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2015\/08\/data3.png?fit=503%2C341&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2015\/08\/data3.png?resize=503%2C341&#038;ssl=1\" alt=\"data3\" width=\"503\" height=\"341\" class=\"aligncenter size-full wp-image-3639\" \/><\/p>\n<p>we can extract the three tables thusly:<\/p>\n<pre id=\"final-final-r\"><code class=\"language-r\">get_tbls(&quot;~\/Dropbox\/data3.docx&quot;)\r\n\r\n## [[1]]\r\n##   This      Is     A   Column\r\n## 1    1     Cat   3.4      Dog\r\n## 2    3    Fish 100.3     Bird\r\n## 3    5 Pelican   -99 Kangaroo\r\n## \r\n## [[2]]\r\n##   Foo Bar Baz\r\n## 1  Aa  Bb  Cc\r\n## 2  Dd  Ee  Ff\r\n## 3  Gg  Hh  ii\r\n## \r\n## [[3]]\r\n##   Foo Bar\r\n## 1  Aa  Bb\r\n## 2  Dd  Ee\r\n## 3  Gg  Hh\r\n## 4  1    2\r\n## 5  Zz  Jj\r\n## 6  Tt  ii<\/code><\/pre>\n<p>This function tries to calculate the rows\/columns per table but it does rely on a uniform table structure.<\/p>\n<p>Have an alternate method or more feature-complete way of handling Word docs as tabular data sources? Then definitely drop a note in the comments.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>NOTE: after reading this post head on over to this new one as it has wrapped this functionality (and more!) into a package. 
Also: docxtractr is now on CRAN This was asked on twitter recently: Is it possible to import data entered in MS Word into R &#8211; I have multiple tables in 235 files [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","categories":[91,732],"tags":[810],"yoast_head_json":{"author":"hrbrmstr"}}