@RMHoge asked the following on Twitter:
Hello #rstats hyve mind! Is there a package that reads epub into R? I can not find any, I now convert to text and parse the text but you sort of lose the structure of the text. Pinging @dataandme @hrbrmstr
— Roel (@RMHoge) April 12, 2018
Here’s one way to do that which doesn’t rely on pandoc (pandoc can easily do this and ships with RStudio, but shelling out for this is cheating :-)
We’ll need some help (NOTE that two of these are “GitHub” packages):
library(archive) # devtools::install_github("jimhester/archive"); also needs the libarchive system library
library(hgr) # devtools::install_github("hrbrmstr/hgr")
library(stringi)
library(tidyverse)
We’ll use one of @hadleywickham’s books since it’s O’Reilly and they do epubs well. The archive package lets us treat the epub (which is really just a ZIP file) as a mini-filesystem and embraces “tidy” so we have lovely data frames to work with:
bk_src <- "~/Data/R Packages.epub"
bk <- archive::archive(bk_src)
bk
## # A tibble: 92 x 3
## path size date
## <chr> <dbl> <dttm>
## 1 mimetype 20. 2015-03-24 21:49:16
## 2 OEBPS/assets/cover.png 211616. 2015-06-03 16:16:56
## 3 OEBPS/content.opf 10193. 2015-03-24 21:49:16
## 4 OEBPS/toc.ncx 30037. 2015-03-24 21:49:16
## 5 OEBPS/cover.html 315. 2015-03-24 21:49:16
## 6 OEBPS/titlepage01.html 466. 2015-03-24 21:49:16
## 7 OEBPS/copyright-page01.html 3286. 2015-03-24 21:49:16
## 8 OEBPS/toc01.html 17557. 2015-03-24 21:49:16
## 9 OEBPS/preface01.html 17784. 2015-03-24 21:49:16
## 10 OEBPS/part01.html 444. 2015-03-24 21:49:16
## # ... with 82 more rows
We care not about crufty bits and only want HTML files (NOTE: I use html for the pattern since they can be .xhtml files as well):
filter(bk, stri_detect_fixed(path, "html"))
## # A tibble: 26 x 3
## path size date
## <chr> <dbl> <dttm>
## 1 OEBPS/cover.html 315. 2015-03-24 21:49:16
## 2 OEBPS/titlepage01.html 466. 2015-03-24 21:49:16
## 3 OEBPS/copyright-page01.html 3286. 2015-03-24 21:49:16
## 4 OEBPS/toc01.html 17557. 2015-03-24 21:49:16
## 5 OEBPS/preface01.html 17784. 2015-03-24 21:49:16
## 6 OEBPS/part01.html 444. 2015-03-24 21:49:16
## 7 OEBPS/ch01.html 12007. 2015-03-24 21:49:16
## 8 OEBPS/ch02.html 28633. 2015-03-24 21:49:18
## 9 OEBPS/part02.html 454. 2015-03-24 21:49:18
## 10 OEBPS/ch03.html 28629. 2015-03-24 21:49:18
## # ... with 16 more rows
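To see why a fixed-string match on html is enough to catch both extensions, here is a minimal sketch on a made-up file listing (the paths below are invented stand-ins, not the real contents of the book). It also shows a stricter, extension-anchored regex as one possible alternative, in case an epub carries non-HTML files with "html" somewhere in their path:

```r
library(stringi)
library(dplyr)
library(tibble)

# Made-up file listing standing in for the archive tibble (paths only).
files <- tibble(
  path = c("mimetype", "OEBPS/toc.ncx", "OEBPS/ch01.html",
           "OEBPS/ch02.xhtml", "OEBPS/assets/cover.png")
)

# Fixed-string match: keeps anything with "html" anywhere in the path,
# so both .html and .xhtml files survive.
filter(files, stri_detect_fixed(path, "html"))

# Stricter alternative: anchor on the extension, so a path that merely
# contains "html" (say, a hypothetical "OEBPS/html-notes.txt") is dropped.
html_only <- filter(files, stri_detect_regex(path, "\\.x?html$"))
html_only$path
```

For a well-formed O’Reilly epub the two filters should select the same files; the regex version only matters for messier archives.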
Let’s read in one file (as a test) and convert it to text and show the first few lines of it:
archive::archive_read(bk, "OEBPS/preface01.html") %>%
  read_lines() %>%
  paste0(collapse = "\n") -> chapter

hgr::clean_text(chapter) %>%
  stri_sub(1, 1000) %>%
  cat()
## Preface
##
##
## In This Book
##
## This book will guide you from being a user of R packages to being a creator of R packages. In , you’ll learn why mastering this skill is so important, and why it’s easier than you think. Next, you’ll learn about the basic structure of a package, and the forms it can take, in . The subsequent chapters go into more detail about each component. They’re roughly organized in order of importance:
##
##
## The most important directory is R/, where your R code lives. A package with just this directory is still a useful package. (And indeed, if you stop reading the book after this chapter, you’ll have still learned some useful new skills.)
##
## The DESCRIPTION lets you describe what your package needs to work. If you’re sharing your package, you’ll also use the DESCRIPTION to describe what it does, who can use it (the license), and who to contact if things go wrong.
##
## If you want other people (including “future you”!) to understand how to use the functions in your package, you’
hgr::clean_text() uses some XSLT magic to pull text. My jericho package can often do a better job, but it’s rJava-based so a bit painful for some folks to get running.
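If neither of those is an option, the CRAN package xml2 can do a rough version of the same job. This is a swapped-in technique, not what hgr::clean_text() does internally, and the chapter string below is a made-up stand-in for a real file. A minimal sketch that pulls headings and paragraphs as separate strings, which keeps a little of the document structure:

```r
library(xml2)

# A tiny made-up stand-in for the contents of one chapter file.
chapter <- paste0(
  "<html><head><title>Preface</title></head><body>",
  "<h1>Preface</h1><p>This book will guide you.</p>",
  "</body></html>"
)

doc <- read_html(chapter)

# Select headings and paragraphs inside <body>, in document order.
blocks <- xml_find_all(
  doc,
  "//body//*[self::h1 or self::h2 or self::h3 or self::p]"
)

# One string per block, with surrounding whitespace trimmed.
txt <- xml_text(blocks, trim = TRUE)
cat(txt, sep = "\n\n")
```

Picking specific elements means anything outside them (list items, table cells, and so on) is skipped, so the XPath would need widening for real books.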
Now, we’ll convert all the files:
filter(bk, stri_detect_fixed(path, "html")) %>%
  mutate(content = map_chr(path, ~{
    archive::archive_read(bk, .x) %>%
      read_lines() %>%
      paste0(collapse = "\n") %>%
      hgr::clean_text()
  })) %>%
  print(n=27)
## # A tibble: 26 x 4
## path size date content
## <chr> <dbl> <dttm> <chr>
## 1 OEBPS/cover.html 315. 2015-03-24 21:49:16 Cover
## 2 OEBPS/titlepage01.html 466. 2015-03-24 21:49:16 "R Packages\n\n…
## 3 OEBPS/copyright-page01.html 3286. 2015-03-24 21:49:16 "R Packages\n\n…
## 4 OEBPS/toc01.html 17557. 2015-03-24 21:49:16 "navPrefaceIn T…
## 5 OEBPS/preface01.html 17784. 2015-03-24 21:49:16 "Preface\n\n\nI…
## 6 OEBPS/part01.html 444. 2015-03-24 21:49:16 Getting Started
## 7 OEBPS/ch01.html 12007. 2015-03-24 21:49:16 "Introduction\n…
## 8 OEBPS/ch02.html 28633. 2015-03-24 21:49:18 "Package Struct…
## 9 OEBPS/part02.html 454. 2015-03-24 21:49:18 Package Compone…
## 10 OEBPS/ch03.html 28629. 2015-03-24 21:49:18 "R Code\n\nThe …
## 11 OEBPS/ch04.html 31275. 2015-03-24 21:49:18 "Package Metada…
## 12 OEBPS/ch05.html 42089. 2015-03-24 21:49:18 "Object Documen…
## 13 OEBPS/ch06.html 31484. 2015-03-24 21:49:18 "Vignettes: Lon…
## 14 OEBPS/ch07.html 28594. 2015-03-24 21:49:18 "Testing\n\nTes…
## 15 OEBPS/ch08.html 30808. 2015-03-24 21:49:18 "Namespace\n\nT…
## 16 OEBPS/ch09.html 12125. 2015-03-24 21:49:18 "External Data\…
## 17 OEBPS/ch10.html 42013. 2015-03-24 21:49:18 "Compiled Code\…
## 18 OEBPS/ch11.html 8933. 2015-03-24 21:49:18 "Installed File…
## 19 OEBPS/ch12.html 3897. 2015-03-24 21:49:18 "Other Componen…
## 20 OEBPS/part03.html 446. 2015-03-24 21:49:18 Best Practices
## 21 OEBPS/ch13.html 59493. 2015-03-24 21:49:18 "Git and GitHub…
## 22 OEBPS/ch14.html 44702. 2015-03-24 21:49:18 "Automated Chec…
## 23 OEBPS/ch15.html 39450. 2015-03-24 21:49:18 "Releasing a Pa…
## 24 OEBPS/ix01.html 75277. 2015-03-24 21:49:20 IndexAad hoc te…
## 25 OEBPS/colophon01.html 974. 2015-03-24 21:49:20 "About the Auth…
## 26 OEBPS/colophon02.html 1653. 2015-03-24 21:49:20 "Colophon\n\nTh…
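Since an epub is really just a ZIP file, the same flow can also be sketched without the two GitHub packages, using base R’s unzip()/unz() and xml2. To be clear, epub_text() is a hypothetical helper written for this example, and taking the whole <body> text is a cruder extraction than hgr::clean_text(), so the output will not match the listing above exactly:

```r
library(xml2)

# Hypothetical helper (not from any package): read every (X)HTML member
# of an epub into a data frame with one row per file.
epub_text <- function(epub_path) {
  epub_path <- path.expand(epub_path)

  # An epub is a ZIP archive, so base R can list its members.
  members <- unzip(epub_path, list = TRUE)$Name
  pages <- members[grepl("html", members, fixed = TRUE)]

  content <- vapply(pages, function(p) {
    con <- unz(epub_path, p)
    on.exit(close(con), add = TRUE)
    html <- paste(readLines(con, warn = FALSE, encoding = "UTF-8"),
                  collapse = "\n")
    # Crude extraction: all text under <body>; NA if there is no <body>.
    xml_text(xml_find_first(read_html(html), "//body"))
  }, character(1), USE.NAMES = FALSE)

  data.frame(path = pages, content = content, stringsAsFactors = FALSE)
}

# Usage, with the path from earlier in the post:
# epub_text("~/Data/R Packages.epub")
```

The trade-off is fewer dependencies in exchange for rougher text: nothing here strips navigation markup or normalizes whitespace.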
I wasn’t planning to wrap this into a package, since it’s a pretty basic flow that may not require one, but it has since been wrapped into a small package dubbed pubcrawl.
Drop a note in the comments with your hints/workflows on converting epub to plaintext!