I recently posted about using a Python module to convert HTML to usable text. Since then, a new package dubbed htm2txt has hit CRAN; it's 100% R and uses regular expressions to strip tags from text.
I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results.
On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and felt kind of sluggish (base R regular expressions don't have a "compile" feature and can be slow compared to other regular-expression engines).
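To get a rough sense of that overhead, here's a minimal sketch (the snippet and pattern below are mine, purely for illustration) pitting base R's gsub() against stringi's ICU-backed engine on a naive tag-stripping regex; timings will vary by system:

library(stringi)
library(microbenchmark)

# synthetic page: repeat a small fragment to get a reasonably large document
html <- paste(rep("<p>Some <b>bold</b> and <i>italic</i> text</p>", 5000), collapse = "\n")

microbenchmark(
  base_gsub = gsub("<[^>]+>", "", html),                   # base R regex engine
  stringi   = stri_replace_all_regex(html, "<[^>]+>", ""), # ICU regex via stringi
  times = 10
)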
I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho, which builds upon the Jericho HTML Parser created by Martin Jericho and used in at-scale initiatives like the Internet Archive. Yes, I'm trading Python for Java, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don't need to be re-invented (if you do know of a header-only, cross-platform C++ HTML-to-text library, definitely leave a comment).
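For the curious, the underlying Java API is tiny. Here's a sketch of what a direct rJava call to the Jericho library could look like (the JAR file name and version are assumptions on my part, and this isn't necessarily how the jericho package does it internally):

library(rJava)

.jinit() # spin up the JVM
.jaddClassPath("jericho-html-3.4.jar") # assumed JAR name/path; the package bundles its own

# Jericho's Source class wraps the raw HTML; TextExtractor pulls out the text
src <- .jnew("net/htmlparser/jericho/Source", "<p>Hello, <b>world</b>!</p>")
src$getTextExtractor()$toString()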
Is it worth it to get rJava up and running to use jericho vs htm2txt? Let's take a look:
library(jericho) # devtools::install_github("hrbrmstr/jericho")
library(microbenchmark)
library(htm2txt)
library(tidyverse)

# three test pages of varying size and complexity
c(
  "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526",
  "https://en.wikipedia.org/wiki/Timeline_of_antisemitism",
  "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/"
) -> urls

# fetch each page once so the benchmarks measure conversion, not network I/O
map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html
microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[1])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[1])
  },
  htm2txt = {
    a <- htm2txt(sites_html[1])
  },
  times = 10
) -> mb1
# microbenchmark(
#   jericho_txt = {
#     a <- html_to_text(sites_html[2])
#   },
#   jericho_render = {
#     a <- render_html_to_text(sites_html[2])
#   },
#   htm2txt = {
#     a <- htm2txt(sites_html[2])
#   },
#   times = 10
# ) -> mb2
microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[3])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[3])
  },
  htm2txt = {
    a <- htm2txt(sites_html[3])
  },
  times = 10
) -> mb3
The second benchmark is commented out because I really didn't have time to wait for it to complete (FWIW jericho goes fast in that test). Here's what the other two look like:
mb1
## Unit: milliseconds
##           expr         min          lq        mean      median          uq         max neval
##    jericho_txt    4.121872    4.294953    4.567241    4.405356    4.734923    5.621142    10
## jericho_render    5.446296    5.564006    5.927956    5.719971    6.357465    6.785791    10
##        htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132    10
mb3
## Unit: milliseconds
##           expr        min         lq       mean     median         uq        max neval
##    jericho_txt   2.641352   2.814318   3.297543   3.034445   3.488639   5.437411    10
## jericho_render   3.034765   3.143431   4.708136   3.746157   5.953550   8.931072    10
##        htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958    10
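Since the tidyverse attaches ggplot2, you can also eyeball the timing distributions with microbenchmark's autoplot method:

autoplot(mb1) # violin plot of the per-expression timing distributions
autoplot(mb3)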
On these pages, the jericho median timings are roughly two orders of magnitude faster than htm2txt (about 4 ms vs just over 1 s for the first page). You should run the conversion functions on your own systems to compare the extracted text (the outputs are somewhat large to incorporate here), but I'm fairly certain the jericho functions do a comparable, if not better, job of extracting clean, pertinent text.
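If you want a quick side-by-side without pulling down full pages, a tiny snippet works too (the HTML string here is mine, chosen only for illustration):

html <- "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"

html_to_text(html)        # jericho text extraction
render_html_to_text(html) # jericho rendered-text variant
htm2txt(html)             # htm2txt regex-based extraction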
I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R.