Readability Redux

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package dubbed htm2txt has hit CRAN. It is 100% R and uses regular expressions to strip tags from text.

I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results.

On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and felt kind of sluggish. That's not too surprising, since base R regular expressions don’t have a “compile” feature and can be slow compared to other regular expression implementations.
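
To make the "regular expressions to strip tags" idea concrete, here is a minimal base R sketch. It is an illustration only and not htm2txt's actual code: `strip_tags` is a hypothetical helper, and the three patterns are assumptions about what a naive stripper might do.

```r
# Naive regex-based tag stripper (illustrative only; NOT htm2txt's implementation).
strip_tags <- function(html) {
  # Drop <script> and <style> elements together with their contents
  txt <- gsub("(?is)<(script|style)\\b.*?</\\1\\s*>", " ", html, perl = TRUE)
  # Replace every remaining tag with a space
  txt <- gsub("(?s)<[^>]+>", " ", txt, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", txt))
}

strip_tags("<html><body><h1>Title</h1><script>var x = 1;</script><p>Some <b>bold</b> text.</p></body></html>")
# [1] "Title Some bold text."
```

Every `gsub()` call walks the whole document again, which is one reason a multi-pass regex approach can get slow on large pages.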

I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho. It builds upon work created by Martin Jericho that is used in at-scale initiatives like the Internet Archive. Yes, I’m trading Python for Java, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don’t need to be re-invented (if you do know a header-only, cross-platform, C++ HTML-to-text library, definitely leave a comment).

Is it worth it to get rJava up and running to use jericho vs htm2txt? Let’s take a look:

library(jericho) # devtools::install_github("hrbrmstr/jericho")
library(microbenchmark)
library(htm2txt)
library(tidyverse)

c(
  "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526",
  "https://en.wikipedia.org/wiki/Timeline_of_antisemitism",
  "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/"
) -> urls

map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[1])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[1])
  },
  htm2txt = {
    a <- htm2txt(sites_html[1])
  },
  times = 10
) -> mb1

# microbenchmark(
#   jericho_txt = {
#     a <- html_to_text(sites_html[2])
#   },
#   jericho_render = {
#     a <- render_html_to_text(sites_html[2])
#   },
#   htm2txt = {
#     a <- htm2txt(sites_html[2])
#   },
#   times = 10
# ) -> mb2

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[3])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[3])
  },
  htm2txt = {
    a <- htm2txt(sites_html[3])
  },
  times = 10
) -> mb3

The second benchmark is commented out because I really didn’t have time to wait for it to complete (FWIW jericho goes fast in that test). Here’s what the other two look like:

mb1
## Unit: milliseconds
##            expr         min          lq        mean      median          uq         max neval
##     jericho_txt    4.121872    4.294953    4.567241    4.405356    4.734923    5.621142    10
##  jericho_render    5.446296    5.564006    5.927956    5.719971    6.357465    6.785791    10
##         htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132    10

mb3
## Unit: milliseconds
##            expr        min         lq       mean     median         uq        max neval
##     jericho_txt   2.641352   2.814318   3.297543   3.034445   3.488639   5.437411    10
##  jericho_render   3.034765   3.143431   4.708136   3.746157   5.953550   8.931072    10
##         htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958    10
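
To put those numbers in perspective, the median timings can be turned into rough speedup ratios. The sketch below hard-codes the medians from the two outputs above; the `medians` data frame and the `speedup` helper are names made up for illustration.

```r
# Median timings in milliseconds, copied from the mb1 and mb3 output above
medians <- data.frame(
  bench  = rep(c("mb1", "mb3"), each = 3),
  expr   = rep(c("jericho_txt", "jericho_render", "htm2txt"), times = 2),
  median = c(4.405356, 5.719971, 1029.154451, 3.034445, 3.746157, 445.622242),
  stringsAsFactors = FALSE
)

# Ratio of htm2txt's median to the chosen expression's median within one benchmark
speedup <- function(df, bench, expr) {
  base <- df$median[df$bench == bench & df$expr == "htm2txt"]
  base / df$median[df$bench == bench & df$expr == expr]
}

round(speedup(medians, "mb1", "jericho_txt"))  # 234
round(speedup(medians, "mb3", "jericho_txt"))  # 147
```

So on these two pages, with only 10 runs each on one machine, html_to_text() came in at roughly 150 to 230 times faster than htm2txt() at the median. Treat that as a ballpark figure, not a general result.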

You should run the conversion functions on your own systems to compare the results (they’re somewhat large to incorporate here). I’m fairly certain the jericho functions do a comparable, if not better, job of extracting clean, pertinent text.
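
If you want a quick first pass before reading the outputs side by side, something like the following base R sketch could help. `compare_outputs` is a hypothetical helper (not part of jericho or htm2txt), and the two strings are stand-ins for real converter output.

```r
# Hypothetical helper: summarise converted text so different methods can be compared
compare_outputs <- function(outputs) {
  # Split each output on whitespace and count the non-empty pieces as words
  words <- strsplit(outputs, "\\s+")
  data.frame(
    method = names(outputs),
    chars  = nchar(outputs),
    words  = vapply(words, function(w) sum(nzchar(w)), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Stand-in strings; swap in real converter output for an actual comparison
res <- compare_outputs(c(
  method_a = "Title\nSome bold text.",
  method_b = "Title Some bold text. Share this post"
))
res
```

With the packages loaded, the input could instead be `c(jericho = html_to_text(sites_html[1]), htm2txt = htm2txt(sites_html[1]))`. Counts only hint at how much navigation and widget text each method lets through, so you still need to read the text itself.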

I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R.
