I needed to clean some web HTML content for a project, and I usually use hgr::clean_text() for that, which generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-"main text content" from an HTML document. It usually does a good job, but there are some pages it fails miserably on, since it's more of a brute-force method than one that applies any real "intelligence" when targeting text nodes.
Most modern browsers have built-in or plugin-based "readability" capability, and most of those are based — at least in part — on the seminal Arc90 implementation. Many programming languages have a package or module that uses a similar methodology, but I'm not aware of any R ports.
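To give a flavor of the Arc90-style approach, here's a simplified sketch (not the actual algorithm — the weights, regexes, and sample blocks below are illustrative assumptions): candidate blocks are scored by prose signals like comma counts and text length, plus class/id hints, and the highest-scoring block wins.

```python
import re

# Illustrative Arc90-style scoring; real implementations walk the DOM
# and also propagate child scores up to parent nodes.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text")
NEGATIVE = re.compile(r"comment|footer|menu|nav|sidebar|widget")

def score_block(text, class_id=""):
    """Heuristic content score for one candidate block of text."""
    score = 1
    score += text.count(",")           # prose tends to contain commas
    score += min(len(text) // 100, 3)  # reward longer text, capped
    if POSITIVE.search(class_id.lower()):
        score += 25                    # class/id hints at main content
    if NEGATIVE.search(class_id.lower()):
        score -= 25                    # class/id hints at page chrome
    return score

# Two hypothetical blocks from a scraped page: social chrome vs. article prose.
blocks = [
    ("Share this post! Tweet | Facebook", "social-widget"),
    ("The visdat package provides a quick visual overview of a data "
     "frame, showing classes, and missing values, at a glance.", "post-content"),
]
best = max(blocks, key=lambda b: score_block(b[0], b[1]))
```

The interesting part is that none of this parses grammar or meaning; a handful of cheap structural heuristics is usually enough to separate article prose from navigation and widgets.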
What do I mean by "clean text"? Well, I can't show the URL I was having trouble processing, but I can show an example using a recent rOpenSci blog post. Here's what the raw HTML looks like after retrieving it:
library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
(it goes on for a bit, best to run the code locally)
We can use the reticulate package to load the Python readability module to get just the clean article text:
readability <- import("readability") # pip install readability-lxml

doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))

doc$summary() %>%
  read_xml() %>%
  xml_text()
(again, it goes on for a bit, best to run the code locally)
That text is now in good enough shape to tidy.
Here’s the same document processed with hgr::clean_text():

# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
(lastly, it goes on for a bit, best to run the code locally)
As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.
The Python code is quite — heh — readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).