I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for that, and it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document. It usually does a good job, but it fails miserably on some pages, since it’s more of a brute-force method than one that uses any real “intelligence” when targeting text nodes.
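To make “brute-force” concrete, here is a minimal sketch of that style of cleaning done with xml2 instead of XSLT. It is not what hgr::clean_text() actually does: the tag list, the sample HTML and the variable names are all assumptions made up for illustration.

```r
library(xml2)

# Made-up sample page: article text surrounded by navigation, script and footer
html <- paste0(
  '<html><head><title>Demo</title><script>var tracker = 1;</script></head>',
  '<body><nav><a href="/">Home</a></nav>',
  '<div><p>First paragraph of the article.</p><p>Second paragraph.</p></div>',
  '<footer>Copyright notice</footer></body></html>'
)

doc <- read_html(html)

# Assumption: an illustrative list of "never content" tags,
# not the rule set used by the hgr stylesheet
junk <- xml_find_all(doc, "//script | //style | //nav | //header | //footer | //aside")
xml_remove(junk)

# Keep every text node that is left in the body, dropping whitespace-only ones
kept <- trimws(xml_text(xml_find_all(doc, "//body//text()")))
kept <- kept[nzchar(kept)]
kept
```

This works when the junk lives in well-named tags. It has no way to tell a sidebar `<div>` from an article `<div>`, which is the kind of page where a rule-only approach falls over.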
Most modern browsers have inherent or plugin-able “readability” capability, and most of those are based, at least in part, on the seminal Arc90 implementation. Many programming languages have a package or module that uses a similar methodology, but I’m not aware of any R ports.
What do I mean by “clean text”? Well, I can’t show the URL I was having trouble processing but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:
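library(xml2)       # read_xml(), xml_text()
library(httr)       # GET(), content()
library(reticulate) # import() for Python modules
library(magrittr)   # %>%
# fetch the post and return the raw HTML as a single string
res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")
content(res, as="text", encoding="UTF-8")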
(it goes on for a bit, best to run the code locally)
We can use the reticulate package to load the Python readability module to just get the clean, article text:
readability <- import("readability") # pip install readability-lxml
doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))
# summary() returns the extracted article markup; parse it and keep only the text
doc$summary() %>%
  read_xml() %>%
  xml_text()
(again, it goes on for a bit, best to run the code locally)
That text is now in good enough shape to tidy.
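If “tidy” here means the tidytext one-token-per-row shape, which is my assumption, the next step could look like this rough sketch. The `article_text` string is a made-up stand-in rather than text from the rOpenSci post, and tibble, dplyr and tidytext are extra packages the code above does not load.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Made-up stand-in for the character string returned by xml_text()
article_text <- "Missing data is common. Visualising missing data helps you find missing values early."

# One row per word, then count how often each word appears
words <- tibble(text = article_text) %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

words
```

unnest_tokens() lowercases and strips punctuation by default, so leftover markup in the input would show up as junk tokens, which is why the cleaning step matters.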
Here’s the same version with clean_text():
# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
(lastly, it goes on for a bit, best to run the code locally)
As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.
The Python code is quite (heh) readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).
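For anyone tempted, here is a toy starting point in plain xml2. It is only loosely inspired by the general readability idea of scoring candidate containers by how much paragraph text they hold and penalising link-heavy ones. It is not a port of Arc90 or of the Python module; `score_node()`, the scoring formula and the sample HTML are all hypothetical.

```r
library(xml2)

# Hypothetical helper: paragraph text length in a container, discounted by
# the share of that text that sits inside links
score_node <- function(node) {
  text_len <- sum(nchar(xml_text(xml_find_all(node, "./p"))))
  link_len <- sum(nchar(xml_text(xml_find_all(node, "./p//a"))))
  if (text_len == 0) return(0)
  text_len * (1 - link_len / text_len)
}

# Made-up sample page with a menu, a post body and a footer
html <- paste0(
  '<html><body>',
  '<div id="menu"><p><a href="/">Home</a></p><p><a href="/about">About</a></p></div>',
  '<div id="post"><p>Missing data is a common problem, and it helps to see it.</p>',
  '<p>A second paragraph with a <a href="#">link</a> and more prose.</p></div>',
  '<div id="footer"><p>Copyright</p></div>',
  '</body></html>'
)

doc <- read_html(html)
candidates <- xml_find_all(doc, "//div")

# Score every candidate container and keep the highest-scoring one
scores <- vapply(candidates, score_node, numeric(1))
best <- candidates[[which.max(scores)]]

xml_attr(best, "id")
xml_text(xml_find_all(best, "./p"))
```

The menu scores zero because all of its text is link text, so the post container wins. A real port would need far more than this, including nested candidates and many other signals.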