
I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for this, and it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove everything that isn't “main text content” from an HTML document. It usually does a good job, but it fails miserably on some pages, since it’s more of a brute-force method than one that uses any real “intelligence” when targeting text nodes.

Most modern browsers have built-in or plugin-able “readability” capability, and most of those are based, at least in part, on the seminal Arc90 implementation. Many programming languages have a package or module that uses a similar methodology, but I’m not aware of any R ports.
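For a rough feel of what that methodology looks like, here is a toy sketch in Python (standard library only). It is loosely in the spirit of the Arc90 heuristics, not the real algorithm: it scores each container by comma count and text length, then nudges the score up or down based on hints in the class/id names. The names BlockScorer, score and best_block, the word lists, and the sample page are all made up for illustration, and nested containers are handled naively.

```python
from html.parser import HTMLParser
import re

# Illustrative hint lists (assumptions, not the lists any real library uses).
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|nav|sidebar|share|widget|promo", re.I)


class BlockScorer(HTMLParser):
    """Collects the text that sits directly inside each container element."""

    CONTAINERS = {"div", "article", "section"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open containers, innermost last
        self.blocks = []  # containers that have been closed

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            a = dict(attrs)
            hint = f"{a.get('class') or ''} {a.get('id') or ''}"
            self.stack.append({"hint": hint, "text": []})

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        # Text is credited to the innermost open container only.
        if self.stack:
            self.stack[-1]["text"].append(data)


def score(block):
    """Return (points, text): commas and length add points, hints shift them."""
    text = " ".join("".join(block["text"]).split())
    points = text.count(",") + min(len(text) // 100, 3)
    if POSITIVE.search(block["hint"]):
        points += 25
    if NEGATIVE.search(block["hint"]):
        points -= 25
    return points, text


def best_block(html):
    """Return the text of the highest-scoring container, or "" if none."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    scored = [score(b) for b in parser.blocks]
    return max(scored, key=lambda s: s[0])[1] if scored else ""


page = """
<div id="sidebar"><a href="/">Home</a> <a href="/blog">Blog</a></div>
<div class="post-content"><p>Take a look at the data. It is ambiguous, vague, and common.</p></div>
<div class="footer">Copyright, licence, contact</div>
"""
print(best_block(page))
```

The point is only that a scoring pass can prefer the article body over navigation and footer chrome without a hand-written rule per site, which is what a fixed stylesheet can't do.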

What do I mean by “clean text”? Well, I can’t show the URL I was having trouble processing, but I can show an example using a recent rOpenSci blog post. Here’s what the raw HTML looks like after retrieving it:

library(xml2)
library(httr)
library(reticulate)
library(magrittr)

res <- GET("https://ropensci.org/blog/blog/2017/08/22/visdat")

content(res, as="text", encoding="UTF-8")
## [1] "\n \n<!DOCTYPE html>\n<html lang=\"en\">\n <head>\n <meta charset=\"utf-8\">\n <meta name=\"apple-mobile-web-app-capable\" content=\"yes\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" />\n <meta name=\"apple-mobile-web-app-status-bar-style\" content=\"black\" />\n <link rel=\"shortcut icon\" href=\"/assets/flat-ui/images/favicon.ico\">\n\n <link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://ropensci.org/feed.xml\" />\n\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/bootstrap/css/bootstrap.css\">\n <link rel=\"stylesheet\" href=\"/assets/flat-ui/css/flat-ui.css\">\n\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/icon-font.css\">\n <link rel=\"stylesheet\" href=\"/assets/common-files/css/animations.css\">\n <link rel=\"stylesheet\" href=\"/static/css/style.css\">\n <link href=\"/assets/css/ss-social/webfonts/ss-social.css\" rel=\"stylesheet\" />\n <link href=\"/assets/css/ss-standard/webfonts/ss-standard.css\" rel=\"stylesheet\"/>\n <link rel=\"stylesheet\" href=\"/static/css/github.css\">\n <script type=\"text/javascript\" src=\"//use.typekit.net/djn7rbd.js\"></script>\n <script type=\"text/javascript\">try{Typekit.load();}catch(e){}</script>\n <script src=\"/static/highlight.pack.js\"></script>\n <script>hljs.initHighlightingOnLoad();</script>\n\n <title>Onboarding visdat, a tool for preliminary visualisation of whole dataframes</title>\n <meta name=\"keywords\" content=\"R, software, package, review, community, visdat, data-visualisation\" />\n <meta name=\"description\" content=\"\" />\n <meta name=\"resource_type\" content=\"website\"/>\n <!– RDFa Metadata (in DublinCore) –>\n <meta property=\"dc:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"dc:creator\" content=\"\" />\n <meta property=\"dc:date\" content=\"\" />\n <meta property=\"dc:format\" content=\"text/html\" />\n <meta property=\"dc:language\" 
content=\"en\" />\n <meta property=\"dc:identifier\" content=\"/blog/blog/2017/08/22/visdat\" />\n <meta property=\"dc:rights\" content=\"CC0\" />\n <meta property=\"dc:source\" content=\"\" />\n <meta property=\"dc:subject\" content=\"Ecology\" />\n <meta property=\"dc:type\" content=\"website\" />\n <!– RDFa Metadata (in OpenGraph) –>\n <meta property=\"og:title\" content=\"Onboarding visdat, a tool for preliminary visualisation of whole dataframes\" />\n <meta property=\"og:author\" content=\"/index.html#me\" /> <!– Should be Liquid? URI? –>\n <meta property=\"http://ogp.me/ns/profile#first_name\" content=\"\"/>\n <meta property=\"http://ogp.me/ns/profile#last_name\" content=\"\"/>\n

(it goes on for a bit, best to run the code locally)

We can use the reticulate package to load the Python readability module and get just the clean article text:

readability <- import("readability") # pip install readability-lxml

doc <- readability$Document(httr::content(res, as="text", encoding="UTF-8"))

doc$summary() %>%
  read_xml() %>%
  xml_text()
# [1] "Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".Making visdat was fun, and it was easy to use. But I couldn't help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. 
If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What's in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here's what rOpenSci does that CRAN cannot:Assess documentation readability / usabilityProvide a code review to find weak points / points of improvementDetermine whether a package is overlapping with another.

(again, it goes on for a bit, best to run the code locally)
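One thing to notice in that output: taking the text of the whole document glues neighbouring blocks together ("Take a look at the dataThis is a phrase"). If the paragraph boundaries matter, one option is to emit a line break at every block-level tag while walking the summary HTML. Below is a minimal sketch of that idea in Python (standard library only). BlockText, html_to_lines and the BLOCK_TAGS list are illustrative names, and the summary string is a hand-written stand-in for what doc$summary() returns, not actual output.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an illustrative, incomplete list).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "blockquote", "pre", "br"}


class BlockText(HTMLParser):
    """Collects text, inserting a newline wherever a block tag opens or closes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_lines(html):
    """Return one whitespace-normalised string per non-empty block of text."""
    parser = BlockText()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).split("\n")
    lines = (" ".join(line.split()) for line in raw_lines)
    return [line for line in lines if line]


# Hand-written stand-in for a readability summary (an assumption).
summary = (
    "<html><body><div>"
    "<h2>Take a look at the data</h2>"
    "<p>This is a phrase that comes up when you first get a dataset.</p>"
    "<p>It is also ambiguous.</p>"
    "</div></body></html>"
)

for line in html_to_lines(summary):
    print(line)
```

Each heading and paragraph comes back as its own element, which is usually an easier starting point for tokenising than one long run-on string.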

That text is now in good enough shape to tidy.

Here’s the same page run through clean_text():

# devtools::install_github("hrbrmstr/hgr")
hgr::clean_text(content(res, as="text", encoding="UTF-8"))
## [1] "Onboarding visdat, a tool for preliminary visualisation of whole dataframes\n \n \n \n \n  \n \n \n \n \n August 22, 2017 \n \n \n \n \nTake a look at the data\n\n\nThis is a phrase that comes up when you first get a dataset.\n\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\n\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types – height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let's not forget everyone's favourite: missing data.\n\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can't even start to take a look at the data, and that is frustrating.\n\nThe package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \"get a look at the data\".\n\nMaking was fun, and it was easy to use. But I couldn't help but think that maybe could be more.\n\n I felt like the code was a little sloppy, and that it could be better.\n I wanted to know whether others found it useful.\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\n\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci provides.\n\nrOpenSci onboarding basics\n\nOnboarding a package onto rOpenSci is an open peer review of an R package. 
If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\n\nWhat's in it for the author?\n\nFeedback on your package\nSupport from rOpenSci members\nMaintain ownership of your package\nPublicity from it being under rOpenSci\nContribute something to rOpenSci\nPotentially a publication\nWhat can rOpenSci do that CRAN cannot?\n\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here's what rOpenSci does that CRAN cannot:\n\nAssess documentation readability / usability\nProvide a code review to find weak points / points of improvement\nDetermine whether a package is overlapping with another.

(lastly, it goes on for a bit, best to run the code locally)

As you can see, even though that version is usable, readability does a much smarter job of cleaning the text.

The Python code is quite, heh, readable, and R could really use a native port (i.e. this would be a ++gd project for an aspiring package author to take on).
