

{"id":6172,"date":"2017-08-24T11:01:41","date_gmt":"2017-08-24T16:01:41","guid":{"rendered":"https:\/\/rud.is\/b\/?p=6172"},"modified":"2018-03-10T07:54:26","modified_gmt":"2018-03-10T12:54:26","slug":"reticulating-readability","status":"publish","type":"post","link":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/","title":{"rendered":"Reticulating Readability"},"content":{"rendered":"<p>I needed to clean some web HTML content for a project and I usually use <code>hgr::clean_text()<\/code> for it and that generally works pretty well. The <code>clean_text()<\/code> function uses an XSLT stylesheet to try to remove all non-&#8220;main text content&#8221; from an HTML document and it usually does a good job but there are some pages that it fails miserably on since it&#8217;s more of a brute-force method than one that uses any real &#8220;intelligence&#8221; when performing the text node targeting.<\/p>\n<p>Most modern browsers have built-in or <a href=\"https:\/\/github.com\/MHordecki\/readability-redux\">plugin-able<\/a> &#8220;readability&#8221; capability, and most of those are based, at least in part, on the seminal <a href=\"https:\/\/github.com\/masukomi\/ar90-readability\">Arc90 implementation<\/a>. Many programming languages have a package or module that uses a similar methodology, but I&#8217;m not aware of any R ports.<\/p>\n<p>What do I mean by &#8220;clean text&#8221;? Well, I can&#8217;t show the URL I was having trouble processing, but I <em>can<\/em> show an example using a recent <a href=\"https:\/\/ropensci.org\/blog\/2017\/08\/22\/visdat\/\">rOpenSci blog post<\/a>. 
Here&#8217;s what the raw HTML looks like after retrieving it:<\/p>\n<pre id=\"retread01\"><code class=\"language-r\">library(xml2)\r\nlibrary(httr)\r\nlibrary(reticulate)\r\nlibrary(magrittr)\r\n\r\n# fetch the raw page\r\nres &lt;- GET(&quot;https:\/\/ropensci.org\/blog\/blog\/2017\/08\/22\/visdat&quot;)\r\n\r\n# return the response body as a single UTF-8 string\r\ncontent(res, as=&quot;text&quot;, encoding=&quot;UTF-8&quot;)<\/code><\/pre>\n<div style=\"font-family:monospace; height:300px; width=100%; overflow-y:auto\">\n## [1] &#x22;\\n \\n&#x3C;!DOCTYPE html&#x3E;\\n&#x3C;html lang=\\&#x22;en\\&#x22;&#x3E;\\n    &#x3C;head&#x3E;\\n        &#x3C;meta charset=\\&#x22;utf-8\\&#x22;&#x3E;\\n        &#x3C;meta name=\\&#x22;apple-mobile-web-app-capable\\&#x22; content=\\&#x22;yes\\&#x22; \/&#x3E;\\n        &#x3C;meta name=\\&#x22;viewport\\&#x22; content=\\&#x22;width=device-width, initial-scale=1.0\\&#x22; \/&#x3E;\\n        &#x3C;meta name=\\&#x22;apple-mobile-web-app-status-bar-style\\&#x22; content=\\&#x22;black\\&#x22; \/&#x3E;\\n        &#x3C;link rel=\\&#x22;shortcut icon\\&#x22; href=\\&#x22;\/assets\/flat-ui\/images\/favicon.ico\\&#x22;&#x3E;\\n\\n        &#x3C;link rel=\\&#x22;alternate\\&#x22; type=\\&#x22;application\/rss+xml\\&#x22; title=\\&#x22;RSS\\&#x22; href=\\&#x22;http:\/\/ropensci.org\/feed.xml\\&#x22; \/&#x3E;\\n\\n        &#x3C;link rel=\\&#x22;stylesheet\\&#x22; href=\\&#x22;\/assets\/flat-ui\/bootstrap\/css\/bootstrap.css\\&#x22;&#x3E;\\n        &#x3C;link rel=\\&#x22;stylesheet\\&#x22; href=\\&#x22;\/assets\/flat-ui\/css\/flat-ui.css\\&#x22;&#x3E;\\n\\n        &#x3C;link rel=\\&#x22;stylesheet\\&#x22; href=\\&#x22;\/assets\/common-files\/css\/icon-font.css\\&#x22;&#x3E;\\n        &#x3C;link rel=\\&#x22;stylesheet\\&#x22; href=\\&#x22;\/assets\/common-files\/css\/animations.css\\&#x22;&#x3E;\\n        &#x3C;link rel=\\&#x22;stylesheet\\&#x22; href=\\&#x22;\/static\/css\/style.css\\&#x22;&#x3E;\\n        &#x3C;link href=\\&#x22;\/assets\/css\/ss-social\/webfonts\/ss-social.css\\&#x22; rel=\\&#x22;stylesheet\\&#x22; 
\/&#x3E;\\n        &#x3C;link href=\\&#x22;\/assets\/css\/ss-standard\/webfonts\/ss-standard.css\\&#x22; rel=\\&#x22;stylesheet\\&#x22;\/&#x3E;\\n        &#x3C;link rel=\\&#x22;stylesheet\\&#x22; href=\\&#x22;\/static\/css\/github.css\\&#x22;&#x3E;\\n        &#x3C;script type=\\&#x22;text\/javascript\\&#x22; src=\\&#x22;\/\/use.typekit.net\/djn7rbd.js\\&#x22;&#x3E;&#x3C;\/script&#x3E;\\n        &#x3C;script type=\\&#x22;text\/javascript\\&#x22;&#x3E;try{Typekit.load();}catch(e){}&#x3C;\/script&#x3E;\\n        &#x3C;script src=\\&#x22;\/static\/highlight.pack.js\\&#x22;&#x3E;&#x3C;\/script&#x3E;\\n        &#x3C;script&#x3E;hljs.initHighlightingOnLoad();&#x3C;\/script&#x3E;\\n\\n        &#x3C;title&#x3E;Onboarding visdat, a tool for preliminary visualisation of whole dataframes&#x3C;\/title&#x3E;\\n        &#x3C;meta name=\\&#x22;keywords\\&#x22; content=\\&#x22;R, software, package, review, community, visdat, data-visualisation\\&#x22; \/&#x3E;\\n        &#x3C;meta name=\\&#x22;description\\&#x22; content=\\&#x22;\\&#x22; \/&#x3E;\\n        &#x3C;meta name=\\&#x22;resource_type\\&#x22; content=\\&#x22;website\\&#x22;\/&#x3E;\\n        &#x3C;!&#8211; RDFa Metadata (in DublinCore) &#8211;&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:title\\&#x22; content=\\&#x22;Onboarding visdat, a tool for preliminary visualisation of whole dataframes\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:creator\\&#x22; content=\\&#x22;\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:date\\&#x22; content=\\&#x22;\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:format\\&#x22; content=\\&#x22;text\/html\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:language\\&#x22; content=\\&#x22;en\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:identifier\\&#x22; content=\\&#x22;\/blog\/blog\/2017\/08\/22\/visdat\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:rights\\&#x22; content=\\&#x22;CC0\\&#x22; \/&#x3E;\\n        &#x3C;meta 
property=\\&#x22;dc:source\\&#x22; content=\\&#x22;\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:subject\\&#x22; content=\\&#x22;Ecology\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;dc:type\\&#x22; content=\\&#x22;website\\&#x22; \/&#x3E;\\n        &#x3C;!&#8211; RDFa Metadata (in OpenGraph) &#8211;&#x3E;\\n        &#x3C;meta property=\\&#x22;og:title\\&#x22; content=\\&#x22;Onboarding visdat, a tool for preliminary visualisation of whole dataframes\\&#x22; \/&#x3E;\\n        &#x3C;meta property=\\&#x22;og:author\\&#x22; content=\\&#x22;\/index.html#me\\&#x22; \/&#x3E;  &#x3C;!&#8211; Should be Liquid? URI? &#8211;&#x3E;\\n        &#x3C;meta property=\\&#x22;http:\/\/ogp.me\/ns\/profile#first_name\\&#x22; content=\\&#x22;\\&#x22;\/&#x3E;\\n        &#x3C;meta property=\\&#x22;http:\/\/ogp.me\/ns\/profile#last_name\\&#x22; content=\\&#x22;\\&#x22;\/&#x3E;\\n\n<\/div>\n<p>(it goes on for a bit; best to run the code locally)<\/p>\n<p>We can use the <code>reticulate<\/code> package to load the Python <code>readability<\/code> module and get just the clean article text:<\/p>\n<pre id=\"retread02\"><code class=\"language-r\">readability &lt;- import(&quot;readability&quot;) # pip install readability-lxml\r\n\r\ndoc &lt;- readability$Document(httr::content(res, as=&quot;text&quot;, encoding=&quot;UTF-8&quot;))\r\n\r\n# extract the main article as an HTML fragment, then keep only its text\r\ndoc$summary() %&gt;%\r\n  read_xml() %&gt;%\r\n  xml_text()<\/code><\/pre>\n<div style=\"font-family:monospace; height:300px; width=100%; overflow-y:auto\">\n## [1] &#x22;Take a look at the dataThis is a phrase that comes up when you first get a dataset.It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. 
The mix ups of data types &#8211; height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let&#x27;s not forget everyone&#x27;s favourite: missing data.These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can&#x27;t even start to take a look at the data, and that is frustrating.The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \\&#x22;get a look at the data\\&#x22;.Making visdat was fun, and it was easy to use. But I couldn&#x27;t help but think that maybe visdat could be more. I felt like the code was a little sloppy, and that it could be better. I wanted to know whether others found it useful.What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.rOpenSci onboarding basicsOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.What&#x27;s in it for the author?Feedback on your packageSupport from rOpenSci membersMaintain ownership of your packagePublicity from it being under rOpenSciContribute something to rOpenSciPotentially a publicationWhat can rOpenSci do that CRAN cannot?The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. 
Here&#x27;s what rOpenSci does that CRAN cannot:Assess documentation readability \/ usabilityProvide a code review to find weak points \/ points of improvementDetermine whether a package is overlapping with another.\n<\/div>\n<p>(again, it goes on for a bit; best to run the code locally)<\/p>\n<p>That text is now in good enough shape to <a href=\"https:\/\/cran.r-project.org\/web\/packages\/tidytext\/index.html\">tidy<\/a>.<\/p>\n<p>Here&#8217;s the same page run through <code>clean_text()<\/code>:<\/p>\n<pre id=\"retread03\"><code class=\"language-r\"># devtools::install_github(&quot;hrbrmstr\/hgr&quot;)\r\nhgr::clean_text(content(res, as=&quot;text&quot;, encoding=&quot;UTF-8&quot;))<\/code><\/pre>\n<div style=\"font-family:monospace; height:300px; width=100%; overflow-y:auto\">\n## [1] &#x22;Onboarding visdat, a tool for preliminary visualisation of whole dataframes\\n            \\n              \\n    \\n      \\n         &#xA0;\\n      \\n    \\n              \\n            \\n            August 22, 2017 \\n        \\n        \\n            \\n                \\nTake a look at the data\\n\\n\\nThis is a phrase that comes up when you first get a dataset.\\n\\nIt is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?\\n\\nStarting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix ups of data types &#8211; height in cm coded as a factor, categories are numerics with decimals, strings are datetimes, and somehow datetime is one long number. And let&#x27;s not forget everyone&#x27;s favourite: missing data.\\n\\nThese growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can&#x27;t even start to take a look at the data, and that is frustrating.\\n\\nThe  package aims to make this preliminary part of analysis easier. 
It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to \\&#x22;get a look at the data\\&#x22;.\\n\\nMaking  was fun, and it was easy to use. But I couldn&#x27;t help but think that maybe  could be more.\\n\\n I felt like the code was a little sloppy, and that it could be better.\\n I wanted to know whether others found it useful.\\nWhat I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.\\n\\nToo much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci  provides.\\n\\nrOpenSci onboarding basics\\n\\nOnboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with .\\n\\nWhat&#x27;s in it for the author?\\n\\nFeedback on your package\\nSupport from rOpenSci members\\nMaintain ownership of your package\\nPublicity from it being under rOpenSci\\nContribute something to rOpenSci\\nPotentially a publication\\nWhat can rOpenSci do that CRAN cannot?\\n\\nThe rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN . Here&#x27;s what rOpenSci does that CRAN cannot:\\n\\nAssess documentation readability \/ usability\\nProvide a code review to find weak points \/ points of improvement\\nDetermine whether a package is overlapping with another.\n<\/div>\n<p>(lastly, it goes on for a bit, best to run the code locally)<\/p>\n<p>As you can see, even though that version is usable, <code>readability<\/code> does a much smarter job of cleaning the text.<\/p>\n<p>The Python code is quite &#8212; heh &#8212; readable, and R could really use a native port (i.e. 
this would be a ++gd project for an aspiring package author to take on).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-&#8220;main text content&#8221; from an HTML document and it usually does a good job but there are some pages [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":3,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[91,725],"tags":[810],"class_list":["post-6172","post","type-post","status-publish","format-standard","hentry","category-r","category-web-scraping","tag-post"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Reticulating Readability - rud.is<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Reticulating Readability - rud.is\" \/>\n<meta property=\"og:description\" content=\"I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. 
The clean_text() function uses an XSLT stylesheet to try to remove all non-&#8220;main text content&#8221; from an HTML document and it usually does a good job but there are some pages [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/\" \/>\n<meta property=\"og:site_name\" content=\"rud.is\" \/>\n<meta property=\"article:published_time\" content=\"2017-08-24T16:01:41+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2018-03-10T12:54:26+00:00\" \/>\n<meta name=\"author\" content=\"hrbrmstr\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"hrbrmstr\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/\"},\"author\":{\"name\":\"hrbrmstr\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"headline\":\"Reticulating Readability\",\"datePublished\":\"2017-08-24T16:01:41+00:00\",\"dateModified\":\"2018-03-10T12:54:26+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/\"},\"wordCount\":1831,\"commentCount\":5,\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"keywords\":[\"post\"],\"articleSection\":[\"R\",\"web 
scraping\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/\",\"name\":\"Reticulating Readability - rud.is\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\"},\"datePublished\":\"2017-08-24T16:01:41+00:00\",\"dateModified\":\"2018-03-10T12:54:26+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/2017\\\/08\\\/24\\\/reticulating-readability\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/rud.is\\\/b\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Reticulating Readability\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#website\",\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/\",\"name\":\"rud.is\",\"description\":\"&quot;In God we trust. 
All others must bring data&quot;\",\"publisher\":{\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/rud.is\\\/b\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\\\/\\\/rud.is\\\/b\\\/#\\\/schema\\\/person\\\/d7cb7487ab0527447f7fda5c423ff886\",\"name\":\"hrbrmstr\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"url\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"contentUrl\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\",\"width\":460,\"height\":460,\"caption\":\"hrbrmstr\"},\"logo\":{\"@id\":\"https:\\\/\\\/i0.wp.com\\\/rud.is\\\/b\\\/wp-content\\\/uploads\\\/2023\\\/10\\\/ukr-shield.png?fit=460%2C460&ssl=1\"},\"description\":\"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7\",\"sameAs\":[\"http:\\\/\\\/rud.is\"],\"url\":\"https:\\\/\\\/rud.is\\\/b\\\/author\\\/hrbrmstr\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Reticulating Readability - rud.is","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/","og_locale":"en_US","og_type":"article","og_title":"Reticulating Readability - rud.is","og_description":"I needed to clean some web HTML content for a project and I usually use hgr::clean_text() for it and that generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-&#8220;main text content&#8221; from an HTML document and it usually does a good job but there are some pages [&hellip;]","og_url":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/","og_site_name":"rud.is","article_published_time":"2017-08-24T16:01:41+00:00","article_modified_time":"2018-03-10T12:54:26+00:00","author":"hrbrmstr","twitter_card":"summary_large_image","twitter_misc":{"Written by":"hrbrmstr","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/#article","isPartOf":{"@id":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/"},"author":{"name":"hrbrmstr","@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"headline":"Reticulating Readability","datePublished":"2017-08-24T16:01:41+00:00","dateModified":"2018-03-10T12:54:26+00:00","mainEntityOfPage":{"@id":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/"},"wordCount":1831,"commentCount":5,"publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"keywords":["post"],"articleSection":["R","web scraping"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/","url":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/","name":"Reticulating Readability - rud.is","isPartOf":{"@id":"https:\/\/rud.is\/b\/#website"},"datePublished":"2017-08-24T16:01:41+00:00","dateModified":"2018-03-10T12:54:26+00:00","breadcrumb":{"@id":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/rud.is\/b\/2017\/08\/24\/reticulating-readability\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/rud.is\/b\/"},{"@type":"ListItem","position":2,"name":"Reticulating Readability"}]},{"@type":"WebSite","@id":"https:\/\/rud.is\/b\/#website","url":"https:\/\/rud.is\/b\/","name":"rud.is","description":"&quot;In God we trust. 
All others must bring data&quot;","publisher":{"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/rud.is\/b\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/rud.is\/b\/#\/schema\/person\/d7cb7487ab0527447f7fda5c423ff886","name":"hrbrmstr","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","url":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","contentUrl":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1","width":460,"height":460,"caption":"hrbrmstr"},"logo":{"@id":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2023\/10\/ukr-shield.png?fit=460%2C460&ssl=1"},"description":"Don't look at me\u2026I do what he does \u2014 just slower. #rstats avuncular \u2022 ?Resistance Fighter \u2022 Cook \u2022 Christian \u2022 [Master] Chef des Donn\u00e9es de S\u00e9curit\u00e9 @ @rapid7","sameAs":["http:\/\/rud.is"],"url":"https:\/\/rud.is\/b\/author\/hrbrmstr\/"}]}},"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p23idr-1By","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":6215,"url":"https:\/\/rud.is\/b\/2017\/09\/04\/readability-redux\/","url_meta":{"origin":6172,"position":0},"title":"Readability Redux","author":"hrbrmstr","date":"2017-09-04","format":false,"excerpt":"I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text. 
I gave it a spin so folks could compare some basic output, but\u2026","rel":"","context":"In &quot;data wrangling&quot;","block_context":{"text":"data wrangling","link":"https:\/\/rud.is\/b\/category\/data-wrangling\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":6242,"url":"https:\/\/rud.is\/b\/2017\/09\/13\/revisiting-readability-with-rstudio\/","url_meta":{"origin":6172,"position":1},"title":"Revisiting Readability With RStudio","author":"hrbrmstr","date":"2017-09-13","format":false,"excerpt":"I've blogged about my in-development R package hgr a before and it's slowly getting to a CRAN release. There are two new features to it that are more useful in an interactive session than in a programmatic context. Since they build on each other, we'll take them in order. New\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/m4.png?fit=1200%2C900&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/m4.png?fit=1200%2C900&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/m4.png?fit=1200%2C900&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/m4.png?fit=1200%2C900&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/rud.is\/b\/wp-content\/uploads\/2017\/09\/m4.png?fit=1200%2C900&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":9579,"url":"https:\/\/rud.is\/b\/2018\/04\/12\/convert-epub-to-text-for-processing-in-r\/","url_meta":{"origin":6172,"position":2},"title":"Convert epub to Text for Processing in R","author":"hrbrmstr","date":"2018-04-12","format":false,"excerpt":"@RMHoge asked the following on Twitter: Hello #rstats hyve mind! Is there a package that reads epub into R? 
I can not find any, I now convert to text and parse the text but you sort of lose the structure of the text. Pinging @dataandme @hrbrmstr\u2014 Roel (@RMHoge) April 12,\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3028,"url":"https:\/\/rud.is\/b\/2014\/09\/20\/chartingmapping-the-scottish-vote-with-r-rvestdplyrtidyrtopojsonggplot\/","url_meta":{"origin":6172,"position":3},"title":"Charting\/Mapping the Scottish Vote with R (an rvest\/dplyr\/tidyr\/TopoJSON\/ggplot tutorial)","author":"hrbrmstr","date":"2014-09-20","format":false,"excerpt":"The BBC did a pretty good job [live tracking the Scotland secession vote](http:\/\/www.bbc.com\/news\/events\/scotland-decides\/results), but I really didn't like the color scheme they chose and decided to use the final tally site as the basis for another tutorial using the tools from the Hadleyverse and taking advantage of the fact that\u2026","rel":"","context":"In &quot;Charts &amp; Graphs&quot;","block_context":{"text":"Charts &amp; Graphs","link":"https:\/\/rud.is\/b\/category\/charts-graphs\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":4547,"url":"https:\/\/rud.is\/b\/2016\/07\/24\/mid-year-r-packages-update-summary\/","url_meta":{"origin":6172,"position":4},"title":"Mid-year R Packages Update Summary","author":"hrbrmstr","date":"2016-07-24","format":false,"excerpt":"I been updating some existing packages and github-releasing new ones (before a CRAN push). Most are \"cyber\"-related, but there are some general purpose ones. Here's a quick overview: docxtractr (CRAN, now, v0.2.0) was initially designed to make it easy to get data tables out of MS Word (docx) documents. 
The\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":11765,"url":"https:\/\/rud.is\/b\/2019\/01\/14\/splashr-0-6-0-now-uses-the-cran-nascent-stevedore-package-for-docker-orchestration\/","url_meta":{"origin":6172,"position":5},"title":"splashr 0.6.0 Now Uses the CRAN-nascent stevedore Package for Docker Orchestration","author":"hrbrmstr","date":"2019-01-14","format":false,"excerpt":"The splashr package [srht|GL|GH] \u2014 an alternative to Selenium for javascript-enabled\/browser-emulated web scraping \u2014 is now at version 0.6.0 (still in dev-mode but on its way to CRAN in the next 14 days). The major change from version 0.5.x (which never made it to CRAN) is a swap out of\u2026","rel":"","context":"In &quot;R&quot;","block_context":{"text":"R","link":"https:\/\/rud.is\/b\/category\/r\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6172","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/comments?post=6172"}],"version-history":[{"count":0,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/posts\/6172\/revisions"}],"wp:attachment":[{"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/media?parent=6172"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/categories?post=6172"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rud.is\/b\/wp-json\/wp\/v2\/tags?post=6172"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}