The longurl
package has been updated to version 0.3.0 as a result of a bug report noting that the URL expansion API it was using went pay-for-use. Since this was the second time a short URL expansion service either went belly-up or had breaking changes the package is now completely client-side-based and a very thin, highly-focused wrapper around the httr::HEAD()
function.
Why longurl
?
On the D&D alignment scale, short links are chaotic evil. [Full-disclosure: I use shortened links all the time, so the pot is definitely kettle-calling here]. Ostensibly, they are for making it easier to show memorable links on tiny, glowing rectangles or printed prose but they are mostly used to directly track you and mask other tracking parameters that the target site is using to keep tabs on you. Furthermore, short URLs are also used by those with even more malicious intent than greedy startups or mega-corporations.
In retrospect, giving a third-party API service access to URLs you are interested in expanding just exacerbated the tracking problem, but many of these third-party URL expansion services do use some temporal caching of results, so they can be a bit faster than doing this in a non-caching package (but, there’s nothing stopping you putting caching code around it if you are using it in a “production” capacity).
How does the updated package work without a URL expansion API?
By default, httr
“verb” requests use the curl
package and that is a wrapper for libcurl
. The httr
verb calls set the “please follow all HTTP status 3xx redirects that are found in responses” option (this is the libcurl
CURLOPT_FOLLOWLOCATION
equivalent option). There are other options that can be set to help configure minutae around how redirect following works. So, just by calling httr::HEAD(some_url)
you get built-in short URL expansion (if what you passed in was a short URL or a URL with a redirect).
Take, for example, this innocent link: http://lnk.direct/zFu
. We can see what goes on under the covers by passing in the verbose()
option to an httr::HEAD()
call:
httr::HEAD("http://lnk.direct/zFu", verbose())
## -> HEAD /zFu HTTP/1.1
## -> Host: lnk.direct
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: shorturl=4e0aql3p49rat1c8kqcrmv4gn2
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx/1.0.15
## <- Date: Sun, 18 Dec 2016 19:03:48 GMT
## <- Content-Type: text/html; charset=UTF-8
## <- Connection: keep-alive
## <- X-Powered-By: PHP/5.6.20
## <- Expires: Thu, 19 Nov 1981 08:52:00 GMT
## <- Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
## <- Pragma: no-cache
## <- Location: http://ow.ly/Ko70307eKmI
## <-
## -> HEAD /Ko70307eKmI HTTP/1.1
## -> Host: ow.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 301 Moved Permanently
## <- Content-Length: 0
## <- Location: http://bit.ly/2gZq7qG
## <- Connection: close
## <-
## -> HEAD /2gZq7qG HTTP/1.1
## -> Host: bit.ly
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 301 Moved Permanently
## <- Server: nginx
## <- Date: Sun, 18 Dec 2016 19:04:36 GMT
## <- Content-Type: text/html; charset=utf-8
## <- Content-Length: 127
## <- Connection: keep-alive
## <- Cache-Control: private, max-age=90
## <- Location: http://example.com/IT_IS_A_SURPRISE
## <-
## -> HEAD /IT_IS_A_SURPRISE HTTP/1.1
## -> Host: example.com
## -> User-Agent: libcurl/7.51.0 r-curl/2.3 httr/1.2.1
## -> Accept-Encoding: gzip, deflate
## -> Cookie: _csrf/link=g3iBgezgD_OYN0vOh8yI930E1O9ZAKLr4uHmVioxwwQ; mc=null; dmvk=5856d9e39e747; ts=475630; v1st=03AE3C5AD67E224DEA304AEB56361C9F
## -> Accept: application/json, text/xml, application/xml, */*
## ->
## <- HTTP/1.1 200 OK
## ...
## <-
We can reduce the clutter and see that it follows multiple redirects from multiple URL shorteners:
Here’s what the output of a request to longurl::expand_urls()
returns:
longurl::expand_urls("http://lnk.direct/zFu")
## # A tibble: 1 × 3
## orig_url expanded_url status_code
## <chr> <chr> <int>
## 1 http://lnk.direct/zFu http://example.com/IT_IS_A_SURPRISE 200
NOTE: the link does actually go somewhere, and somewhere not malicious, political or preachy (a rarity in general in this post-POTUS-election world of ours).
What else is different?
The longurl::expand_urls()
function returns a tbl_df
and now includes the HTTP status code of the final, resolved link. You can also pass in a custom HTTP referrer since many (many) sites will change behavior depending on the referrer.
What’s next?
This bug-fix release had to go out fairly quickly since the package was essentially broken. With the new foundation being built on client-side machinations, future enhancements will be to pull more features (in the machine learning sense) out of the curl
or httr
requests (I may switch directly to using curl
if I need more granular data) and include some basic visualizations for both request trees (mostly likely using the DiagrammeR
and ggplot2
packages). I may try to add a caching layer, but I believe that’s more of a situation-specific feature folks should add on their own, so I may just add a “check hook” capability that will add an extra function call to a cache checking function of your choosing.
If you have a feature request, please add it to the github repo.
The Actual, Über, Ultimate Fake News List
It is surprisingly shorter (or longer if you grok regular expressions) than you think it might be:
http[s]://.*
.Like it or not, bias is everywhere and tis woefully insidious. “Mainstream media” [MSM]; Alt-left; Alt-right; and, everything in-between creates and promotes “fake” news. Hourly. It always has. Due to the speed at which [dis]information travels now, this is the ultimate era of whatever _you_ believe is “true” is actually “true”, regardless of the morality or (unadulterated, non-theoretical) scientific facts. Sadly, it always has been this way and it always will be this way. Failing to recognize Truth tis the nature of an inherently flawed, broken species.
The reason the MSM is having trouble defining and coming to the grips with “fake news” is that it — itself — has been a witting and unwitting co-conspirator to it since before the invention of the printing press. Tis hard to see the big picture when one navel-gazes so much.
Nobody wants to be wrong and — right now — the only ones who _are_ wrong are those who disagree with _you_. Fundamentally, everyone wants their own aberrations to be normalized and we’re obliging that _in spades_ in our 21st century society. Go. Team.
Rather than sit back and accept said aberrations as the norm, verify **everything** that is presented as “fact”. That may mean opening an honest-to-goodness book or three. Sadly, given the sorry state of public & university libraries, you may need to go to multiple ones to finally get at an actual, real, fact. It may also mean coming to grips with that what you want *so desperately to be “OK”* is actually not OK. Just realize you have permission to be wrong.
Naively believe *nothing* (including this rant blog post). Challenge everything. Seek to recognize, acknowledge and promote Truth wherever it shines. Finally, don’t mistake your own desires for said Truth.