Continuing the blog’s UDF theme of late, there are two new UDF kids in town:

- drill-url-tools for slicing & dicing URI/URLs (just going to use ‘URL’ from now on in the post)
- drill-domain-tools for slicing & dicing internet domain names (IDNs)
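If you want a quick taste of each before the longer example below, something along these lines should work from R via sergeant once both UDF jars are installed (a minimal sketch; it assumes a Drill instance on localhost, and the exact shape of the returned maps is whatever the UDFs emit):

library(sergeant)

dc <- drill_connection("localhost")  # assumes a local drillbit

# break a URL into components; the port should get picked up from the
# scheme default even though it isn't in the URL itself
drill_query(dc, "SELECT url_parse('https://rud.is/b/') AS rec FROM (VALUES(1))")

# break a hostname into subdomain / assigned domain / public suffix ("tld")
drill_query(dc, "SELECT suffix_extract('www.example.co.uk') AS rec FROM (VALUES(1))")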
Now, if you’re an Apache Drill fanatic, you’re likely thinking “Hey hrbrmstr: don’t you know that Drill has a parse_url() function already?” My answer is “Sure, but it’s based on java.net.URL, which is fundamentally broken.”
Slicing & dicing URLs and IDNs is a large part of the $DAYJOB, and the two tasks go together pretty well, hence the joint UDF release.
Rather than just use boring SQL for an example, we’ll start with some SQL and then use R for a decent example of working with the two new UDFs.
Counting Lying Lock Icons
SSL/TLS is all the rage these days, so let’s see how many distinct sites in the GDELT Global Front Page (GFG) data set use port 443 vs port 80 (a decent proxy for HTTPS adoption, plus it will help show how the URL tools pick up default ports even when they’re not explicitly in the URL).
The aforementioned GFG page instructs us that the most current GFG data set URL can be retrieved by inspecting the contents of a metadata URL (the lastupdate.txt file used in the code below).
There are over a million records in that data set but — as we’ll see — not nearly as many distinct hosts.
Let’s get the data:
library(sergeant)
library(tidyverse)

# lastupdate.txt is a single, space-separated record; the third field (X3)
# is the URL of the most current GFG links file
read_delim(
  file = "http://data.gdeltproject.org/gdeltv3/gfg/alpha/lastupdate.txt",
  delim = " ",
  col_names = FALSE,
  col_types = "ccc"
) -> gfg_update

# grab the links file (if we haven't already)
dl_path <- file.path("~/Data/gfg_links.tsv.gz")
if (!file.exists(dl_path)) download.file(gfg_update$X3[1], dl_path)
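If you want a quick sanity check on what we just grabbed, readr will happily decompress the gzip'd TSV on read (an optional peek; the X1, X2, … column names are positional since the file has no header):

# peek at the first few records to make sure the download is sane
read_tsv(dl_path, col_names = FALSE, n_max = 5)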
Those operations put the GFG data set where my local Drill instance can get to it. It's a gzip'd, tab-separated (TSV) file which, while not a great data format, is workable with Drill.
Now we'll set up a SQL query that will parse the URLs and domains, giving us a nice rectangular structure for R & dbplyr. We'll use the second column since a significant percentage of the URLs in column 6 are malformed:
db <- src_drill()
tbl(db, "(
SELECT
b.host,
port,
b.rec.hostname AS hostname,
b.rec.assigned AS assigned,
b.rec.tld AS tld,
b.rec.subdomain AS subdomain
FROM
(SELECT
host, port, suffix_extract(host) AS rec -- break the hostname into components
FROM
(SELECT
a.rec.host AS host, a.rec.port AS port
FROM
(SELECT
columns[1] AS url, url_parse(columns[1]) AS rec -- break the URL into components
FROM dfs.d.`/gfg_links.tsv.gz`) a
WHERE a.rec.port IS NOT NULL -- filter out URL parsing failures
)
) b
WHERE b.rec.tld IS NOT NULL -- filter out domain parsing failures
)") -> gfg_df
gfg_df
## # Database: DrillConnection
## hostname port host subdomain assigned tld
## 1 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 2 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 3 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 4 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 5 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 6 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 7 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 8 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 9 www 80 www.eestikirik.ee NA eestikirik.ee ee
## 10 www 80 www.eestikirik.ee NA eestikirik.ee ee
## # ... with more rows
While we could have done it all in SQL, we saved some bits for R:
distinct(gfg_df, assigned, port) %>%
count(port) %>%
collect() -> port_counts
port_counts
## # A tibble: 2 x 2
##   port      n
## 1 80    20648
## 2 443   22178
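It's worth noting that dbplyr translates that distinct()/count() pipeline into SQL and Drill does the heavy lifting; only collect() pulls the (tiny) result into R. If you're curious what gets shipped over the wire, show_query() will print the translated SQL:

# inspect the SQL dbplyr generates for the pipeline (before collect())
distinct(gfg_df, assigned, port) %>%
  count(port) %>%
  show_query()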
You'd think more news-oriented sites would be HTTPS by default given the current global political climate (though those lock icons are no safety panacea by any stretch of the imagination).
FIN
Now, R can do URL & IDN slicing, but Drill can operate at scale. That is, R's urltools package may be fine for single-node, in-memory ops, but Drill can process billions of URLs when it's part of a cluster.
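For comparison, here's roughly what the single-node urltools version of the same slicing & dicing looks like (a toy sketch on a couple of hand-made URLs):

library(urltools)

urls <- c("https://www.eestikirik.ee/", "http://blog.example.co.uk/post")

url_parse(urls)               # scheme / domain / port / path / ... as a data frame
suffix_extract(domain(urls))  # host, subdomain, domain & suffix components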
I'm not 100% settled on the galimatias library for URL parsing (I need to do some extended testing) and I may add some less-strict IDN slicing & dicing functions as well.
Kick the tyres & file issues & PRs as necessary.