(For first-timers, R⁶ tagged posts are short & sweet with minimal exposition; R⁶ feed)
At work-work I mostly deal with medium-to-large-ish data. I often want to poke at new or existing data sets w/o working across billions of rows. I also use Apache Drill for much of my exploratory work.
Here’s how to uniformly sample data from Apache Drill using the sergeant package:
library(sergeant)
db <- src_drill("sonar")
tbl <- tbl(db, "dfs.dns.`aaaa.parquet`")
summarise(tbl, n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
## <int>
## 1 19977415
mutate(tbl, r=rand()) %>%
filter(r <= 0.01) %>%
summarise(n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
## <int>
## 1 199808
mutate(tbl, r=rand()) %>%
filter(r <= 0.50) %>%
summarise(n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
## <int>
## 1 9988797
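The counts above land near, but not exactly on, 1% and 50% of the table because filtering on a per-row random draw keeps each row independently with probability p. Here's a minimal local sketch of the same idea in base R, no Drill connection needed. `bernoulli_sample()` and the `toy` data frame are hypothetical names for illustration, and `runif()` stands in for Drill's `rand()`:

```r
# Hypothetical helper: keep each row independently with probability p.
# runif() plays the role that rand() plays inside the Drill query.
bernoulli_sample <- function(df, p) {
  r <- runif(nrow(df))          # one uniform draw between 0 and 1 per row
  df[r <= p, , drop = FALSE]    # same predicate as filter(r <= p)
}

set.seed(2018)                  # only so the sketch is repeatable
toy <- data.frame(id = seq_len(100000))

one_pct <- bernoulli_sample(toy, 0.01)
half    <- bernoulli_sample(toy, 0.50)

# Sampled fractions: close to 0.01 and 0.50, but not exact
nrow(one_pct) / nrow(toy)
nrow(half) / nrow(toy)
```

If an exact sample size is needed, this approach won't give it; it trades exactness for a single cheap pass over the data.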
And, for groups (using a different/larger “database”):
fdns <- tbl(db, "dfs.fdns.`201708`")
summarise(fdns, n=n())
## # Source: lazy query [?? x 1]
## # Database: DrillConnection
## n
## <int>
## 1 1895133100
filter(fdns, type %in% c("cname", "txt")) %>%
count(type)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
## type n
## <chr> <int>
## 1 cname 15389064
## 2 txt 67576750
filter(fdns, type %in% c("cname", "txt")) %>%
group_by(type) %>%
mutate(r=rand()) %>%
ungroup() %>%
filter(r <= 0.15) %>%
count(type)
## # Source: lazy query [?? x 2]
## # Database: DrillConnection
## type n
## <chr> <int>
## 1 cname 2307604
## 2 txt 10132672
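Since every row gets its own draw, each group ends up sampled at roughly the same rate (about 15% of both `cname` and `txt` above). A small base R sketch on a made-up data frame checks that per-group fraction locally. The `toy` data and its group sizes are hypothetical, and `runif()` again stands in for `rand()`:

```r
set.seed(2018)                  # only so the sketch is repeatable

# Hypothetical toy data with two unevenly sized groups
toy <- data.frame(
  type = rep(c("cname", "txt"), times = c(20000, 80000)),
  stringsAsFactors = FALSE
)

toy$r   <- runif(nrow(toy))                      # one draw per row
sampled <- toy[toy$r <= 0.15, , drop = FALSE]    # same predicate as filter(r <= 0.15)

# Fraction of each group that survived the filter
before <- table(toy$type)
after  <- table(sampled$type)
frac   <- as.numeric(after[names(before)]) / as.numeric(before)
names(frac) <- names(before)
frac
```

Both fractions should hover around 0.15 regardless of how lopsided the groups are.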
I will (hopefully) be better at cranking out these bite-sized posts more frequently in 2018.