Recipe 3 Wiring Up Drill and R (dplyr
-style)
3.1 Problem
You have Drill installed and want to work with Drill from R using dplyr
idioms.
3.2 Solution
Install, load and use the sergeant
package.
3.3 Discussion
Working with Drill in a dplyr
context is pretty straightforward. Assuming you have drill running you need to:
- connect to the database
- identify a table
- perform normal operations
The Drill query interface (http://localhost:8047/query) provides an example query we can use for testing if your setup is working:
SELECT * FROM cp.`employee.json` LIMIT 20
Let’s take a look at the employee.json
table using R & dplyr
.
## src: DrillConnection
## tbls: cp.default, dfs.default, dfs.otg, dfs.root, dfs.samsung, dfs.tmp,
## dfs.wikimedia, INFORMATION_SCHEMA, sys
## # Source: table<cp.`employee.json`> [?? x 16]
## # Database: DrillConnection
## store_id gender department_id birth_date supervisor_id last_name
## <int> <chr> <int> <date> <int> <chr>
## 1 0 F 1 1961-08-26 0 Nowmer
## 2 0 M 1 1915-07-03 1 Whelply
## 3 0 M 1 1969-06-20 1 Spence
## 4 0 F 1 1951-05-10 1 Gutierrez
## 5 0 F 2 1942-10-08 1 Damstra
## 6 0 F 3 1949-03-27 1 Kanagaki
## 7 9 F 11 1922-08-10 5 Brunner
## 8 21 F 11 1979-06-23 5 Blumberg
## 9 0 M 5 1949-08-26 1 Stanz
## 10 1 M 11 1967-06-20 5 Murraiin
## # ... with more rows, and 10 more variables: position_title <chr>,
## # hire_date <dttm>, management_role <chr>, salary <dbl>,
## # marital_status <chr>, full_name <chr>, employee_id <int>,
## # education_level <chr>, first_name <chr>, position_id <int>
If you’ve never worked with the dplyr
and databases before, you may not be aware that what’s really happening is different than what you experince with local R data frames.
The call to src_drill("localhost")
does some RJDBC/DBI “magic” behind the scenes to setup all the necessary connection parameters.
By printing db
, a SQL query —SHOW DATABASES
—is issued to get available storage/databases and reformats it to conform to the expected dplyr
datbase API.
When tbl(...)
is executed a
SELECT * FROM cp.`employee.json` LIMIT 1
is issued to obtain field information for future operations on employee
.
Finally, when employee
is printed, the query
SELECT * FROM cp.`employee.json` LIMIT 10
is issued and the contents transferred back into R.
Notice that the data is really all still in Drill. Future recipes introduce dplyr::collect()
which is how you ultimately transfer the query results back into an R object.