-
Notifications
You must be signed in to change notification settings - Fork 21
/
extending_etl.Rmd
114 lines (80 loc) · 5.92 KB
/
extending_etl.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
---
title: "Extending etl"
author: "Ben Baumer"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Extending etl}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
The `etl` package provides a framework for working with **medium data**. These data are typically:
- downloadable via publicly-accessible URLs
- larger than what would comfortably fit in a laptop's memory (e.g. several gigabytes)
- ideally stored in a SQL-based relational database management system
The `etl` package provides a series of S3 generics and convenience functions that facilitate consistent development of `etl`-style packages. Current examples include:
- [macleish](https://github.com/beanumber/etl): [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/macleish)](https://cran.r-project.org/package=macleish)
Weather and spatial data from the MacLeish Field Station in Whately, MA.
- [nyctaxi](https://github.com/beanumber/nyctaxi): [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/nyctaxi)](https://cran.r-project.org/package=nyctaxi)
Trip data from the New York City Taxi and Limousine Commission
- [airlines](https://github.com/beanumber/airlines): On-time flight arrival data from the Bureau of Transportation Statistics
- [citibike](https://github.com/beanumber/citibike): Municipal bike-sharing system in New York City
- [nyc311](https://github.com/beanumber/nyc311): Phone calls to New York City's feedback hotline
- [fec](https://github.com/beanumber/fec): Campaign contribution data from the Federal Election Commission
- [imdb](https://github.com/beanumber/imdb): Mirror of the Internet Movie Database
Please see these packages for examples of how to develop an `etl`-dependent package. This vignette documents how these extensions can be crafted.
## The basics
Suppose that you want to write the `etl`-dependent package `foo`. In order for `foo` to work, it should implement at least one of these methods (and sometimes all three):
1. `etl_extract.etl_foo()` - downloads data from Internet
2. `etl_transform.etl_foo()` - takes downloaded data and puts it into CSV format suitable for import to SQL
3. `etl_load.etl_foo()` - imports data into a SQL database
That's basically it. The rest of the machinery should be taken care of for you. In particular, `etl_extract()`, `etl_transform()`, and `etl_load()` are all generic functions that have sensible default methods.
1. `etl_extract.default()` - pulls all of the `data.frame`s available through the `data()` function for `foo`, and writes them as CSVs
2. `etl_extract.default()`- simply copies all of the CSVs to the load directory.
3. `etl_load.default()` - imports all of those CSVs into a SQL database
Note that you don't have to write an `etl` method to handle `foo`. You do, however, have to have the `foo` package installed in order for the `etl` instantiation function to work.
```{r, error=TRUE, message=FALSE}
library(etl)
foo <- etl("foo")
```
To see the default methods in action, pick a package with some data and import it.
```{r}
ggplots <- etl("ggplot2") %>%
etl_update()
src_tbls(ggplots)
```
## The details
### Main `etl` methods
Each of the three main `etl` methods must take an `etl_foo` object as it's first argument, and (should invisibly) return an `etl_foo` object. These methods are *pipeable* and *predictable*, but not *pure*, since they by design have side-effects (i.e. downloading files, etc.) Your major task in writing the `foo` package will be to write these functions. How you write them is entirely up to you, and the particular implementation will of course depend on what the purpose of `foo` is.
All three of the main `etl` methods should take the same set or arguments. Most commonly these define the span of time for the files that you want to extract, transform, or load. For example, in the `airlines` package, these functions take optional `year` and `month` arguments.
We illustrate with `cities`, which unfortunately takes only `...`. Also, `etl_cities` uses `etl_load.default()`, so there is no `etl:::etl_load.etl_cities()` method.
```{r, error=TRUE}
etl_extract.etl_cities %>% args()
etl_transform.etl_cities %>% args()
etl_load.etl_cities %>% args()
```
### Other `etl` methods
There are four additional functions in the `etl` toolchain:
1. `etl_init()` - initialize the database
2. `etl_cleanup()` - delete unnecessary files
3. `etl_update()` - run `etl_extract`, `etl_transform()` and `etl_load()` in succession with the same arguments
4. `etl_create()` - run `etl_init()`, `etl_update()`, and `etl_cleanup()` in succession
These functions can generally be used without modification and thus are *not* commonly extended by `foo`.
The `etl_init()` function will initialize the SQL database.
If you want to contribute your own hard-coded SQL initialization script, it must be placed in `inst/sql/`. The `etl_init()` function will look there, and find files whose file extensions match the database type. For example, scripts written for MySQL should have the `.mysql` file extension, while scripts written for PostgreSQL should have the `.postgresql` file extension.
If no such file exists, all of the tables and views in the database will be deleted, and new tables schemas will be created on-the-fly by `dplyr`.
### `etl_foo` object attributes
Every `etl_foo` object has a directory where it can store files and a `DBIConnection` where it can write to a database. By default, these come from `tempdir()` and `RSQLite::SQLite()`, but the user can alternatively specify other locations.
```{r}
cities <- etl("cities")
str(cities)
```
Note that an `etl_foo` object is also a `src_dbi` object and a `src_sql` object. Please see the [`dbplyr` vignette](https://CRAN.R-project.org/package=dbplyr) for more information about these database connections.
## References
- Wickham, Hadley, [*R Packages*](https://r-pkgs.org/)
- Wickham, Hadley, [*Advanced R*](https://adv-r.hadley.nz/)
```{r}
citation("etl")
citation("dplyr")
citation("dbplyr")
```