[![Build Status](https://travis-ci.org/tumugi/tumugi-plugin-bigquery.svg?branch=master)](https://travis-ci.org/tumugi/tumugi-plugin-bigquery) [![Code Climate](https://codeclimate.com/github/tumugi/tumugi-plugin-bigquery/badges/gpa.svg)](https://codeclimate.com/github/tumugi/tumugi-plugin-bigquery) [![Coverage Status](https://coveralls.io/repos/github/tumugi/tumugi-plugin-bigquery/badge.svg?branch=master)](https://coveralls.io/github/tumugi/tumugi-plugin-bigquery) [![Gem Version](https://badge.fury.io/rb/tumugi-plugin-bigquery.svg)](https://badge.fury.io/rb/tumugi-plugin-bigquery)

# Google BigQuery plugin for [tumugi](https://github.com/tumugi/tumugi)

tumugi-plugin-bigquery is a plugin to integrate [Google BigQuery](https://cloud.google.com/bigquery/) with [tumugi](https://github.com/tumugi/tumugi).

## Installation

Add this line to your application's Gemfile:
gem 'tumugi-plugin-bigquery'
```

And then execute `bundle install`.

## Target

### Tumugi::Plugin::BigqueryDatasetTarget

`Tumugi::Plugin::BigqueryDatasetTarget` is a target for a BigQuery dataset.

#### Parameters

| name       | type   | required? | default | description                                                      |
|------------|--------|-----------|---------|------------------------------------------------------------------|
| dataset_id | string | required | | Dataset ID |
| project_id | string | optional | | [Project](https://cloud.google.com/compute/docs/projects) ID |

#### Examples

```rb
task :task1 do
  output target(:bigquery_dataset, dataset_id: "your_dataset_id")
end
```

```rb
task :task1 do
  output target(:bigquery_dataset, project_id: "project_id", dataset_id: "dataset_id")
end
```

### Tumugi::Plugin::BigqueryTableTarget

`Tumugi::Plugin::BigqueryTableTarget` is a target for a BigQuery table.

#### Parameters

| name | type | required? | default | description |
|------------|--------|-----------|---------|------------------------------------------------------------------|
| table_id | string | required | | Table ID |
| dataset_id | string | required | | Dataset ID |
| project_id | string | optional | | [Project](https://cloud.google.com/compute/docs/projects) ID |

#### Examples

```rb
task :task1 do
  output target(:bigquery_table, table_id: "table_id", dataset_id: "your_dataset_id")
end
```
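
Like the dataset target, the table target accepts an explicit `project_id` when the table lives in a different project:

```rb
task :task1 do
  output target(:bigquery_table, project_id: "project_id", dataset_id: "dataset_id", table_id: "table_id")
end
```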

## Task

### Tumugi::Plugin::BigqueryDatasetTask

`Tumugi::Plugin::BigqueryDatasetTask` is a task to create a dataset.

#### Parameters

| name | type | required? | default | description |
|------------|--------|-----------|---------|------------------------------------------------------------------|
| dataset_id | string | required | | Dataset ID |
| project_id | string | optional | | [Project](https://cloud.google.com/compute/docs/projects) ID |

#### Examples

```rb
task :task1, type: :bigquery_dataset do
  dataset_id 'test'
end
```
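
`project_id` can be set in the same way when the dataset should be created in another project:

```rb
task :task1, type: :bigquery_dataset do
  project_id 'other_project'
  dataset_id 'test'
end
```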

### Tumugi::Plugin::BigqueryQueryTask

`Tumugi::Plugin::BigqueryQueryTask` is a task to run a query and save the result into the table specified by the parameters.

#### Parameters

| name | type | required? | default | description |
|-----------------|---------|-----------|------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| query | string | required | | query to execute |
| table_id | string | required | | destination table ID |
| dataset_id | string | required | | destination dataset ID |
| project_id | string | optional | | destination project ID |
| mode | string | optional | "truncate" | specifies the action that occurs if the destination table already exists. [see](#parameters_mode) |
| flatten_results | boolean | optional  | true       | whether BigQuery automatically flattens nested and repeated data in the query results. [see](https://cloud.google.com/bigquery/docs/data#flatten) |
| use_legacy_sql  | bool    | optional  | true       | use legacy SQL syntax for BigQuery or not |
| wait | integer | optional | 60 | wait time (seconds) for query execution |

#### Examples

##### truncate mode (default)

```rb
task :task1, type: :bigquery_query do
  query "SELECT COUNT(*) AS cnt FROM [bigquery-public-data:samples.wikipedia]"
  table_id "dest_table#{Time.now.to_i}"
  dataset_id "test"
end
```

##### append mode

If you set `mode` to `append`, the query result is appended to the existing table.

```rb
task :task1, type: :bigquery_query do
  query "SELECT COUNT(*) AS cnt FROM [bigquery-public-data:samples.wikipedia]"
  table_id "dest_table#{Time.now.to_i}"
  dataset_id "test"
  mode "append"
end
```
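
Since `use_legacy_sql` defaults to `true`, the examples above use legacy SQL table syntax. The sketch below, assuming `use_legacy_sql` can be set through the task DSL like the other parameters, runs the same count with standard SQL:

```rb
task :task1, type: :bigquery_query do
  query "SELECT COUNT(*) AS cnt FROM `bigquery-public-data.samples.wikipedia`"
  table_id "dest_table#{Time.now.to_i}"
  dataset_id "test"
  use_legacy_sql false  # standard SQL uses backquoted `project.dataset.table` names
end
```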

### Tumugi::Plugin::BigqueryCopyTask

`Tumugi::Plugin::BigqueryCopyTask` is a task to copy the table specified by the parameters.

#### Parameters

| name | type | required? | default | description |
|-----------------|--------|-----------|---------|---------------------------------------------------------|
| src_table_id | string | required | | source table ID |
| src_dataset_id | string | required | | source dataset ID |
| src_project_id | string | optional | | source project ID |
| dest_table_id | string | required | | destination table ID |
| dest_dataset_id | string | required | | destination dataset ID |
| dest_project_id | string | optional | | destination project ID |
| force_copy | bool | optional | false | force copy when destination table already exists or not |
| wait            | integer| optional  | 60      | wait time (seconds) for the copy job                     |

#### Examples

Copy `test.src_table` to `test.dest_table`.

##### Normal use case

```rb
task :task1, type: :bigquery_copy do
  src_table_id "src_table"
  src_dataset_id "test"
  dest_table_id "dest_table"
  dest_dataset_id "test"
end
```

##### force_copy

If `force_copy` is `true`, the copy operation always executes even if the destination table exists.
This means the destination table's data is deleted, so be careful when enabling this parameter.

```rb
task :task1, type: :bigquery_copy do
  src_table_id "src_table"
  src_dataset_id "test"
  dest_table_id "dest_table"
  dest_dataset_id "test"
  force_copy true
end
```

### Tumugi::Plugin::BigqueryLoadTask

`Tumugi::Plugin::BigqueryLoadTask` is a task to load structured data from Google Cloud Storage (GCS) into BigQuery.

#### Parameters

| name | type | required? | default | description |
|-----------------------|-----------------|------------------------------------|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| bucket | string | required | | source GCS bucket name |
| key                   | string          | required                           |                     | source file path like "/path/to/file.csv" |
| table_id | string | required | | destination table ID |
| dataset_id | string | required | | destination dataset ID |
| project_id | string | optional | | destination project ID |
| schema | array of object | required when mode is not "append" | | see [schema format](#parameters_schema) |
| mode | string | optional | "append" | specifies the action that occurs if the destination table already exists. [see](#parameters_mode) |
| source_format | string | optional | "CSV" | source file format. [see](#parameters_format) |
| ignore_unknown_values | bool | optional | false | indicates if BigQuery should allow extra values that are not represented in the table schema |
| max_bad_records | integer | optional | 0 | maximum number of bad records that BigQuery can ignore when running the job |
| field_delimiter | string | optional | "," | separator for fields in a CSV file. used only when source_format is "CSV" |
| allow_jagged_rows | bool | optional | false | accept rows that are missing trailing optional columns. The missing values are treated as null. used only when source_format is "CSV" |
| allow_quoted_newlines | bool | optional | false | indicates if BigQuery should allow quoted data sections that contain newline characters in a CSV file. used only when source_format is "CSV" |
| quote | string | optional | "\"" (double-quote) | value that is used to quote data sections in a CSV file. used only when source_format is "CSV" |
| skip_leading_rows     | integer         | optional                           | 0                   | number of rows at the top of a CSV file that BigQuery will skip when loading the data. used only when source_format is "CSV" |
| wait                  | integer         | optional                           | 60                  | wait time (seconds) for the load job |

#### Examples

Load `gs://test_bucket/load_data.csv` into `dest_project:dest_dataset.dest_table`.

```rb
task :task1, type: :bigquery_load do
  bucket "test_bucket"
  key "load_data.csv"
  table_id "dest_table"
  dataset_id "dest_dataset"
  project_id "dest_project"
end
```
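
For CSV sources with a header row or non-default formatting, the CSV-specific parameters can be combined with the required ones. A sketch, with an illustrative tab-separated source file (`mode` stays at its default `"append"`, so no `schema` is needed):

```rb
task :task1, type: :bigquery_load do
  bucket "test_bucket"
  key "load_data.tsv"    # illustrative file name
  table_id "dest_table"
  dataset_id "dest_dataset"
  source_format "CSV"
  field_delimiter "\t"   # tab-separated input
  skip_leading_rows 1    # skip the header row
  max_bad_records 10     # tolerate up to 10 malformed rows
end
```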

## Common parameter values

<a id="#parameters_mode"></a>
### mode

| value | description |
|----------|-------------|
| truncate | If the table already exists, BigQuery overwrites the table data. |
| append | If the table already exists, BigQuery appends the data to the table. |
| empty | If the table already exists and contains data, a 'duplicate' error is returned in the job result. |

<a id="#parameters_format"></a>
### format

| value | description |
|------------------------|--------------------------------------------|
| CSV | CSV |
| NEWLINE_DELIMITED_JSON | newline-delimited JSON (one JSON object per line) |
| AVRO                   | [Apache Avro](https://avro.apache.org/docs/1.2.0/) |

<a id="#parameters_schema"></a>
### schema

The `schema` parameter is an array of (optionally nested) objects like below:

```js
[
  {
    "name": "column1",
    "type": "string"
  },
  {
    "name": "column2",
    "type": "integer",
    "mode": "repeated"
  },
  {
    "name": "record1",
    "type": "record",
    "fields": [
      {
        "name": "key1",
        "type": "integer"
      },
      {
        "name": "key2",
        "type": "integer"
      }
    ]
  }
]
```
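
When `mode` is not `"append"`, the load task needs this schema. A sketch, assuming the DSL accepts the structure above as an array of Ruby hashes:

```rb
task :task1, type: :bigquery_load do
  bucket "test_bucket"
  key "load_data.csv"
  table_id "dest_table"
  dataset_id "dest_dataset"
  mode "truncate"  # schema is required because mode is not "append"
  schema [
    { "name" => "column1", "type" => "string" },
    { "name" => "column2", "type" => "integer", "mode" => "repeated" }
  ]
end
```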

## Config Section

tumugi-plugin-bigquery provides a config section named "bigquery" which can specify BigQuery authentication info.

### Authenticate by client_email and private_key

```rb
Tumugi.configure do |config|
  # NOTE: the section body below is a sketch reconstructed from the heading
  # above; the exact key names (project_id, client_email, private_key) are assumptions.
  config.section("bigquery") do |section|
    section.project_id = "your_project_id"
    section.client_email = "your_client_email"
    section.private_key = "your_private_key"
  end
end
```

### Authenticate by JSON key file

```rb
Tumugi.configure do |config|
  # NOTE: sketch reconstructed from the heading above; the key name
  # (private_key_file) and the path are assumptions.
  config.section("bigquery") do |section|
    section.private_key_file = "/path/to/your/key.json"
  end
end
```
