Bq date partitioning #641

Merged
merged 26 commits into development from bq-date-partitioning
Feb 12, 2018
Conversation

drewbanin
Contributor

@drewbanin drewbanin commented Jan 19, 2018

This branch adds support for date partitioning on BigQuery in dbt.

TODO:

  • fix tests
  • prevent views from overwriting date-partitioned table, as this could delete a significant amount of data!

Usage

Run for a single day, hard-coded

Simple usage: specify a single date manually.

{{
    config(
        materialized='table',
        partition_date='20180101',
    )
}}

select *
from `public`.`events_20180101`

Run for a range of days

More complex usage: specify a range of dates. The date_sharded_table macro will interpolate the 8-digit date for each day between January 1st and January 10th, inclusive. The resulting date-partitioned table will have 10 partitions, one per day, each built from the corresponding date-sharded events_[YYYYMMDD] table.

{{
    config(
        materialized='table',
        partition_date='20180101,20180110',
    )
}}

select *
from `public`.`{{ date_sharded_table('events_') }}`

Dynamically specify partition date(s)

Use a variable instead of hardcoding a single date. The variable defaults to "yesterday" if partition_date is not provided.

-- macros/datetime.sql
{% macro yesterday() -%}
    {%- set delta = modules.datetime.timedelta(days=-1) -%}
    {{ return((run_started_at + delta).strftime('%Y%m%d')) }}
{%- endmacro %}

-- models/partitioned.sql
{{
    config(
        materialized='table',
        partition_date=var('partition_date', yesterday()),
    )
}}

select *
from `public`.`{{ date_sharded_table('events_') }}`

This branch is intended to be used in conjunction with #640 to supply variables to date partitioned tables on the command line.
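
For example, with the --vars support from #640, the partition date could be supplied at invocation time rather than hard-coded (the date here is illustrative):

$ dbt run --vars 'partition_date: 20180102'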

Additional configuration

A full list of configuration options for date partitioned tables is shown below:

{{
    config(
        materialized='table',
        partition_date='2018-01-01,2018-01-10',
        partition_date_format='%Y-%m-%d',
        verbose=True
    )
}}
  • partition_date_format : The date format (using strptime/strftime conventions) with which to parse the partition_date field. Default: %Y%m%d
  • verbose : If set to True, dbt will output one log line for each date partition created during the invocation of the date partitioned model. Default: False

With verbose=True, the output looks like:
$ dbt run
Found 1 models, 1 tests, 0 archives, 0 analyses, 49 macros, 0 operations

19:43:10 | Concurrency: 1 threads (target='dev')
19:43:10 |
19:43:18 | 1 of 1 START table model dbt_dbanin.partitioned.................. [RUN]
19:43:18 | -> Running for day 20180101
19:43:19 | -> Running for day 20180102
19:43:21 | -> Running for day 20180103
19:43:22 | 1 of 1 OK created table model dbt_dbanin.partitioned............. [CREATED 3 PARTITIONS in 4.60s]
19:43:22 |
19:43:22 | Finished running 1 table models in 13.23s.

Completed successfully

Done. PASS=1 ERROR=0 SKIP=0 TOTAL=1

@drewbanin drewbanin requested a review from cmcarthur January 29, 2018 00:28
@drewbanin drewbanin added this to the 0.9.2 milestone Jan 29, 2018
@cmcarthur
Member

my gut reaction to the 20180101,20180110 syntax is that it's not ideal. it seems like it'd be better to pass a range(20180101, 20180110) as the partition_date. that's more powerful & would let you do things like partition_date=[20180101, 20180201] (rebuild jan 1 and feb 1, specifically). but my guess is that you used 20180101,20180110 so that you can pass in a variable range on the command line, is that right?

Member

@cmcarthur cmcarthur left a comment

i left a few comments that need to be addressed

    all_tables = []
    for schema in schemas:
        dataset = cls.get_dataset(profile, schema, model_name)
-       all_tables.extend(dataset.list_tables())
+       all_tables.extend(client.list_tables(dataset))

Member

oof, is this the API change you were referencing?

Contributor Author

yeah :/

-    relation_object.delete()
+    client.delete_table(relation_object)

     cls.release_connection(profile, model_name)

Member

i don't think it's correct to release the connection here -- what if you drop the table first, and then create it? i guess for bigquery it makes no difference, but better to exclude it if extraneous

-    res = cls.fetch_query_results(query)
+    res = list(iterator)

     cls.release_connection(profile, model_name)

Member

same comment as on drop re: releasing connection

-    dataset.create()
+    client.create_dataset(dataset)

     cls.release_connection(profile, model_name)

Member

ditto

for table in client.list_tables(dataset):
    client.delete_table(table.reference)

cls.release_connection(profile, name=None)

Member

{% for i in range(0, day_count + 1) %}
{% set the_day = (modules.datetime.timedelta(days=i) + start_date).strftime('%Y%m%d') %}
{% if verbose %}

Member

nice

The provided partition date '{{ date_str }}' does not match the expected format '{{ date_fmt }}'
{%- endset %}

{% set res = try_or_compiler_error(error_msg, modules.datetime.datetime.strptime, date_str.strip(), date_fmt) %}

Member

this is really clever haha

@@ -173,6 +173,7 @@ def setUp(self):

     # it's important to use a different connection handle here so
     # we don't look into an incomplete transaction
+    adapter.cleanup_connections()

Member

i think this obviates the need for release_connection everywhere else

@drewbanin
Contributor Author

@cmcarthur my first cut of this used start_date and end_date as two different variables instead of a single partition_date. That worked pretty well, but it was confusing that when end_date isn't set, the date partitioning only runs for the start_date. Moreover, it's a little more difficult to type out a yaml dictionary with two elements on the command line IMO.

I like the idea of using range conceptually, but these dates are essentially strings, not integers! Eg:

range(20180131, 20180201)

^this would run for 20180131 (good) and then 20180132 (bad), so we'd need to implement our own sort of date_range function I think.

I think you're right though -- it's unusual that partition_date accepts a comma-separated string. One other option I can think of is to make the materialization accept a list of dates to run for, where this list can be generated by a macro from a start/end date pair. We can implement this macro in the global project to make this easy/transparent for users.

So:

$ dbt run --vars 'partition_date: "20180101, 20180131"'

Then in your model:

{{
    config(
        materialized='table',
        partition_date=date_range_to_list(var('partition_date')),
    )
}}

...

and then a macro which looks like:

{% macro date_range_to_list(range_str) %}

  start, end = range_str.split(",")
  dates = []
  for each date in range(start,end):
    dates.append(date)

  return dates

{% endmacro %}
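
In actual Jinja, that sketch might look roughly like this (untested; it assumes the modules.datetime context used above and 8-digit YYYYMMDD date strings):

{% macro date_range_to_list(range_str) %}
    {%- set parts = range_str.split(",") -%}
    {%- set start_date = modules.datetime.datetime.strptime(parts[0].strip(), '%Y%m%d') -%}
    {%- set end_date = modules.datetime.datetime.strptime(parts[-1].strip(), '%Y%m%d') -%}
    {%- set dates = [] -%}
    {# one entry per day, inclusive of both endpoints #}
    {%- for i in range(0, (end_date - start_date).days + 1) -%}
        {%- set _ = dates.append((start_date + modules.datetime.timedelta(days=i)).strftime('%Y%m%d')) -%}
    {%- endfor -%}
    {{ return(dates) }}
{% endmacro %}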

So the CLI interface is the same, but the macro interface can work for a single date, a date range, or a smattering of random dates.

Let me know what you think about this kind of approach

@cmcarthur
Member

I like this approach a lot. We can write up docs on how to set your exact CLI syntax up for a project, but if someone really wanted to implement the other use case, they could do it themselves via:

{{ config(partition_date=var('partition_dates').split(',')) }}
$ dbt run --vars 'partition_dates: 20180101, 20180201'

@drewbanin
Contributor Author

Ok, i'll do that

@drewbanin
Contributor Author

This is now implemented such that the table materialization can accept a list of "partitions". E.g.:

{{
    config(
        materialized='table',
        partitions=['20180101', '20180102', '20180103'],
        verbose=True
    )
}}

The type checking is pretty loosey-goosey here, so you can give a list of strings, a list of ints, a single string, a single int, etc. These dates must be provided in BigQuery date format, i.e. an 8-character series of digits.

To generate this list of dates, users can use the dbt.partition_range function built into the dbt global project. In practice, this looks like:

{{
    config(
        materialized='table',
        partitions=dbt.partition_range('20180101, 20180201'),
        verbose=True
    )
}}

This partition_range function will generate a list of dates in the range of the two provided dates. If only one date is provided, the resulting date range will only contain the date specified. This function also accepts an optional date format string. Finally, this macro can be combined with CLI vars to configure date ranges from the CLI, e.g.

$ dbt run --model partitioned_model --vars 'dates: "20180101, 20180201"'

coupled with

{{
    config(
        materialized='table',
        partitions=dbt.partition_range(var('dates')),
        verbose=True
    )
}}

Users can further extend these macros to simplify patterns which they use frequently.
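
For instance, a project might wrap partition_range in a macro of its own so that the dates var defaults to yesterday when it isn't passed on the command line (a sketch; partitions_for_run is a hypothetical name):

-- macros/partitions.sql
{% macro partitions_for_run() %}
    {# default to yesterday's date, mirroring the yesterday() macro above #}
    {%- set delta = modules.datetime.timedelta(days=-1) -%}
    {%- set yesterday = (run_started_at + delta).strftime('%Y%m%d') -%}
    {{ return(dbt.partition_range(var('dates', yesterday))) }}
{% endmacro %}

-- models/partitioned.sql
{{
    config(
        materialized='table',
        partitions=partitions_for_run(),
        verbose=True
    )
}}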

@cmcarthur
Member

lgtm!

@drewbanin drewbanin merged commit 4eb75ec into development Feb 12, 2018
@drewbanin drewbanin deleted the bq-date-partitioning branch February 12, 2018 21:10
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
* first cut of date partitioning

* cleanup, implement partitioning in materialization

* update requirements.txt

* wip for date partitioning with range

* log data

* arg handling, logging, cleanup + view compat for new bq version

* add partitioning tests, compatibility with bq 0.29.0 release

* pep8

* fix for strange error in appveyor

* debug appveyor...

* dumb

* debugging weird bq adapter use in pg test

* do not use read_project in bq tests

* cleanup connections, initialize bq tests

* remove debug lines

* fix integration tests (actually)

* warning for view creation which clobbers tables

* add query timeout example for bq

* no need to release connections in the adapter

* partition_date interface change (wip)

* list of dates for bq dp tables

* tiny fixes for crufty dbt_project.yml files

* rm debug line

* fix tests


automatic commit by git-black, original commits:
  4eb75ec
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022

automatic commit by git-black, original commits:
  4eb75ec
  a37374d