
Add support for BigQuery bulk export (to Avro, for now) #136

Merged: vinceatbluelabs merged 28 commits into master from bigquery_bulk_export on Dec 1, 2020

Conversation

@vinceatbluelabs (Contributor) commented on Dec 1, 2020:

This is just enough code to be able to move a table from BigQuery to BigQuery via a GCS bucket.

It turns out that BigQuery doesn't export via Parquet, so the easiest path was to add the first steps of Avro support to Records Mover, and to allow Avro on BigQuery import.

In addition to that functionality, it lays the groundwork for general BigQuery export, which will be useful for moving to other places like Redshift as well.

To get BigQuery->Redshift to work for medium-sized tables, we'll probably need one of:

  • Delimited support in BigQuery export (assuming Redshift and BigQuery share a delimited variant...)
  • Add Avro import support to Redshift

To get BigQuery->Redshift to work for large tables, we'll need some kind of GCS->S3 copying at scale. In typical vendor fashion, I think the GCP stuff I played around with in #132 is S3->GCS only. We might also need to teach BigQuery, Redshift and the Records spec about compressed Avro files, as well.
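For context, the bulk export itself boils down to a BigQuery extract job with Avro as the destination format. Below is a minimal sketch using the google-cloud-bigquery client directly; the table and bucket names are placeholders, and records_mover wraps this kind of call rather than exposing it verbatim.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers for illustration only.
source_table = "my_project.my_dataset.my_table"
destination_uri = "gs://my-bucket/exports/my_table-*.avro"

# Ask BigQuery to export the table to GCS as Avro.
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.AVRO

extract_job = client.extract_table(source_table,
                                   destination_uri,
                                   job_config=job_config)
extract_job.result()  # block until the export job finishes
```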

@@ -1 +1 @@
-1103
+1125
@vinceatbluelabs (Contributor, Author) commented:

389: tests/integration/*****
373: setup.py
364: records_mover/records/schema/field/__init__.py

Reduce total number of bigfiles violations to 1103 or below!

@@ -87,8 +87,6 @@ def load_from_fileobj(self, schema: str, table: str,
         # https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_file
         job = client.load_table_from_file(fileobj,
                                           f"{schema}.{table}",
-                                          # Must match the destination dataset location.
-                                          location="US",
@vinceatbluelabs (Contributor, Author) commented:

I'm not convinced this was ever needed. It's in the example code, but everything seems to run fine without it, so there must be some inference logic that figures out the dataset location from the client object.
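For reference, a minimal sketch of what the load call looks like without the explicit location argument; the file and table names here are placeholders, not from this PR.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO

# Placeholder file and table names.  Note there is no location= argument;
# the client appears to resolve the dataset location on its own.
with open("export.avro", "rb") as fileobj:
    job = client.load_table_from_file(fileobj,
                                      "my_dataset.my_table",
                                      job_config=job_config)
    job.result()  # block until the load job finishes
```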

@vinceatbluelabs vinceatbluelabs marked this pull request as ready for review December 1, 2020 16:05
# support any direct conversion from an Avro type into a
# DATETIME field."
#
return records_schema.convert_datetimes_to_string()
@vinceatbluelabs (Contributor, Author) commented:

This came up during the table2table integration test: BigQuery fails the load if you try to load the string it exports from a DATETIME column back into a DATETIME column.
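Roughly, the workaround is to downgrade DATETIME fields in the records schema to strings before the load. A sketch of the shape of that logic (the wrapper function name here is hypothetical; only convert_datetimes_to_string() comes from this PR):

```python
def adjust_schema_for_avro_load(records_schema):
    # Hypothetical wrapper for illustration.  BigQuery exports DATETIME
    # columns to Avro as plain strings, but refuses to load that same
    # string back into a DATETIME column, so downgrade datetime fields
    # to strings before loading Avro data.
    return records_schema.convert_datetimes_to_string()
```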

Another contributor replied:
😬

@cwegrzyn (Contributor) left a comment:

This looks great to me! Thanks for getting it in! :shipit:

@vinceatbluelabs vinceatbluelabs merged commit 28915f6 into master Dec 1, 2020
@vinceatbluelabs vinceatbluelabs deleted the bigquery_bulk_export branch December 1, 2020 17:01