Add support for BigQuery bulk export (to Avro, for now) #136
Conversation
@@ -1 +1 @@
-1103
+1125
389: tests/integration/*****
373: setup.py
364: records_mover/records/schema/field/__init__.py
Reduce total number of bigfiles violations to 1103 or below!
@@ -87,8 +87,6 @@ def load_from_fileobj(self, schema: str, table: str,
         # https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_file
         job = client.load_table_from_file(fileobj,
                                           f"{schema}.{table}",
-                                          # Must match the destination dataset location.
-                                          location="US",
I'm not convinced this was ever needed - it's in the example code, but everything seems to run fine without it, so there must be some inference logic to figure out the dataset location based on the client object.
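A minimal sketch of the call as it stands after this change (file name, dataset, and table are placeholders, and the job config is invented for illustration): the load runs without passing `location=`, letting the client work out the dataset's location on its own.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumption for illustration: a CSV file with a header row; the PR's actual
# job_config differs. The point is only that location= is omitted.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV,
                                    skip_leading_rows=1)

with open("data.csv", "rb") as fileobj:
    job = client.load_table_from_file(fileobj,
                                      "my_dataset.my_table",  # f"{schema}.{table}" in the PR
                                      job_config=job_config)
    job.result()  # waits for the load; raises on failure
```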
        # support any direct conversion from an Avro type into a
        # DATETIME field."
        #
        return records_schema.convert_datetimes_to_string()
This came up during the table2table integration test: BigQuery's Avro load fails if you try to load the very string it exports from a DATETIME column back into a DATETIME column.
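A hedged sketch of the workaround in isolation (project, table, column names, and GCS URI are all made up for illustration, not taken from the PR): since BigQuery's Avro loader won't populate a DATETIME column from the exported string, the destination column gets created as STRING instead.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Destination table with the former DATETIME column typed as STRING;
# loading BigQuery-exported Avro into a DATETIME column here would fail.
table = bigquery.Table(
    "my-project.my_dataset.target_table",
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("created_at", "STRING"),
    ],
)
client.create_table(table, exists_ok=True)

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://my-bucket/export-*.avro",   # placeholder URI for the exported files
    "my_dataset.target_table",
    job_config=job_config,
)
load_job.result()
```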
😬
This looks great to me! Thanks for getting it in!
This is just enough code to be able to move a table from BigQuery to BigQuery via a GCS bucket.
It turns out that BigQuery doesn't export via Parquet, so the easiest path was to add the first steps of Avro support to Records Mover and to allow Avro as an import format for BigQuery.
In addition to that functionality, it lays the groundwork for general BigQuery export, which will be useful for moving to other places like Redshift as well.
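As a rough sketch of the export half (bucket and table names are placeholders, not values from this PR), the BigQuery client can extract a table to GCS as Avro, which the load path above then reads back:

```python
from google.cloud import bigquery

client = bigquery.Client()

extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO)

extract_job = client.extract_table(
    "my_dataset.source_table",
    "gs://my-bucket/export-*.avro",   # wildcard lets BigQuery shard large tables
    job_config=extract_config,
)
extract_job.result()  # waits for the extract; raises on failure
```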
To get BigQuery->Redshift to work for medium-sized tables, we'll probably need one of:
To get BigQuery->Redshift to work for large tables, we'll need some kind of GCS->S3 copying at scale. In typical vendor fashion, I think the GCP stuff I played around with in #132 only goes S3->GCS. We might also need to teach BigQuery, Redshift, and the Records spec about compressed Avro files.
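For context on why the GCS->S3 step is the hard part, here's the naive single-object version (bucket and object names are invented); it streams everything through the machine running the copy, which is exactly what won't scale for large exports:

```python
import io

import boto3
from google.cloud import storage

gcs_client = storage.Client()
s3_client = boto3.client("s3")

# Pull one exported shard out of GCS into memory...
blob = gcs_client.bucket("my-gcs-bucket").blob("export-000000000000.avro")
buf = io.BytesIO()
blob.download_to_file(buf)
buf.seek(0)

# ...and push it up to S3. Workable for small and medium tables; a
# memory and bandwidth bottleneck for big ones.
s3_client.upload_fileobj(buf, "my-s3-bucket", "export-000000000000.avro")
```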