
Add support for BigQuery bulk export (to Avro, for now) #136

Merged: vinceatbluelabs merged 28 commits into master from bigquery_bulk_export on Dec 1, 2020

Conversation

@vinceatbluelabs (Contributor) commented on Dec 1, 2020:

This is just enough code to be able to move a table from BigQuery to BigQuery via a GCS bucket.

It turns out that BigQuery doesn't export via Parquet, so the easiest path was to add the first steps of Avro support to Records Mover, and to allow Avro on BigQuery import.

In addition to that functionality, it lays the groundwork for general BigQuery export, which will be useful for moving to other places like Redshift as well.

To get BigQuery->Redshift to work for medium-sized tables, we'll probably need one of:

  • Delimited support in BigQuery export (assuming Redshift and BigQuery share a delimited variant...)
  • Add Avro import support to Redshift

To get BigQuery->Redshift to work for large tables, we'll need some kind of GCS->S3 copying at scale. In typical vendor fashion, I think the GCP stuff I played around with in #132 is S3->GCS only. We might also need to teach BigQuery, Redshift and the Records spec about compressed Avro files, as well.
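For context, the bulk export itself boils down to a BigQuery extract job with Avro as the destination format. Below is a minimal sketch using the google-cloud-bigquery client directly; the table and bucket names are placeholders, and records_mover wraps this kind of call rather than exposing it verbatim.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder identifiers for illustration only.
source_table = "my_project.my_dataset.my_table"
destination_uri = "gs://my-bucket/exports/my_table-*.avro"

# Ask BigQuery to export the table to GCS as Avro.
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.AVRO

extract_job = client.extract_table(source_table,
                                   destination_uri,
                                   job_config=job_config)
extract_job.result()  # block until the export job finishes
```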

@@ -1 +1 @@
-1103
+1125
@vinceatbluelabs (Contributor, Author) commented:

389: tests/integration/*****
373: setup.py
364: records_mover/records/schema/field/__init__.py

Reduce total number of bigfiles violations to 1103 or below!

@@ -87,8 +87,6 @@ def load_from_fileobj(self, schema: str, table: str,
         # https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html#google.cloud.bigquery.client.Client.load_table_from_file
         job = client.load_table_from_file(fileobj,
                                           f"{schema}.{table}",
-                                          # Must match the destination dataset location.
-                                          location="US",
@vinceatbluelabs (Contributor, Author) commented:

I'm not convinced this was ever needed. It's in the example code, but everything seems to run fine without it, so there must be some inference logic that figures out the dataset location from the client object.
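For reference, a minimal sketch of what the load call looks like without the explicit location argument; the file and table names here are placeholders, not from this PR.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.AVRO

# Placeholder file and table names.  Note there is no location= argument;
# the client appears to resolve the dataset location on its own.
with open("export.avro", "rb") as fileobj:
    job = client.load_table_from_file(fileobj,
                                      "my_dataset.my_table",
                                      job_config=job_config)
    job.result()  # block until the load job finishes
```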

@vinceatbluelabs vinceatbluelabs marked this pull request as ready for review December 1, 2020 16:05
# support any direct conversion from an Avro type into a
# DATETIME field."
#
return records_schema.convert_datetimes_to_string()
@vinceatbluelabs (Contributor, Author) commented:

This came up during the table2table integration test: BigQuery fails the load if you try to load the string it exports from a DATETIME column back into a DATETIME column.
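Roughly, the workaround is to downgrade DATETIME fields in the records schema to strings before the load. A sketch of the shape of that logic (the wrapper function name here is hypothetical; only convert_datetimes_to_string() comes from this PR):

```python
def adjust_schema_for_avro_load(records_schema):
    # Hypothetical wrapper for illustration.  BigQuery exports DATETIME
    # columns to Avro as plain strings, but refuses to load that same
    # string back into a DATETIME column, so downgrade datetime fields
    # to strings before loading Avro data.
    return records_schema.convert_datetimes_to_string()
```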

Another contributor replied:
😬

@cwegrzyn (Contributor) left a comment:

This looks great to me! Thanks for getting it in! :shipit:

@vinceatbluelabs vinceatbluelabs merged commit 28915f6 into master Dec 1, 2020
@vinceatbluelabs vinceatbluelabs deleted the bigquery_bulk_export branch December 1, 2020 17:01