Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery import from GCS buckets #113

Merged
merged 165 commits into from
Nov 11, 2020
Merged

BigQuery import from GCS buckets #113

merged 165 commits into from
Nov 11, 2020

Conversation

vinceatbluelabs
Copy link
Contributor

@vinceatbluelabs vinceatbluelabs commented Oct 26, 2020

Right now we only have code to load data into BigQuery by streaming data over the network to it. This PR teaches it as well to be able to point to existing data in a GCS bucket.

This is important when data gets large - there are ways to move mass amount of data into a GCS bucket (e.g., GCP Cloud Data Transfer) that accomplish the goal much faster than trying to stream data down to your laptop and back up again. BigQuery is able to read from a bucket in parallel and over a much faster network in order to load data.

One major limitation remaining: all of the methods that Records Mover knows about to get data into that GCS bucket are inherently slow as they rely on streaming data down to Records Mover and then back up to the GCS bucket. In practice for large datasets (>100MB) folks will want to first get data into a GCS bucket and then use Records Mover.

Watch this space.

Depends on #120 and set to merge into it for clarity of diff.

@@ -152,7 +158,8 @@ def test_select_lastpass_session_by_config(self,
default_boto3_session=PleaseInfer.token,
default_gcp_creds=PleaseInfer.token,
default_gcs_client=PleaseInfer.token,
scratch_s3_url='s3://foo/')
scratch_s3_url='s3://foo/',
scratch_gcs_url='gs://bar/')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fiddler on the roof style singing TRADITION

Copy link

@crvena-sonja crvena-sonja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably shouldn't do "late nite reviewing" like this but here we are.

Overall looks good. A few comments on test function naming but it feels less useful in some of these instances so it might not even be worth making the changes. Totally defer to you.

@@ -135,6 +135,34 @@ def test_s3_scratch_bucket_via_prefix_assumed_role(self,
self.assertIsNone(out)
mock_get_config.assert_called_with('records_mover', 'bluelabs')

@patch('records_mover.creds.base_creds.get_config')
@patch('records_mover.creds.base_creds.os')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent

@@ -135,6 +135,34 @@ def test_s3_scratch_bucket_via_prefix_assumed_role(self,
self.assertIsNone(out)
mock_get_config.assert_called_with('records_mover', 'bluelabs')

@patch('records_mover.creds.base_creds.get_config')
@patch('records_mover.creds.base_creds.os')
def test_gcs_scratch_bucket_configured_true(self,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent

mock_get_config.assert_called_with('records_mover', 'bluelabs')

@patch('records_mover.creds.base_creds.get_config')
@patch('records_mover.creds.base_creds.os')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent


@patch('records_mover.creds.base_creds.get_config')
@patch('records_mover.creds.base_creds.os')
def test_gcs_scratch_bucket_not_configured_true(self,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent

@@ -163,6 +191,36 @@ def test_s3_scratch_bucket_no_config_file(self,
self.assertIsNone(out)
mock_get_config.assert_called_with('records_mover', 'bluelabs')

@patch('records_mover.creds.base_creds.get_config')
@patch('records_mover.creds.base_creds.os')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent

@@ -163,6 +191,36 @@ def test_s3_scratch_bucket_no_config_file(self,
self.assertIsNone(out)
mock_get_config.assert_called_with('records_mover', 'bluelabs')

@patch('records_mover.creds.base_creds.get_config')
@patch('records_mover.creds.base_creds.os')
def test_gcs_scratch_bucket_no_config_file_true(self,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent

self.assertIsNone(out)
mock_get_config.assert_called_with('records_mover', 'bluelabs')

@patch('records_mover.creds.base_creds.os')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent

out = creds.default_scratch_gcs_url()
self.assertIsNone(out)

@patch('records_mover.creds.base_creds.os')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E128 continuation line under-indented for visual indent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants