BigQuery import from GCS buckets #113
Conversation
It's not needed anymore, presumably due to changes in the move() algorithm.
@@ -152,7 +158,8 @@ def test_select_lastpass_session_by_config(self,
             default_boto3_session=PleaseInfer.token,
             default_gcp_creds=PleaseInfer.token,
             default_gcs_client=PleaseInfer.token,
-            scratch_s3_url='s3://foo/')
+            scratch_s3_url='s3://foo/',
+            scratch_gcs_url='gs://bar/')
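For context, the new keyword mirrors the existing scratch_s3_url setting. Here's a minimal sketch of how a caller might configure both scratch buckets, assuming the public records_mover.Session constructor accepts these keywords just as the test above does (the bucket URLs are the placeholders from the test):

from records_mover import Session

# Scratch buckets give Records Mover staging space for bulk loads.
# 's3://foo/' and 'gs://bar/' are placeholder URLs.
session = Session(scratch_s3_url='s3://foo/',
                  scratch_gcs_url='gs://bar/')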
fiddler on the roof style singing TRADITION
Probably shouldn't do "late nite reviewing" like this, but here we are.
Overall looks good. A few comments on test function naming, but they feel less useful in some of these instances, so it might not even be worth making the changes. Totally defer to you.
Co-authored-by: Sonja Duric <[email protected]>
@@ -135,6 +135,34 @@ def test_s3_scratch_bucket_via_prefix_assumed_role(self,
         self.assertIsNone(out)
         mock_get_config.assert_called_with('records_mover', 'bluelabs')

+    @patch('records_mover.creds.base_creds.get_config')
+    @patch('records_mover.creds.base_creds.os')
+    def test_gcs_scratch_bucket_configured_true(self,
E128 continuation line under-indented for visual indent
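For reference, flake8 raises E128 when a continuation line is not aligned with the opening parenthesis that established the visual indent. A sketch of the complaint and its fix, modeled on the test signature in the diff (the mock parameter names are illustrative, not from the PR):

from unittest.mock import patch

class ExampleTests:
    # Under-indented: flake8 flags E128 on the continuation lines.
    @patch('records_mover.creds.base_creds.get_config')
    @patch('records_mover.creds.base_creds.os')
    def test_gcs_scratch_bucket_configured_true(self,
            mock_os,
            mock_get_config):
        ...

    # Fixed: continuation lines align with the opening parenthesis.
    @patch('records_mover.creds.base_creds.get_config')
    @patch('records_mover.creds.base_creds.os')
    def test_gcs_scratch_bucket_configured_true_fixed(self,
                                                      mock_os,
                                                      mock_get_config):
        ...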
         mock_get_config.assert_called_with('records_mover', 'bluelabs')

+    @patch('records_mover.creds.base_creds.get_config')
+    @patch('records_mover.creds.base_creds.os')
+    def test_gcs_scratch_bucket_not_configured_true(self,
E128 continuation line under-indented for visual indent
@@ -163,6 +191,36 @@ def test_s3_scratch_bucket_no_config_file(self,
         self.assertIsNone(out)
         mock_get_config.assert_called_with('records_mover', 'bluelabs')

+    @patch('records_mover.creds.base_creds.get_config')
+    @patch('records_mover.creds.base_creds.os')
+    def test_gcs_scratch_bucket_no_config_file_true(self,
E128 continuation line under-indented for visual indent
         self.assertIsNone(out)
         mock_get_config.assert_called_with('records_mover', 'bluelabs')

+    @patch('records_mover.creds.base_creds.os')
E128 continuation line under-indented for visual indent
         out = creds.default_scratch_gcs_url()
         self.assertIsNone(out)

+    @patch('records_mover.creds.base_creds.os')
E128 continuation line under-indented for visual indent
Right now we only have code to load data into BigQuery by streaming it over the network. This PR teaches Records Mover to also point BigQuery at existing data in a GCS bucket.
This is important when data gets large: there are ways to move large amounts of data into a GCS bucket (e.g., GCP Cloud Data Transfer) that accomplish the goal much faster than streaming data down to your laptop and back up again. BigQuery can read from a bucket in parallel, over a much faster network, in order to load the data.
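Under the hood this relies on BigQuery's ability to run a load job against files already sitting in GCS. A sketch of that underlying call with the google-cloud-bigquery client (bucket, dataset, and table names are made up, and the job configuration Records Mover actually builds may differ):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
# BigQuery reads the bucket server side and in parallel; no bytes
# flow through the machine running this code.
load_job = client.load_table_from_uri(
    'gs://my-bucket/exports/*.csv',
    'my_project.my_dataset.my_table',
    job_config=job_config,
)
load_job.result()  # block until the load job completes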
One major limitation remains: all of the methods Records Mover knows for getting data into that GCS bucket are inherently slow, since they rely on streaming data down to Records Mover and then back up to the bucket. In practice, for large datasets (>100MB), folks will want to get their data into a GCS bucket first and then use Records Mover, as in the sketch below.
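Once the files are in the bucket, the move itself might look roughly like this; a hypothetical sketch based on the sources/targets API shape in the project README, with every name a placeholder (including the 'bigquery' credentials name):

from records_mover import Session

session = Session()
records = session.records

# Point at data already sitting in GCS...
source = records.sources.directory_from_url('gs://my-bucket/exports/')
# ...and load it into BigQuery without streaming it through this host.
target = records.targets.table(schema_name='my_dataset',
                               table_name='my_table',
                               db_engine=session.get_db_engine('bigquery'))
records.move(source, target)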
Watch this space.
Depends on #120 and is set to merge into it for clarity of the diff.