Load via INSERT on Redshift when scratch bucket not available #114
Conversation
records_mover/records/targets/table/move_from_dataframes_source.py
It's not needed anymore, presumably due to changes in the move() algorithm.
```diff
@@ -23,11 +23,19 @@ class RedshiftLoader(LoaderFromRecordsDirectory):
     def __init__(self,
                  db: Union[sqlalchemy.engine.Engine, sqlalchemy.engine.Connection],
                  meta: sqlalchemy.MetaData,
-                 temporary_s3_directory_loc: Callable[[], ContextManager[BaseDirectoryUrl]])\
+                 s3_temp_base_loc: Optional[BaseDirectoryUrl])\
```
I was doing a too-clever-by-half thing before: passing in a function that creates a temporary directory in the right place in S3.
Now I need to know things like "do I actually have a bucket in which to create the temporary directory?", so I'm passing in something lower-level (at the price of duplicating a three-line function between two classes with no inheritance relationship).
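For illustration, here's a minimal sketch of what that duplicated three-line helper could look like. This assumes `BaseDirectoryUrl` exposes a `temporary_directory()` context manager, which is my assumption, not something confirmed by this diff:

```python
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def temporary_s3_directory_loc(s3_temp_base_loc: Optional['BaseDirectoryUrl'])\
        -> Iterator['BaseDirectoryUrl']:
    # Hypothetical helper: fail loudly when no scratch bucket is configured,
    # otherwise hand out a temporary directory underneath it.
    if s3_temp_base_loc is None:
        raise NotImplementedError("Please configure a scratch S3 location")
    with s3_temp_base_loc.temporary_directory() as temp_loc:
        yield temp_loc
```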
```diff
@@ -47,6 +47,12 @@ def temporary_loadable_directory_loc(self) -> Iterator[BaseDirectoryUrl]:
         with TemporaryDirectory(prefix='temporary_loadable_directory_loc') as dirname:
```
Loader is an abstract class made concrete by each of the database types that can load out of a records directory.
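As a rough sketch (my reconstruction from the hunks above, not verbatim source; the `load()` signature in particular is a guess), the abstract class might expose both the temporary-directory context manager and a probe for whether one is actually available:

```python
from abc import ABCMeta, abstractmethod
from typing import ContextManager, Optional


class LoaderFromRecordsDirectory(metaclass=ABCMeta):
    @abstractmethod
    def load(self, schema: str, table: str,
             load_plan: 'RecordsLoadPlan',
             directory: 'RecordsDirectory') -> Optional[int]:
        """Load the records directory's contents into the given table."""
        ...

    @abstractmethod
    def temporary_loadable_directory_loc(self) -> ContextManager['BaseDirectoryUrl']:
        """Yield a directory this database can load from efficiently."""
        ...

    def has_temporary_loadable_directory_loc(self) -> bool:
        # Most loaders can offer one (e.g., a local temp directory),
        # so a permissive default seems plausible here.
        return True
```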
```diff
@@ -2,7 +2,7 @@
                       SupportsToFileobjsSource,
                       FileobjsSource, SupportsToDataframesSource)
 from .targets.base import (RecordsTarget, SupportsMoveFromRecordsDirectory,
-                           SupportsMoveFromTempLocAfterFillingIt,
+                           MightSupportMoveFromTempLocAfterFillingIt,
```
This is another one of those interfaces implemented by sources and targets (in this case, targets). I renamed this interface to be more tentative, as it can now tell you at runtime whether it is able to do what it says on the tin.
Specifically, this interface has a function which tells the target to (see the sketch after this list):
- create a temporary location that the target can load data from efficiently (e.g., an S3 bucket directory that Redshift can run a COPY statement from)
- tell the source to fill in that temporary location with a records directory (e.g., copy a CSV file from your local disk to that temporary S3 directory and add a manifest that Redshift likes)
- load from that temporary location (e.g., run the Redshift COPY command)
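A hedged sketch of the renamed interface: the predicate name comes from the diff below, while `move_from_temp_loc_after_filling_it` and its parameters are my guesses from the class name, not confirmed source.

```python
from abc import ABCMeta, abstractmethod


class MightSupportMoveFromTempLocAfterFillingIt(metaclass=ABCMeta):
    @abstractmethod
    def can_move_from_temp_loc_after_filling_it(self) -> bool:
        """Answer at runtime whether the three steps above can actually run."""
        ...

    @abstractmethod
    def move_from_temp_loc_after_filling_it(self,
                                            records_source,
                                            processing_instructions) -> 'MoveResult':
        """Create the temp location, have the source fill it, then load from it."""
        ...
```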
```diff
-              records_source.has_compatible_format(records_target)):
+              isinstance(records_target, MightSupportMoveFromTempLocAfterFillingIt) and
+              records_source.has_compatible_format(records_target) and
+              records_target.can_move_from_temp_loc_after_filling_it()):
```
This is a key bit: I'm changing the move() function (the central algorithm of Records Mover) to use this interface's function only if the interface says it's OK.
In the most common case of a Table target, that means the database in question has a temporary bucket location configured that it can do a bulk load from.
"This may be very slow.") | ||
return self.move_from_dataframes_source_via_insert() | ||
|
||
def move_from_dataframes_source_via_records_directory(self) -> MoveResult: | ||
def move_from_dataframes_source_via_temporary_records_directory(self) -> MoveResult: |
Clarify that this function uses a temp directory; the original name was a bit of a surprise to me as well.
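To make the resulting control flow concrete, here's a hedged sketch of the fallback. The class name and `self.records_target` are illustrative stand-ins; only the method names and warning text come from the hunks above.

```python
import logging

logger = logging.getLogger(__name__)


class MoveFromDataframesSource:  # hypothetical stand-in for the real target class
    def move(self) -> 'MoveResult':
        if self.records_target.can_move_from_temp_loc_after_filling_it():
            # Fast path: stage a records directory in the scratch location
            # and let the database bulk-load it.
            return self.move_from_dataframes_source_via_temporary_records_directory()
        logger.warning("Loading via INSERT statements, as no scratch bucket "
                       "is available. This may be very slow.")
        return self.move_from_dataframes_source_via_insert()
```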
```python
        loader = driver.loader()
        if loader is None:
            return False
        return loader.has_temporary_loadable_directory_loc()
```
The key implementation of the expanded interface.
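The table target delegates to the driver's loader here. On the loader side, my guess at the Redshift predicate it delegates to, given the new `Optional` constructor argument (a sketch, not verbatim source):

```python
class RedshiftLoader(LoaderFromRecordsDirectory):
    ...

    def has_temporary_loadable_directory_loc(self) -> bool:
        # Redshift bulk-loads (COPY) only from S3, so a temporary loadable
        # directory exists exactly when a scratch location was configured.
        return self.s3_temp_base_loc is not None
```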
As always, thank you for the comments. Very helpful. 👍🏼
This PR teaches Records Mover that it can use INSERT statements to import to Redshift when nothing else is possible. INSERT doesn't require AWS credentials or a temporary S3 bucket.
This is sort of useful on its own, but primarily it sets me up in the next PR (#113) to do something similar for BigQuery, so folks don't need to configure a Google Cloud Storage bucket just to be able to do the imports they're able to do today.
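For anyone curious what the INSERT path amounts to mechanically, here's an illustrative SQLAlchemy snippet (not code from this PR): a multi-row INSERT needs only database credentials, with no staging in object storage.

```python
import sqlalchemy


def load_rows_via_insert(engine: sqlalchemy.engine.Engine,
                         table: sqlalchemy.Table,
                         rows: list) -> None:
    # executemany-style INSERT: slow for large tables, but it works with
    # nothing more than a database connection.
    with engine.begin() as conn:
        conn.execute(table.insert(), rows)
```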