Do slow redshift unload via SELECT when bucket unload not available #117

Conversation

@vinceatbluelabs (Contributor) commented Nov 5, 2020

This allows Redshift to export data even when we don't have a scratch bucket to copy it to first.

The particular method is similar to the fix done for importing data (#114): adding some predicates to the Source/Target and the underlying DBDriver interfaces so that the move() algorithm can figure out, for a given URL scheme, whether it can do a bulk export or whether it should fall back to doing things via SELECT.
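For orientation, the shape of the negotiation looks roughly like this - a simplified sketch based on the diff hunks and tests below, not the PR's exact code:

from abc import ABC, abstractmethod
from typing import Optional

class Unloader(ABC):
    @abstractmethod
    def can_unload_to_scheme(self, scheme: str) -> bool:
        """True if this database can bulk-export to URLs of this scheme."""
        ...

class RedshiftUnloader(Unloader):
    def __init__(self, s3_temp_base_loc: Optional[str] = None) -> None:
        self.s3_temp_base_loc = s3_temp_base_loc

    def can_unload_to_scheme(self, scheme: str) -> bool:
        # UNLOAD writes straight to s3:// targets; any other scheme
        # needs a scratch bucket to stage the export in first.
        return scheme == 's3' or self.s3_temp_base_loc is not None

move() can then ask the source's unloader whether a bulk export is possible before committing to that path.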

Base automatically changed from do_slow_redshift_load_when_bucket_load_not_available to master November 6, 2020 13:49
@@ -1 +1 @@
93.6900
93.7300
@vinceatbluelabs (Contributor, Author) commented:

I tested the bulk of the new code added, and threw in some unrelated tests as well to get this number up.

@vinceatbluelabs changed the title from "Do slow redshift unload when bucket unload not available" to "Do slow redshift unload via SELECT when bucket unload not available" on Nov 6, 2020
@@ -42,6 +42,9 @@ def can_load_this_format(self, source_records_format: BaseRecordsFormat) -> bool
method"""
...

def temporary_loadable_directory_scheme(self) -> str:
return 'file'
@vinceatbluelabs (Contributor, Author) commented Nov 6, 2020:

As a refresher:

  • Scheme/URL scheme: the "s3" in s3://bucket/directory/file.csv
  • DBDriver abstract classes: a wrapper on top of SQLAlchemy's database dialects that describes how to load/unload data and also provides some extra information about things like the numeric ranges of types.
  • Loader/Unloader abstract classes: the interfaces extended for each database to show how to do bulk imports/exports.
  • source/target: sources and targets are things like files on disk, directories in S3, tables in databases, etc. - you move data with Records Mover from a source to a target.
  • Source/Target abstract classes: abstract interfaces that show what capabilities a given source or target has. The table source and table target of course use the DBDriver interface extensively to know what capabilities a specific database has.
  • move() algorithm: uses isinstance() checks against Source and Target objects to find out which particular Source/Target abstract classes are implemented, and from that figures out the fastest way to move data from a source to a target.

So this is an additional bit of info exported on the Loader interface: it tells us where a temporary directory would be created if we had to create one to load from. We'll see where that's used later on in the PR.
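So, for example, a hypothetical loader that could only bulk-load from S3 would override the default shown above:

def temporary_loadable_directory_scheme(self) -> str:
    # Hypothetical override: this loader stages its loads in S3
    # rather than on the local filesystem.
    return 's3'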

def can_unload_to_scheme(self, scheme: str) -> bool:
# Unloading is done via streams, so it is scheme-independent
# and requires no scratch buckets.
return True
@vinceatbluelabs (Contributor, Author) commented:

In one movement scenario, we ask a source database to export into a records directory (basically a pile of CSVs with a JSON file for metadata). This is how we can find out if the source database can export into the right kind of place for the target database (or whatever) to import.
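As a concrete picture, a records directory along those lines might look like this (file names are illustrative, not the exact convention):

s3://my-bucket/export/
    data_0000.csv
    data_0001.csv
    records_format.json   <- metadata describing the CSV dialect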

@@ -47,7 +47,7 @@ def unload(self,
table: str,
unload_plan: RecordsUnloadPlan,
directory: RecordsDirectory) -> Optional[int]:
if not self.s3_available():
if not self.s3_export_available():
@vinceatbluelabs (Contributor, Author) commented:

I'm separating out the concepts of "does the database have the right add-ons installed to export to S3" and "do we have a temporary S3 bucket available".

Even if we don't have a temporary S3 bucket available, we may be able to export straight to the destination S3 bucket!
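A sketch of that split, in the spirit of the diff above (simplified; not the PR's exact code):

class VerticaUnloader:
    def __init__(self, s3_temp_base_loc=None) -> None:
        self.s3_temp_base_loc = s3_temp_base_loc

    def s3_export_available(self) -> bool:
        """Does the database itself have the add-ons installed to
        export to S3 at all?"""
        ...

    def s3_available(self) -> bool:
        """...and do we also have a scratch S3 bucket to stage into?"""
        return self.s3_export_available() and self.s3_temp_base_loc is not None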

return [DelimitedRecordsFormat(variant='vertica')]
else:
return []
return [DelimitedRecordsFormat(variant='vertica')]
@vinceatbluelabs (Contributor, Author) commented:

This was an existing kludge to try to accomplish some of the "negotiation" happening with this PR. Now that we have it more fully baked at the move() algorithm level, we can tear down this hack, as it's no longer needed.

@@ -126,6 +127,8 @@ def move(records_source: RecordsSource,
elif (isinstance(records_source, SupportsMoveToRecordsDirectory) and
      isinstance(records_target, MightSupportMoveFromTempLocAfterFillingIt) and
      records_source.has_compatible_format(records_target) and
      records_source.can_move_to_scheme(
          records_target.temporary_loadable_directory_scheme()) and
@vinceatbluelabs (Contributor, Author) commented Nov 6, 2020:

These three added lines are the meat of this PR. They result in us not trying to do bulk exports when we'd need a temporary bucket and don't have one. As a result, we move on and try a less efficient approach rather than dying with an exception that asks the user to provide a temporary bucket.

The first clause is generally about database bulk exports - e.g., table2url, table2file, table2recordsdir, etc.

The second clause is used when the target isn't a file - e.g., table2table - we need a temporary place to dump all the CSV files, and need to make sure that the temporary place is somewhere that the source can currently export to.
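In miniature, the effect is this (the helper names here are hypothetical; the real move() has many more branches):

if (records_source.can_move_to_scheme(
        records_target.temporary_loadable_directory_scheme()) and
        records_target.can_move_from_temp_loc_after_filling_it()):
    # Fast path: bulk-export into a temporary directory, then bulk-load.
    move_via_temp_directory(records_source, records_target)
else:
    # Slow path: stream rows out via SELECT instead of raising an
    # exception asking the user to configure a scratch bucket.
    move_via_select(records_source, records_target)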

expensive when data is large and/or network bandwidth is
limited.
"""
pass
@vinceatbluelabs (Contributor, Author) commented:

This is one of the abstract classes that show what capabilities a Source has.
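For flavor, a capability class in this style looks roughly like this (the method name comes from the diff hunk above; the docstring is paraphrased):

from abc import ABCMeta, abstractmethod

class SupportsMoveToRecordsDirectory(metaclass=ABCMeta):
    @abstractmethod
    def can_move_to_scheme(self, scheme: str) -> bool:
        """Return True if this source can export to a directory with
        the given URL scheme - e.g., False if the export would need a
        scratch bucket that isn't configured."""
        ...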

loader = driver.loader()
if loader is None:
raise TypeError("Please check can_move_from_temp_loc_after_filling_it() "
"before calling this")
@vinceatbluelabs (Contributor, Author) commented:

This isn't ideal from a type-safety perspective, but I looked at a few alternatives and couldn't come up with a better approach without making the move() method much uglier.
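For context, one generic alternative (not necessarily one the author tried) is typing.cast(), which trades the runtime TypeError for an unchecked assumption:

from typing import cast

# cast() silences the type checker but performs no runtime check, so a
# None loader would surface later as a confusing AttributeError rather
# than this clear TypeError.
loader = cast(Loader, driver.loader())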

A reply:

With the understanding that this is not shown in the external API, and is mostly a signal to fix logic elsewhere, I see no reason to belabor this point.

with TemporaryDirectory(prefix='temporary_loadable_directory_loc') as dirname:
yield FilesystemDirectoryUrl(dirname)

def has_temporary_loadable_directory_loc(self) -> bool:
"""Returns True if a temporary directory can be provided by
temporary_loadable_directory_loc()"""
@vinceatbluelabs (Contributor, Author) commented Nov 6, 2020:

Be more consistent with documentation in this key class.
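Pieced together, the helper the excerpt above comes from reads roughly like this (decorator and signature inferred from the method name and the yield):

from contextlib import contextmanager
from tempfile import TemporaryDirectory
from typing import Iterator

@contextmanager
def temporary_loadable_directory_loc(self) -> Iterator['FilesystemDirectoryUrl']:
    # A local scratch directory that lives for the duration of the
    # with-block, wrapped in records-mover's directory-URL type.
    with TemporaryDirectory(prefix='temporary_loadable_directory_loc') as dirname:
        yield FilesystemDirectoryUrl(dirname)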

@vinceatbluelabs marked this pull request as ready for review November 6, 2020 15:03
s3_temp_base_loc=None)
self.assertTrue(redshift_unloader.can_unload_to_scheme('s3'))

def test_can_unload_to_scheme_file_with_temp_bucket_True(self):

@crvena-sonja commented:

The most pedantic comment, but we're here: True is capitalized in this function name but not in the others.

@crvena-sonja left a review:

One comment about capitalizing True in a function name, but otherwise LGTM.

s3_temp_base_loc=None)
self.assertFalse(redshift_unloader.can_unload_to_scheme('file'))

def test_can_unload_to_scheme_file_with_temp_bucket_true(self):

An inline comment flagged a flake8 warning:

F811 redefinition of unused 'test_can_unload_to_scheme_file_with_temp_bucket_true' from line 47
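F811 means the second definition silently shadows the first, so the earlier test never runs; the fix is simply a distinct name, along these lines (hypothetical rename):

def test_can_unload_to_scheme_file_without_temp_bucket_false(self):
    ...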

@vinceatbluelabs merged commit 1b33493 into master Nov 6, 2020
@vinceatbluelabs deleted the do_slow_redshift_unload_when_bucket_unload_not_available branch November 6, 2020 17:58