Comprehensive prerequisite checking for SQL captures #564

willdonnelly · 2023-03-03T20:50:04Z

Description:

This PR factors out various startup checks and initialization behavior that were previously done in an ad-hoc fashion (typically only in the Pull RPC) into a single place for each connector, and collectively names these checks and initialization "prerequisites". In addition, checking (and sometimes initialization) logic has been added for various other connector prerequisites that were not previously checked at all. These checks are now performed consistently as part of a Validate operation as well as Pull, which means that any setup errors can now be reported while the user is still on the "Create a Capture" page in the UI.

There are, broadly speaking, two sorts of connector deployment:

Sometimes users will be testing the connector out using a test database and will give it very broad credentials, potentially even the superuser account on the database. In this situation we would like to perform as much setup work as we possibly can automatically -- ideally a completely blank-slate database can automatically have CDC enabled, watermarks table created, and any other setup performed without the user needing to think about it.
But other times users will want to run the connector against their production database, generally with the most restrictive permissions possible. In this situation we want to verify at startup whether we have the required set of permissions and resources, and provide the user a nice summary of all the things they'll need to fix. Ideally they should be presented with these errors while still on the "Create a Capture" page in the UI, as it's pretty user-unfriendly to let them create a capture that instantly goes red and they may not even notice that.

I have attempted wherever possible to make the validation-and-setup routines flexible, such that if a prerequisite is already satisfied or an attempt to set it up ourselves succeeds then we're happy, and if both of those options fail then we report a concise and human-readable description of what's wrong.

There are so many potential combinations of database setup and user permissions that it wasn't really feasible to include automated tests of every permutation in our CI builds. However, I have manually tested the three most important scenarios (these being "prerequisite is met", "prerequisite is not met but we can fix that because we're the superuser", and "prerequisite is not met and we can't fix that because we're a restricted user") for every prerequisite check added in this PR.

Workflow steps:

At the start of a Validate or Pull operation the connector will now perform a series of sanity checks and report all errors encountered during these checks as a bulleted list. In general this should be a no-op for preexisting captures that currently work, and if that's not the case then that's probably a bug in the implementation of a particular check.

The most notable workflow change is that now in the "Create a Capture" UI the user may be presented with this list of errors if they need to fix their database configuration. This will hopefully be more intuitive to users than the current state of affairs, where they can successfully create a capture even if the database doesn't even have CDC enabled, but then the capture will immediately show up as failed in the UI and they'll have to dig into the details to find out why, and potentially repeat this multiple times as each setup issue in turn causes the capture to fail as the previous one gets fixed.

Documentation links affected:

None of our connector documentation is necessarily incorrect after these changes, but we may want to revisit the recommended database setup instructions as some parts may be unnecessary.

Notes for reviewers:

I have attempted to keep each commit reasonably focused on a single chunk of the overall work, so commit-by-commit is probably the way to read this PR.

This change is

This change gives the automated test suite two distinct user logins, one for 'control' operations like creating tables and inserting data, and the other for actually performing captures. This alone turned up an issue with our recommended setup instructions. It turns out a minimally-privileged user needs an additional `USAGE` permission grant in order to capture from a non-`public` schema. As part of the same work, the default setup of the CI test database has been altered a bit, and I added a bit in the README discussing how the test suite can be run against a local Supabase Postgres instance (spoiler: you use the exact same user-creation and permission-granting commands from `init-user-db.sh`). Also while I was refactoring this code, the whole "open control connection and perform database setup" thing has been removed from TestMain and is now done by a helper function on a per-test basis. This change is long overdue, because doing a bunch of shared setup in TestMain was getting pretty awkward as the test suite grows.

This makes MySQL test setup around control vs capture accounts consistent with the previous Postgres changes.

The SQL Server capture always had a decent separation between control/capture users in tests, but I'm tweaking it anyway so that some names and behavior more closely match the Postgres and MySQL test suites.

These are new methods which are called at the appropriate points during the Validate RPC and before starting a capture, and which are expected to verify various requirements of the target database, things like "CDC is Enabled" and "The tables exist and can be read by the capture user". Currently they're stubbed out with some TODOs for the various checks that ought to be performed for each database.

Also moves the automatic creation of the watermarks table, replication slot, and publication into those checks, rather than having them take place at various points partway through capture startup. Doing it this way means that we can attempt to autocreate them during validation (while the user is still setting up the capture in the UI) rather than having the first failure occur after they leave the capture setup UI and the capture just starts failing repeatedly. There's a new automated test which I've been using to exercise these checks, however it doesn't actually cover most of the failure cases because many of those require deliberately misconfigured database config or user permissions setups, and those are tricky to automate.

Previously this was done as part of the connect operation, but this is the more appropriate place for the check to be done.

This adds new checks that CDC is enabled and the agent is running, and moves the whole "create watermarks table" and "enable CDC for table" bits of work into the setup process from where they used to happen.

It is surprisingly difficult to get a proper UPSERT operation in Transact-SQL, but I think this ought to work.

williamhbaker

LGTM

williamhbaker · 2023-03-03T21:12:51Z

source-postgres/prerequisites.go

+	if err := db.conn.QueryRow(ctx, fmt.Sprintf(`SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = '%s' AND slot_type = 'logical';`, slotName)).Scan(&count); err != nil {
+		return fmt.Errorf("error querying replication slots: %w", err)
+	}
+	if count == 1 {


I'm assuming its not possible for this and other similar checks to be > 1?

That's correct, the checks where I've written == 1 are querying by unique columns so it should be impossible for the results to be > 1.

And if things are somehow so badly broken that we get multiple results for a particular replication slot name then it's probably better to fail anyway.

source-mysql/prerequisites.go

This change fixes some inconsistencies in how table name capitalization and quoting worked in tests. The actual capture logic is and has been correct (because we assume the `namespace` and `stream` properties of the resource spec are properly capitalized, which is the case for our discovery results, and preserve that all the way to the database queries using `"%s"` quoting on the name components, and then treat the lowercase-normalization behavior of JoinStreamID as a one-way function and never feed names from a streamID back into the database). However our *tests* were playing a bit fast and loose with the whole thing. This rectifies that by having CreateTable() generate a lowercase table name (since an all-lowercase name can be used quoted or unquoted without any change in behavior), and adding two new tests which will exercise capitalization and quoting behavior.

The result of this query should always contain one row with one column, but just in case that's not true we may as well report a cleaner error message.

For some reason the test suite works fine locally when running against the dockerized test Postgres instance, but when running in a CI build it fails because when user `flow_capture` tries to create the `public.flow_watermarks` table they get a permission denied error. Which is doubly weird because: 1. A newly created user should by default have permissions to interact with the `public` schema and create tables there. 2. It works when running locally, and both local and CI builds are using the same `source-postgres/docker-compose.yaml` and just running `docker compose up` on that, so they should have identical behavior. So let's see if explicitly granting `flow_capture` the `CREATE` and `USAGE` permissions on schema `public` fixes the inconsistency.

willdonnelly added 11 commits March 3, 2023 14:32

source-mysql: Clarify control vs capture user in tests

a20983e

This makes MySQL test setup around control vs capture accounts consistent with the previous Postgres changes.

source-sqlserver: Make tests consistent with Postgres

fc4ea3a

The SQL Server capture always had a decent separation between control/capture users in tests, but I'm tweaking it anyway so that some names and behavior more closely match the Postgres and MySQL test suites.

source-sqlserver: Fix a minor lint complaint

529b66f

sqlcapture: Downgrade some spammy log messages to Trace level

2dcca65

source-mysql: Implement prerequisite checks

be09dc4

source-mysql: Move binlog expiry sanity-check into prerequisites

7048bc8

Previously this was done as part of the connect operation, but this is the more appropriate place for the check to be done.

source-sqlserver: Implement prerequisite checks

a47431d

This adds new checks that CDC is enabled and the agent is running, and moves the whole "create watermarks table" and "enable CDC for table" bits of work into the setup process from where they used to happen.

source-{mysql,postgres,sqlserver}: Improved 'TestPrerequisites'

31a882d

willdonnelly requested review from mdibaiee, williamhbaker and a team March 3, 2023 20:50

willdonnelly changed the title ~~Wgd/sql capture prerequisites~~ Comprehensive prerequisite checking for SQL captures Mar 3, 2023

source-sqlserver: Fix watermarks table initialization

38cbf44

It is surprisingly difficult to get a proper UPSERT operation in Transact-SQL, but I think this ought to work.

willdonnelly force-pushed the wgd/sql-capture-prerequisites branch from 140f7ff to 38cbf44 Compare March 3, 2023 20:54

williamhbaker approved these changes Mar 3, 2023

View reviewed changes

willdonnelly added 5 commits March 9, 2023 17:04

tests/source-myql: Fix password for integration test

ce7e8a3

source-sqlserver: Fix 'MissingTable' test snapshot

ea67595

source-mysql: Tweak error reporting in an impossible edge case

6397a68

The result of this query should always contain one row with one column, but just in case that's not true we may as well report a cleaner error message.

willdonnelly merged commit 676e8f2 into main Mar 15, 2023

willdonnelly deleted the wgd/sql-capture-prerequisites branch March 15, 2023 19:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comprehensive prerequisite checking for SQL captures #564

Comprehensive prerequisite checking for SQL captures #564

willdonnelly commented Mar 3, 2023 •

edited by jgraettinger

Loading

williamhbaker left a comment

williamhbaker Mar 3, 2023

willdonnelly Mar 6, 2023

Comprehensive prerequisite checking for SQL captures #564

Comprehensive prerequisite checking for SQL captures #564

Conversation

willdonnelly commented Mar 3, 2023 • edited by jgraettinger Loading

williamhbaker left a comment

Choose a reason for hiding this comment

williamhbaker Mar 3, 2023

Choose a reason for hiding this comment

willdonnelly Mar 6, 2023

Choose a reason for hiding this comment

willdonnelly commented Mar 3, 2023 •

edited by jgraettinger

Loading