Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comprehensive prerequisite checking for SQL captures #564

Merged
merged 17 commits into from
Mar 15, 2023

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Mar 3, 2023

Description:

This PR factors out various startup checks and initialization behavior that were previously done in an ad-hoc fashion (typically only in the Pull RPC) into a single place for each connector, and collectively names these checks and initialization "prerequisites". In addition, checking (and sometimes initialization) logic has been added for various other connector prerequisites that were not previously checked at all. These checks are now performed consistently as part of a Validate operation as well as Pull, which means that any setup errors can now be reported while the user is still on the "Create a Capture" page in the UI.

There are, broadly speaking, two sorts of connector deployment:

  • Sometimes users will be testing the connector out using a test database and will give it very broad credentials, potentially even the superuser account on the database. In this situation we would like to perform as much setup work as we possibly can automatically -- ideally a completely blank-slate database can automatically have CDC enabled, watermarks table created, and any other setup performed without the user needing to think about it.
  • But other times users will want to run the connector against their production database, generally with the most restrictive permissions possible. In this situation we want to verify at startup whether we have the required set of permissions and resources, and provide the user a nice summary of all the things they'll need to fix. Ideally they should be presented with these errors while still on the "Create a Capture" page in the UI, as it's pretty user-unfriendly to let them create a capture that instantly goes red and they may not even notice that.

I have attempted wherever possible to make the validation-and-setup routines flexible, such that if a prerequisite is already satisfied or an attempt to set it up ourselves succeeds then we're happy, and if both of those options fail then we report a concise and human-readable description of what's wrong.

There are so many potential combinations of database setup and user permissions that it wasn't really feasible to include automated tests of every permutation in our CI builds. However, I have manually tested the three most important scenarios (these being "prerequisite is met", "prerequisite is not met but we can fix that because we're the superuser", and "prerequisite is not met and we can't fix that because we're a restricted user") for every prerequisite check added in this PR.

Workflow steps:

At the start of a Validate or Pull operation the connector will now perform a series of sanity checks and report all errors encountered during these checks as a bulleted list. In general this should be a no-op for preexisting captures that currently work, and if that's not the case then that's probably a bug in the implementation of a particular check.

The most notable workflow change is that now in the "Create a Capture" UI the user may be presented with this list of errors if they need to fix their database configuration. This will hopefully be more intuitive to users than the current state of affairs, where they can successfully create a capture even if the database doesn't even have CDC enabled, but then the capture will immediately show up as failed in the UI and they'll have to dig into the details to find out why, and potentially repeat this multiple times as each setup issue in turn causes the capture to fail as the previous one gets fixed.

Documentation links affected:

None of our connector documentation is necessarily incorrect after these changes, but we may want to revisit the recommended database setup instructions as some parts may be unnecessary.

Notes for reviewers:

I have attempted to keep each commit reasonably focused on a single chunk of the overall work, so commit-by-commit is probably the way to read this PR.


This change is Reviewable

This change gives the automated test suite two distinct user
logins, one for 'control' operations like creating tables and
inserting data, and the other for actually performing captures.

This alone turned up an issue with our recommended setup instructions.
It turns out a minimally-privileged user needs an additional `USAGE`
permission grant in order to capture from a non-`public` schema.

As part of the same work, the default setup of the CI test database
has been altered a bit, and I added a bit in the README discussing
how the test suite can be run against a local Supabase Postgres
instance (spoiler: you use the exact same user-creation and
permission-granting commands from `init-user-db.sh`).

Also while I was refactoring this code, the whole "open control
connection and perform database setup" thing has been removed from
TestMain and is now done by a helper function on a per-test basis.
This change is long overdue, because doing a bunch of shared setup
in TestMain was getting pretty awkward as the test suite grows.
This makes MySQL test setup around control vs capture accounts
consistent with the previous Postgres changes.
The SQL Server capture always had a decent separation between
control/capture users in tests, but I'm tweaking it anyway so
that some names and behavior more closely match the Postgres
and MySQL test suites.
These are new methods which are called at the appropriate points
during the Validate RPC and before starting a capture, and which
are expected to verify various requirements of the target database,
things like "CDC is Enabled" and "The tables exist and can be read
by the capture user".

Currently they're stubbed out with some TODOs for the various
checks that ought to be performed for each database.
Also moves the automatic creation of the watermarks table,
replication slot, and publication into those checks, rather
than having them take place at various points partway through
capture startup. Doing it this way means that we can attempt
to autocreate them during validation (while the user is still
setting up the capture in the UI) rather than having the first
failure occur after they leave the capture setup UI and the
capture just starts failing repeatedly.

There's a new automated test which I've been using to exercise
these checks, however it doesn't actually cover most of the
failure cases because many of those require deliberately
misconfigured database config or user permissions setups,
and those are tricky to automate.
Previously this was done as part of the connect operation, but
this is the more appropriate place for the check to be done.
This adds new checks that CDC is enabled and the agent is running,
and moves the whole "create watermarks table" and "enable CDC for
table" bits of work into the setup process from where they used
to happen.
@willdonnelly willdonnelly requested review from mdibaiee, williamhbaker and a team March 3, 2023 20:50
@willdonnelly willdonnelly changed the title Wgd/sql capture prerequisites Comprehensive prerequisite checking for SQL captures Mar 3, 2023
It is surprisingly difficult to get a proper UPSERT operation
in Transact-SQL, but I think this ought to work.
@willdonnelly willdonnelly force-pushed the wgd/sql-capture-prerequisites branch from 140f7ff to 38cbf44 Compare March 3, 2023 20:54
Copy link
Member

@williamhbaker williamhbaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

if err := db.conn.QueryRow(ctx, fmt.Sprintf(`SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = '%s' AND slot_type = 'logical';`, slotName)).Scan(&count); err != nil {
return fmt.Errorf("error querying replication slots: %w", err)
}
if count == 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming its not possible for this and other similar checks to be > 1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, the checks where I've written == 1 are querying by unique columns so it should be impossible for the results to be > 1.

And if things are somehow so badly broken that we get multiple results for a particular replication slot name then it's probably better to fail anyway.

source-mysql/prerequisites.go Show resolved Hide resolved
source-mysql/prerequisites.go Show resolved Hide resolved
This change fixes some inconsistencies in how table name capitalization
and quoting worked in tests.

The actual capture logic is and has been correct (because we assume the
`namespace` and `stream` properties of the resource spec are properly
capitalized, which is the case for our discovery results, and preserve
that all the way to the database queries using `"%s"` quoting on the
name components, and then treat the lowercase-normalization behavior
of JoinStreamID as a one-way function and never feed names from a
streamID back into the database).

However our *tests* were playing a bit fast and loose with the whole
thing. This rectifies that by having CreateTable() generate a lowercase
table name (since an all-lowercase name can be used quoted or unquoted
without any change in behavior), and adding two new tests which will
exercise capitalization and quoting behavior.
The result of this query should always contain one row with one
column, but just in case that's not true we may as well report a
cleaner error message.
For some reason the test suite works fine locally when running
against the dockerized test Postgres instance, but when running
in a CI build it fails because when user `flow_capture` tries to
create the `public.flow_watermarks` table they get a permission
denied error.

Which is doubly weird because:

  1. A newly created user should by default have permissions to
     interact with the `public` schema and create tables there.
  2. It works when running locally, and both local and CI builds
     are using the same `source-postgres/docker-compose.yaml` and
     just running `docker compose up` on that, so they should have
     identical behavior.

So let's see if explicitly granting `flow_capture` the `CREATE`
and `USAGE` permissions on schema `public` fixes the inconsistency.
@willdonnelly willdonnelly merged commit 676e8f2 into main Mar 15, 2023
@willdonnelly willdonnelly deleted the wgd/sql-capture-prerequisites branch March 15, 2023 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants