-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comprehensive prerequisite checking for SQL captures #564
Conversation
This change gives the automated test suite two distinct user logins, one for 'control' operations like creating tables and inserting data, and the other for actually performing captures. This alone turned up an issue with our recommended setup instructions. It turns out a minimally-privileged user needs an additional `USAGE` permission grant in order to capture from a non-`public` schema. As part of the same work, the default setup of the CI test database has been altered a bit, and I added a bit in the README discussing how the test suite can be run against a local Supabase Postgres instance (spoiler: you use the exact same user-creation and permission-granting commands from `init-user-db.sh`). Also while I was refactoring this code, the whole "open control connection and perform database setup" thing has been removed from TestMain and is now done by a helper function on a per-test basis. This change is long overdue, because doing a bunch of shared setup in TestMain was getting pretty awkward as the test suite grows.
This makes MySQL test setup around control vs capture accounts consistent with the previous Postgres changes.
The SQL Server capture always had a decent separation between control/capture users in tests, but I'm tweaking it anyway so that some names and behavior more closely match the Postgres and MySQL test suites.
These are new methods which are called at the appropriate points during the Validate RPC and before starting a capture, and which are expected to verify various requirements of the target database, things like "CDC is Enabled" and "The tables exist and can be read by the capture user". Currently they're stubbed out with some TODOs for the various checks that ought to be performed for each database.
Also moves the automatic creation of the watermarks table, replication slot, and publication into those checks, rather than having them take place at various points partway through capture startup. Doing it this way means that we can attempt to autocreate them during validation (while the user is still setting up the capture in the UI) rather than having the first failure occur after they leave the capture setup UI and the capture just starts failing repeatedly. There's a new automated test which I've been using to exercise these checks, however it doesn't actually cover most of the failure cases because many of those require deliberately misconfigured database config or user permissions setups, and those are tricky to automate.
Previously this was done as part of the connect operation, but this is the more appropriate place for the check to be done.
This adds new checks that CDC is enabled and the agent is running, and moves the whole "create watermarks table" and "enable CDC for table" bits of work into the setup process from where they used to happen.
It is surprisingly difficult to get a proper UPSERT operation in Transact-SQL, but I think this ought to work.
140f7ff
to
38cbf44
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
if err := db.conn.QueryRow(ctx, fmt.Sprintf(`SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = '%s' AND slot_type = 'logical';`, slotName)).Scan(&count); err != nil { | ||
return fmt.Errorf("error querying replication slots: %w", err) | ||
} | ||
if count == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming its not possible for this and other similar checks to be > 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, the checks where I've written == 1
are querying by unique columns so it should be impossible for the results to be > 1.
And if things are somehow so badly broken that we get multiple results for a particular replication slot name then it's probably better to fail anyway.
This change fixes some inconsistencies in how table name capitalization and quoting worked in tests. The actual capture logic is and has been correct (because we assume the `namespace` and `stream` properties of the resource spec are properly capitalized, which is the case for our discovery results, and preserve that all the way to the database queries using `"%s"` quoting on the name components, and then treat the lowercase-normalization behavior of JoinStreamID as a one-way function and never feed names from a streamID back into the database). However our *tests* were playing a bit fast and loose with the whole thing. This rectifies that by having CreateTable() generate a lowercase table name (since an all-lowercase name can be used quoted or unquoted without any change in behavior), and adding two new tests which will exercise capitalization and quoting behavior.
The result of this query should always contain one row with one column, but just in case that's not true we may as well report a cleaner error message.
For some reason the test suite works fine locally when running against the dockerized test Postgres instance, but when running in a CI build it fails because when user `flow_capture` tries to create the `public.flow_watermarks` table they get a permission denied error. Which is doubly weird because: 1. A newly created user should by default have permissions to interact with the `public` schema and create tables there. 2. It works when running locally, and both local and CI builds are using the same `source-postgres/docker-compose.yaml` and just running `docker compose up` on that, so they should have identical behavior. So let's see if explicitly granting `flow_capture` the `CREATE` and `USAGE` permissions on schema `public` fixes the inconsistency.
Description:
This PR factors out various startup checks and initialization behavior that were previously done in an ad-hoc fashion (typically only in the
Pull
RPC) into a single place for each connector, and collectively names these checks and initialization "prerequisites". In addition, checking (and sometimes initialization) logic has been added for various other connector prerequisites that were not previously checked at all. These checks are now performed consistently as part of aValidate
operation as well asPull
, which means that any setup errors can now be reported while the user is still on the "Create a Capture" page in the UI.There are, broadly speaking, two sorts of connector deployment:
I have attempted wherever possible to make the validation-and-setup routines flexible, such that if a prerequisite is already satisfied or an attempt to set it up ourselves succeeds then we're happy, and if both of those options fail then we report a concise and human-readable description of what's wrong.
There are so many potential combinations of database setup and user permissions that it wasn't really feasible to include automated tests of every permutation in our CI builds. However, I have manually tested the three most important scenarios (these being "prerequisite is met", "prerequisite is not met but we can fix that because we're the superuser", and "prerequisite is not met and we can't fix that because we're a restricted user") for every prerequisite check added in this PR.
Workflow steps:
At the start of a
Validate
orPull
operation the connector will now perform a series of sanity checks and report all errors encountered during these checks as a bulleted list. In general this should be a no-op for preexisting captures that currently work, and if that's not the case then that's probably a bug in the implementation of a particular check.The most notable workflow change is that now in the "Create a Capture" UI the user may be presented with this list of errors if they need to fix their database configuration. This will hopefully be more intuitive to users than the current state of affairs, where they can successfully create a capture even if the database doesn't even have CDC enabled, but then the capture will immediately show up as failed in the UI and they'll have to dig into the details to find out why, and potentially repeat this multiple times as each setup issue in turn causes the capture to fail as the previous one gets fixed.
Documentation links affected:
None of our connector documentation is necessarily incorrect after these changes, but we may want to revisit the recommended database setup instructions as some parts may be unnecessary.
Notes for reviewers:
I have attempted to keep each commit reasonably focused on a single chunk of the overall work, so commit-by-commit is probably the way to read this PR.
This change is