-
Notifications
You must be signed in to change notification settings - Fork 696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DDL failure testing #2212
DDL failure testing #2212
Conversation
8eea61c
to
9133912
Compare
17769ef
to
a8bf960
Compare
3121431
to
6c01440
Compare
a8bf960
to
2795a93
Compare
Codecov Report
@@ Coverage Diff @@
## mitmproxy-failure-testing #2212 +/- ##
=============================================================
+ Coverage 93.7% 93.84% +0.13%
=============================================================
Files 110 110
Lines 27453 27453
=============================================================
+ Hits 25726 25762 +36
+ Misses 1727 1691 -36 |
6c01440
to
3e309e3
Compare
2795a93
to
b84f481
Compare
2 | ||
(1 row) | ||
|
||
-- cance; as soon as the coordinator sends worker_apply_shard_ddl_command |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cance;
-> cancel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there, sorry for taking forever to get to this! It mostly looks great, I've pointed out a couple typos and just one mistake: You should have used ^COMMIT
instead of COMMIT
.
I think the WARNINGs are a real behavior which should be tested for when possible, it's important the user know that something went wrong, but I understand wanting to hide things like prepared transaction ids.
Marco knows better than I re how much duplication of tests for our transaction management makes sense. I think this file is great, but now that we have it the other tests probably don't need to add their own tests for transaction management.
@@ -0,0 +1,1228 @@ | |||
-- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, this is very comprehensive!
DETAIL: server closed the connection unexpectedly | ||
This probably means the server terminated abnormally | ||
before or while processing the request. | ||
SELECT citus.mitmproxy('conn.allow()'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't really need conn.allow()
if you're not going to make any connections to the workers while you setup for the next test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused with this. There is one more ALTER TABLE
command below. Don't we need conn.allow()
before doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From failure_ddl.sql
:
-- in the first test, kill just in the first
-- response we get from the worker
SELECT citus.mitmproxy('conn.onAuthenticationOk().kill()');
ALTER TABLE test_table ADD COLUMN new_column INT;
SELECT citus.mitmproxy('conn.allow()'); -- the conn.allow() in question
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass; -- this happens locally
-- cancel just in the first
-- response we get from the worker
SELECT citus.mitmproxy('conn.onAuthenticationOk().cancel(' || pg_backend_pid() || ')'); -- reconfigure
ALTER TABLE test_table ADD COLUMN new_column INT; -- runs under the new rules
SELECT citus.mitmproxy('conn.allow()');
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass;
There is another ALTER TABLE
, but that doesn't happen until we run citus.mitmproxy
again, the conn.allow()
never has a change to be used!
This isn't very important, if you think it makes things more clear feel free to leave it in, just wanted to point this out.
|
||
(1 row) | ||
|
||
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd never seen table_attrs
before, this is nice!
It took me a few minutes to figure out what was happening though, it might be more understandable to write:
SELECT array_agg(name::text) FROM public.table_attrs where relid = 'test_table'::regclass;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd never seen table_attrs before, this is nice!
Just FYI, this is Citus' test helper function see here. I think the main motivation to add that was Postgres' ouput changes between different versions and with this views we could skip adding _0.
files for such minor differences.
it might be more understandable to write:
Makes sense
(1 row) | ||
|
||
-- kill as soon as the coordinator sends COMMIT | ||
SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").kill()'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Careful! This doesn't test what you think it does. COMMIT
is a regex which also matches "BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED". I think what you want is ^COMMIT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use ^COMMIT
, then you'll get some shards with the new metadata and some with the old metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ops, I've missed/forgot that. I'll fix all the occurrences in the file. That's too bad for me that we've missed various cases with this bug in the test :/
(4 rows) | ||
|
||
-- cancel as soon as the coordinator sends COMMIT | ||
SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").cancel(' || pg_backend_pid() || ')'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here, should be ^COMMIT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I change it to ^COMMIT
the query isn't cancelled which... I guess that's probably okay? I suppose we should at least still have a test for it, to document that you can't cancel after that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, during COMMIT/ROLLBACK
events the interrupts are held, so it is expected. I'll add the following comment:
-- interrupts are held during COMMIT/ROLLBACK, so the command
-- should have been applied without any issues since cancel is ignored
(localhost,57640,100803,t,2) | ||
(4 rows) | ||
|
||
-- we should be able to revocer the transaction and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"revocer" -> "recover"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While following along by doing it by hand I accidentally noticed that recover_prepared_transactions() probably has a bug. If I leave in the part which kills connections on ^COMMIT
:
brian=# SELECT recover_prepared_transactions();
WARNING: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
WARNING: connection not open
recover_prepared_transactions
-------------------------------
0
(1 row)
It returns the number of transactions which it was able to recover, but doesn't give any indication that there are still transactions left to recover!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I remember this. But, aren't the WARNING
s important indication for the necessity to re-run the recovery?
I remember during the PR for recover_prepared_transactions ()
we decided to warn the user about the connection errors, but, do not fail the recovery at all since we could potentially recover prepared transactions from the other workers.
Do you see any other approaches for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not error out here? Recovery has failed! Warnings are something you can ignore, here it feels like we're suppressing an important failure. Until recover_prepared_transactions()
runs all your queries on this table are going to block and you won't know why. If we want to continue and try to recover prepared transactions from the other workers, we can still try to do that before we raise the ERROR
.
My complaints aren't with this PR though, I guess I should open a ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to continue and try to recover prepared transactions from the other workers, we can still try to do that before we raise the ERROR.
Yes, that would be a nice improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just note that we've got the exact same approach with deadlock detection where we warn the user if we cannot connect one of the nodes. We should probably improve both!
(localhost,57640,100807,t,2) | ||
(8 rows) | ||
|
||
-- we should be able to revocer the transaction and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"revocer" -> "recover"
(8 rows) | ||
|
||
-- we should be able to revocer the transaction and | ||
-- see that the command is rollbacked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see that what command is rollbacked? Above all shards have two columns, and below they also have two columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- There was an extra
SELECT run_command_...()
call, which I've removed - Expanded the comment to
-- we should be able to recover the transaction and
-- see that the command is rollbacked on all workers
-- note that in this case recover_prepared_transactions()
-- sends ROLLBACK PREPARED to the workers given that
-- the transaction has not been commited on any placement yet
-- finally, test failing on ROLLBACK with 2CPC | ||
-- fail just after the coordinator sends the ROLLBACK | ||
-- so the command can be rollbacked | ||
SELECT citus.mitmproxy('conn.onQuery(query="ROLLBACK").kill()'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this is not a particularly useful test. We're suppressing WARNING so we're not really resting any of the coordinator's behavior. And by killing connections all we're exercising is the Postgres behavior: automatically rolling back when the backend ends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, these are all to make sure that we won't break anything in the future. We've had various bugs while doing ROLLBACK
. For example, we're still exercising that Citus can send the ROLLBACK to the other workers without any issues. Or, make sure that rollback handler in Citus can gracefully handle the failures during ROLLBACK
when there is 2PC.
I see this test file as the main one for exercising almost all possible failure scenarious on transaction management. So, it wouldn't hurt to have some not very useful tests as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, good points.
(1 row) | ||
|
||
SET search_path TO 'public'; | ||
DROP SCHEMA ddl_failure CASCADE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should also drop test_table
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DROP SCHEMA
cascades to tables created in that schema. I think might be confused because the log level is error and we cannot see the cascade notice messages 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And another good point, this is a really nice idea, we can create whatever we want and not keep track of it.
This commit adds an extensive failure testing, which covers quite a bit of things and their combinations: - 1PC vs 2PC - Replication factor 1 and Replication factor 2 - Network failures and query cancellations - Sequential vs Parallel query execution mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've applied your feedback. I think the minor points we think slightly differently:
- The existence of some tests that are not very useful
- Log level being
ERROR
, which I also wish that we've added your commit to enable consistent outputs in the framework PR
I tried to reply those in the comments, feel free to discuss more
|
||
(1 row) | ||
|
||
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd never seen table_attrs before, this is nice!
Just FYI, this is Citus' test helper function see here. I think the main motivation to add that was Postgres' ouput changes between different versions and with this views we could skip adding _0.
files for such minor differences.
it might be more understandable to write:
Makes sense
(1 row) | ||
|
||
SET search_path TO 'public'; | ||
DROP SCHEMA ddl_failure CASCADE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DROP SCHEMA
cascades to tables created in that schema. I think might be confused because the log level is error and we cannot see the cascade notice messages 😄
DETAIL: server closed the connection unexpectedly | ||
This probably means the server terminated abnormally | ||
before or while processing the request. | ||
SELECT citus.mitmproxy('conn.allow()'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused with this. There is one more ALTER TABLE
command below. Don't we need conn.allow()
before doing that?
2 | ||
(1 row) | ||
|
||
SELECT run_command_on_placements('test_table', $$SELECT count(*) FROM public.table_attrs where relid = '%s'::regclass$$) ORDER BY 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I converted them to $$SELECT array_agg(name::text ORDER BY name::text) FROM public.table_attrs where relid = '%s'::regclass;$$
as well for ease of read.
|
||
-- but now kill just after the worker sends response to | ||
-- ROLLBACK command, so we'll have lots of warnings but the command | ||
-- should have been rollbacked both on the distributed table and the placements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to keep both them in case we introduce a subtle bug that leads to crashes or does something wrong at some point?
(4 rows) | ||
|
||
-- cancel as soon as the coordinator sends COMMIT | ||
SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").cancel(' || pg_backend_pid() || ')'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, during COMMIT/ROLLBACK
events the interrupts are held, so it is expected. I'll add the following comment:
-- interrupts are held during COMMIT/ROLLBACK, so the command
-- should have been applied without any issues since cancel is ignored
|
||
(1 row) | ||
|
||
ALTER TABLE test_table ADD COLUMN new_column INT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
|
||
BEGIN; | ||
ALTER TABLE test_table DROP COLUMN new_column; | ||
ROLLBACK; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense these tests would end up with consistent test outputs.
(4 rows) | ||
|
||
-- killing on command complete of COMMIT PREPARE, we should see that the command succeeds | ||
-- and all the workers committed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I understand your point of having prepared transactions with consistent output better. It sounds a little bit late to have that. But, we could open an issue to track it in the future?
(localhost,57640,100803,t,2) | ||
(4 rows) | ||
|
||
-- we should be able to revocer the transaction and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I remember this. But, aren't the WARNING
s important indication for the necessity to re-run the recovery?
I remember during the PR for recover_prepared_transactions ()
we decided to warn the user about the connection errors, but, do not fail the recovery at all since we could potentially recover prepared transactions from the other workers.
Do you see any other approaches for that?
b84f481
to
a446e71
Compare
Some views are not cleaned up \d |
Those are added via Another note from above:
|
This commit adds an extensive failure testing, which covers quite
a bit of thing and their combinations:
I don't think we need to add that many tests for all types of queries. This PR seems to cover a lot of cases in the connection and transaction management.
Missing item:
mitmproxy-failure-testing
branch to the master