DDL failure testing #2212

onderkalaci · 2018-06-12T12:47:05Z

This commit adds extensive failure testing, which covers quite
a few things and their combinations:

1PC vs 2PC
Replication factor 1 and Replication factor 2
Network failures and query cancellations
Parallel vs Sequential mode

I don't think we need to add that many tests for all types of queries. This PR seems to cover a lot of cases in the connection and transaction management.

Missing item:

Add tests for sequential mode once @lithp rebases the mitmproxy-failure-testing branch onto master

codecov · 2018-07-02T06:57:13Z

Codecov Report

Merging #2212 into mitmproxy-failure-testing will increase coverage by 0.13%.
The diff coverage is n/a.

@@                      Coverage Diff                      @@
##           mitmproxy-failure-testing    #2212      +/-   ##
=============================================================
+ Coverage                       93.7%   93.84%   +0.13%     
=============================================================
  Files                            110      110              
  Lines                          27453    27453              
=============================================================
+ Hits                           25726    25762      +36     
+ Misses                          1727     1691      -36

furkansahin · 2018-07-10T10:13:20Z

src/test/regress/expected/failure_ddl.out

+     2
+(1 row)
+
+-- cance; as soon as the coordinator sends worker_apply_shard_ddl_command


cance; -> cancel

lithp

Hey there, sorry for taking forever to get to this! It mostly looks great, I've pointed out a couple typos and just one mistake: You should have used ^COMMIT instead of COMMIT.

I think the WARNINGs are a real behavior which should be tested for when possible, it's important the user know that something went wrong, but I understand wanting to hide things like prepared transaction ids.

Marco knows better than I re how much duplication of tests for our transaction management makes sense. I think this file is great, but now that we have it the other tests probably don't need to add their own tests for transaction management.

lithp · 2018-07-11T20:20:14Z

src/test/regress/expected/failure_ddl.out

@@ -0,0 +1,1228 @@
+-- 


Wow, this is very comprehensive!

lithp · 2018-07-11T20:23:37Z

src/test/regress/expected/failure_ddl.out

+DETAIL:  server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+SELECT citus.mitmproxy('conn.allow()');


You don't really need conn.allow() if you're not going to make any connections to the workers while you setup for the next test.

I'm a bit confused with this. There is one more ALTER TABLE command below. Don't we need conn.allow() before doing that?

From failure_ddl.sql:

-- in the first test, kill just in the first
-- response we get from the worker
SELECT citus.mitmproxy('conn.onAuthenticationOk().kill()');
ALTER TABLE test_table ADD COLUMN new_column INT;
SELECT citus.mitmproxy('conn.allow()'); -- the conn.allow() in question
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass; -- this happens locally

-- cancel just in the first
-- response we get from the worker
SELECT citus.mitmproxy('conn.onAuthenticationOk().cancel(' || pg_backend_pid() || ')'); -- reconfigure
ALTER TABLE test_table ADD COLUMN new_column INT; -- runs under the new rules
SELECT citus.mitmproxy('conn.allow()');
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass;

There is another ALTER TABLE, but that doesn't happen until we run citus.mitmproxy again, so the conn.allow() never has a chance to be used!

This isn't very important, if you think it makes things more clear feel free to leave it in, just wanted to point this out.
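The point about conn.allow() can be illustrated with a toy model. This is only a sketch, assuming (as the comment above implies) that each citus.mitmproxy call replaces the previously configured rule; `ToyProxy`, `configure` and `worker_statement` are hypothetical names and not part of the real test framework.

```python
class ToyProxy:
    """Hypothetical stand-in: holds one active rule, replaced on every configure()."""

    def __init__(self):
        self.rule = "allow"
        self.log = []

    def configure(self, rule):
        # Assumption: a new rule overwrites the old one entirely.
        self.rule = rule

    def worker_statement(self, statement):
        # Record which rule was active when traffic reached the proxy.
        self.log.append((statement, self.rule))


proxy = ToyProxy()
proxy.configure("kill")
proxy.worker_statement("ALTER TABLE (first)")
proxy.configure("allow")   # the conn.allow() in question
# The count(*) on table_attrs runs locally, so nothing passes through the proxy here.
proxy.configure("cancel")  # reconfigure before any worker traffic
proxy.worker_statement("ALTER TABLE (second)")

print(proxy.log)
```

In this model no worker statement is ever logged under the "allow" rule, which is why the intermediate call is harmless but unnecessary.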

lithp · 2018-07-11T20:32:48Z

src/test/regress/expected/failure_ddl.out

+
+(1 row)
+
+SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass;


I'd never seen table_attrs before, this is nice!

It took me a few minutes to figure out what was happening though, it might be more understandable to write:

SELECT array_agg(name::text) FROM public.table_attrs where relid = 'test_table'::regclass;

I'd never seen table_attrs before, this is nice!

Just FYI, this is Citus' test helper function, see here. I think the main motivation for adding it was that Postgres' output changes between versions; with these views we can skip adding _0. files for such minor differences.

it might be more understandable to write:

Makes sense

lithp · 2018-07-11T20:36:38Z

src/test/regress/expected/failure_ddl.out

+(1 row)
+
+-- kill as soon as the coordinator sends COMMIT
+SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").kill()');


Careful! This doesn't test what you think it does. COMMIT is a regex which also matches "BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED". I think what you want is ^COMMIT.

If you use ^COMMIT, then you'll get some shards with the new metadata and some with the old metadata.

Oops, I missed/forgot that. I'll fix all the occurrences in the file. It's too bad that we missed various cases because of this bug in the test :/
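The difference between the two patterns can be checked with a few lines of Python. This is a minimal sketch, assuming the proxy applies the pattern with search semantics (a match anywhere in the query text), as the comment above implies; the statement list and the transaction id are made up for illustration.

```python
import re

# Illustrative statements a coordinator might send to a worker.
statements = [
    "BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED",
    "COMMIT",
    "COMMIT PREPARED 'some_transaction_id'",
]


def matching(pattern, queries):
    """Return the queries in which the pattern matches anywhere in the text."""
    return [q for q in queries if re.search(pattern, q)]


unanchored = matching("COMMIT", statements)
anchored = matching("^COMMIT", statements)

print(unanchored)
print(anchored)
```

The unanchored pattern matches all three statements, including the BEGIN, because "COMMITTED" contains "COMMIT". The anchored pattern skips the BEGIN, but note that it still matches COMMIT PREPARED.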

lithp · 2018-07-11T20:37:19Z

src/test/regress/expected/failure_ddl.out

+(4 rows)
+
+-- cancel as soon as the coordinator sends COMMIT
+SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").cancel(' ||  pg_backend_pid() || ')');


Same thing here, should be ^COMMIT

When I change it to ^COMMIT the query isn't cancelled which... I guess that's probably okay? I suppose we should at least still have a test for it, to document that you can't cancel after that point.

Yes, during COMMIT/ROLLBACK events the interrupts are held, so it is expected. I'll add the following comment:

-- interrupts are held during COMMIT/ROLLBACK, so the command
-- should have been applied without any issues since cancel is ignored

lithp · 2018-07-12T01:25:55Z

src/test/regress/expected/failure_ddl.out

+ (localhost,57640,100803,t,2)
+(4 rows)
+
+-- we should be able to revocer the transaction and


"revocer" -> "recover"

While following along by doing it by hand I accidentally noticed that recover_prepared_transactions() probably has a bug. If I leave in the part which kills connections on ^COMMIT:

brian=# SELECT recover_prepared_transactions();
WARNING:  server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
WARNING:  connection not open
 recover_prepared_transactions
-------------------------------
                             0
(1 row)

It returns the number of transactions which it was able to recover, but doesn't give any indication that there are still transactions left to recover!

Yeah, I remember this. But aren't the WARNINGs an important indication that the recovery needs to be re-run?

I remember that during the PR for recover_prepared_transactions() we decided to warn the user about connection errors but not fail the recovery, since we could potentially recover prepared transactions from the other workers.

Do you see any other approaches for that?

Why not error out here? Recovery has failed! Warnings are something you can ignore, here it feels like we're suppressing an important failure. Until recover_prepared_transactions() runs all your queries on this table are going to block and you won't know why. If we want to continue and try to recover prepared transactions from the other workers, we can still try to do that before we raise the ERROR.

My complaints aren't with this PR though, I guess I should open a ticket.

If we want to continue and try to recover prepared transactions from the other workers, we can still try to do that before we raise the ERROR.

Yes, that would be a nice improvement.

Just note that we've got the exact same approach with deadlock detection, where we warn the user if we cannot connect to one of the nodes. We should probably improve both!
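The "keep going, then raise" idea suggested above could look roughly like this. It is a sketch of the control flow only, not of how recover_prepared_transactions() is implemented; `recover_all`, `recover_one`, `RecoveryError` and the worker names are all hypothetical.

```python
class RecoveryError(Exception):
    """Raised when at least one worker could not be reached during recovery."""


def recover_all(workers, recover_one):
    """Try every worker first, then raise if any of them failed.

    recover_one(worker) is a hypothetical callable that returns the number of
    prepared transactions recovered on that worker, or raises ConnectionError.
    """
    recovered = 0
    failed = []
    for worker in workers:
        try:
            recovered += recover_one(worker)
        except ConnectionError:
            # Remember the failure, but still try the remaining workers.
            failed.append(worker)
    if failed:
        names = ", ".join(failed)
        raise RecoveryError(
            f"recovered {recovered} transaction(s), but could not reach: {names}"
        )
    return recovered


def fake_recover(worker):
    # Simulated worker behaviour: worker-2 is unreachable.
    if worker == "worker-2":
        raise ConnectionError("connection not open")
    return 1


try:
    recover_all(["worker-1", "worker-2", "worker-3"], fake_recover)
except RecoveryError as err:
    message = str(err)
    print(message)
```

With this shape the reachable workers are still recovered, but the caller cannot overlook that recovery is incomplete.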

lithp · 2018-07-12T01:30:26Z

src/test/regress/expected/failure_ddl.out

+ (localhost,57640,100807,t,2)
+(8 rows)
+
+-- we should be able to revocer the transaction and


"revocer" -> "recover"

lithp · 2018-07-12T01:31:49Z

src/test/regress/expected/failure_ddl.out

+(8 rows)
+
+-- we should be able to revocer the transaction and
+-- see that the command is rollbacked


See that what command is rollbacked? Above, all shards have two columns, and below they also have two columns.

Two things:

There was an extra SELECT run_command_...() call, which I've removed

Expanded the comment to

-- we should be able to recover the transaction and
-- see that the command is rollbacked on all workers
-- note that in this case recover_prepared_transactions()
-- sends ROLLBACK PREPARED to the workers given that
-- the transaction has not been commited on any placement yet

lithp · 2018-07-12T01:40:41Z

src/test/regress/expected/failure_ddl.out

+-- finally, test failing on ROLLBACK with 2CPC
+-- fail just after the coordinator sends the ROLLBACK
+-- so the command can be rollbacked
+SELECT citus.mitmproxy('conn.onQuery(query="ROLLBACK").kill()');


I still think this is not a particularly useful test. We're suppressing WARNING so we're not really testing any of the coordinator's behavior. And by killing connections all we're exercising is the Postgres behavior: automatically rolling back when the backend ends.

Well, these are all to make sure that we won't break anything in the future. We've had various bugs while doing ROLLBACK. For example, we're still exercising that Citus can send the ROLLBACK to the other workers without any issues. Or we make sure that the rollback handler in Citus can gracefully handle failures during ROLLBACK when there is 2PC.

I see this test file as the main one for exercising almost all possible failure scenarios in transaction management. So, it wouldn't hurt to have some not very useful tests as well?

Okay, good points.

lithp · 2018-07-12T01:41:48Z

src/test/regress/expected/failure_ddl.out

+(1 row)
+
+SET search_path TO 'public';
+DROP SCHEMA ddl_failure CASCADE;


You should also drop test_table :)

DROP SCHEMA cascades to tables created in that schema. I think you might be confused because the log level is error and we cannot see the cascade notice messages 😄

And another good point, this is a really nice idea, we can create whatever we want and not keep track of it.

This commit adds extensive failure testing, which covers quite a few things and their combinations:

- 1PC vs 2PC
- Replication factor 1 and Replication factor 2
- Network failures and query cancellations
- Sequential vs Parallel query execution mode

onderkalaci

I've applied your feedback. I think there are a few minor points where we think slightly differently:

The existence of some tests that are not very useful
The log level being ERROR; I also wish we had included your commit enabling consistent outputs in the framework PR

I tried to reply to those in the comments; feel free to discuss further.

onderkalaci · 2018-07-12T07:33:39Z

src/test/regress/expected/failure_ddl.out

+
+(1 row)
+
+SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass;


I'd never seen table_attrs before, this is nice!

Just FYI, this is Citus' test helper function, see here. I think the main motivation for adding it was that Postgres' output changes between versions; with these views we can skip adding _0. files for such minor differences.

it might be more understandable to write:

Makes sense

onderkalaci · 2018-07-12T07:41:49Z

src/test/regress/expected/failure_ddl.out

+(1 row)
+
+SET search_path TO 'public';
+DROP SCHEMA ddl_failure CASCADE;


DROP SCHEMA cascades to tables created in that schema. I think you might be confused because the log level is error and we cannot see the cascade notice messages 😄

onderkalaci · 2018-07-12T08:55:24Z

src/test/regress/expected/failure_ddl.out

+DETAIL:  server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+SELECT citus.mitmproxy('conn.allow()');


I'm a bit confused with this. There is one more ALTER TABLE command below. Don't we need conn.allow() before doing that?

onderkalaci · 2018-07-12T08:57:29Z

src/test/regress/expected/failure_ddl.out

+     2
+(1 row)
+
+SELECT run_command_on_placements('test_table', $$SELECT count(*) FROM public.table_attrs where relid = '%s'::regclass$$) ORDER BY 1;


I converted them to $$SELECT array_agg(name::text ORDER BY name::text) FROM public.table_attrs where relid = '%s'::regclass;$$ as well for readability.

onderkalaci · 2018-07-12T09:04:26Z

src/test/regress/expected/failure_ddl.out

+
+-- but now kill just after the worker sends response to 
+-- ROLLBACK command, so we'll have lots of warnings but the command
+-- should have been rollbacked both on the distributed table and the placements


I prefer to keep both of them, in case we introduce a subtle bug that leads to crashes or does something wrong at some point.

onderkalaci · 2018-07-12T09:50:51Z

src/test/regress/expected/failure_ddl.out

+(4 rows)
+
+-- cancel as soon as the coordinator sends COMMIT
+SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").cancel(' ||  pg_backend_pid() || ')');


Yes, during COMMIT/ROLLBACK events the interrupts are held, so it is expected. I'll add the following comment:

-- interrupts are held during COMMIT/ROLLBACK, so the command
-- should have been applied without any issues since cancel is ignored

onderkalaci · 2018-07-12T09:53:17Z

src/test/regress/expected/failure_ddl.out

+
+(1 row)
+
+ALTER TABLE test_table ADD COLUMN new_column INT;


makes sense

onderkalaci · 2018-07-12T09:54:38Z

src/test/regress/expected/failure_ddl.out

+
+BEGIN;
+ALTER TABLE test_table DROP COLUMN new_column;
+ROLLBACK;


Makes sense; these tests would end up with consistent test outputs.

onderkalaci · 2018-07-12T09:57:10Z

src/test/regress/expected/failure_ddl.out

+(4 rows)
+
+-- killing on command complete of COMMIT PREPARE, we should see that the command succeeds
+-- and all the workers committed


Now I understand your point of having prepared transactions with consistent output better. It sounds a little bit late to have that. But, we could open an issue to track it in the future?

onderkalaci · 2018-07-12T10:02:27Z

src/test/regress/expected/failure_ddl.out

+ (localhost,57640,100803,t,2)
+(4 rows)
+
+-- we should be able to revocer the transaction and


Yeah, I remember this. But aren't the WARNINGs an important indication that the recovery needs to be re-run?

I remember that during the PR for recover_prepared_transactions() we decided to warn the user about connection errors but not fail the recovery, since we could potentially recover prepared transactions from the other workers.

Do you see any other approaches for that?

mtuncer · 2018-07-13T08:28:58Z

Some views are not cleaned up

onderkalaci · 2018-07-13T08:42:23Z

Some views are not cleaned up

Those are added via multi_test_helpers.sql at the beginning of the regression tests.

Another note from above:

Just FYI, this is Citus' test helper function, see here. I think the main motivation for adding it was that Postgres' output changes between versions; with these views we can skip adding _0. files for such minor differences.

onderkalaci requested a review from lithp June 12, 2018 12:47

onderkalaci mentioned this pull request Jun 13, 2018

Always throw errors on failure on critical connection in router executor #2215

Merged

lithp force-pushed the mitmproxy-failure-testing branch 3 times, most recently from 8eea61c to 9133912 Compare June 21, 2018 20:48

onderkalaci force-pushed the ddl_failure_testing branch from 17769ef to a8bf960 Compare June 26, 2018 11:05

onderkalaci mentioned this pull request Jun 26, 2018

Mitmproxy-based automated failure testing #2119

Merged

16 tasks

lithp force-pushed the mitmproxy-failure-testing branch 3 times, most recently from 3121431 to 6c01440 Compare June 29, 2018 02:13

onderkalaci force-pushed the ddl_failure_testing branch from a8bf960 to 2795a93 Compare July 2, 2018 06:57

metdos mentioned this pull request Jul 6, 2018

Add tests for 1PC COPY on append and hash-distributed tables #2244

Merged

lithp force-pushed the mitmproxy-failure-testing branch from 6c01440 to 3e309e3 Compare July 6, 2018 18:51

lithp force-pushed the ddl_failure_testing branch from 2795a93 to b84f481 Compare July 6, 2018 19:20

furkansahin reviewed Jul 10, 2018

lithp suggested changes Jul 12, 2018

metdos mentioned this pull request Jul 12, 2018

Automate Remaining Failure/Cancellation Tests #2262

Open

5 tasks

onderkalaci commented Jul 12, 2018

onderkalaci force-pushed the ddl_failure_testing branch from b84f481 to a446e71 Compare July 12, 2018 10:09

onderkalaci changed the base branch from mitmproxy-failure-testing to master July 12, 2018 10:09

lithp approved these changes Jul 12, 2018

onderkalaci merged commit 5433295 into master Jul 13, 2018

marcocitus deleted the ddl_failure_testing branch July 20, 2018 05:07

