Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DDL failure testing #2212

Merged
merged 1 commit into from
Jul 13, 2018
Merged

DDL failure testing #2212

merged 1 commit into from
Jul 13, 2018

Conversation

onderkalaci
Copy link
Contributor

@onderkalaci onderkalaci commented Jun 12, 2018

This commit adds an extensive failure testing, which covers quite
a bit of thing and their combinations:

  • 1PC vs 2PC
  • Replication factor 1 and Replication factor 2
  • Network failures and query cancellations
  • Parallel vs Sequential mode

I don't think we need to add that many tests for all types of queries. This PR seems to cover a lot of cases in the connection and transaction management.

Missing item:

  • Add tests for sequential mode once @lithp rebases mitmproxy-failure-testing branch to the master

@onderkalaci onderkalaci requested a review from lithp June 12, 2018 12:47
@lithp lithp force-pushed the mitmproxy-failure-testing branch 3 times, most recently from 8eea61c to 9133912 Compare June 21, 2018 20:48
@onderkalaci onderkalaci force-pushed the ddl_failure_testing branch from 17769ef to a8bf960 Compare June 26, 2018 11:05
@lithp lithp force-pushed the mitmproxy-failure-testing branch 3 times, most recently from 3121431 to 6c01440 Compare June 29, 2018 02:13
@onderkalaci onderkalaci force-pushed the ddl_failure_testing branch from a8bf960 to 2795a93 Compare July 2, 2018 06:57
@codecov
Copy link

codecov bot commented Jul 2, 2018

Codecov Report

Merging #2212 into mitmproxy-failure-testing will increase coverage by 0.13%.
The diff coverage is n/a.

@@                      Coverage Diff                      @@
##           mitmproxy-failure-testing    #2212      +/-   ##
=============================================================
+ Coverage                       93.7%   93.84%   +0.13%     
=============================================================
  Files                            110      110              
  Lines                          27453    27453              
=============================================================
+ Hits                           25726    25762      +36     
+ Misses                          1727     1691      -36

2
(1 row)

-- cance; as soon as the coordinator sends worker_apply_shard_ddl_command
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cance; -> cancel

Copy link
Contributor

@lithp lithp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there, sorry for taking forever to get to this! It mostly looks great, I've pointed out a couple typos and just one mistake: You should have used ^COMMIT instead of COMMIT.

I think the WARNINGs are a real behavior which should be tested for when possible, it's important the user know that something went wrong, but I understand wanting to hide things like prepared transaction ids.

Marco knows better than I re how much duplication of tests for our transaction management makes sense. I think this file is great, but now that we have it the other tests probably don't need to add their own tests for transaction management.

@@ -0,0 +1,1228 @@
--
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is very comprehensive!

DETAIL: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
SELECT citus.mitmproxy('conn.allow()');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't really need conn.allow() if you're not going to make any connections to the workers while you setup for the next test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused with this. There is one more ALTER TABLE command below. Don't we need conn.allow() before doing that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From failure_ddl.sql:

-- in the first test, kill just in the first 
-- response we get from the worker
SELECT citus.mitmproxy('conn.onAuthenticationOk().kill()');
ALTER TABLE test_table ADD COLUMN new_column INT;
SELECT citus.mitmproxy('conn.allow()'); -- the conn.allow() in question
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass; -- this happens locally

-- cancel just in the first 
-- response we get from the worker
SELECT citus.mitmproxy('conn.onAuthenticationOk().cancel(' ||  pg_backend_pid() || ')'); -- reconfigure
ALTER TABLE test_table ADD COLUMN new_column INT; -- runs under the new rules
SELECT citus.mitmproxy('conn.allow()');
SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass;

There is another ALTER TABLE, but that doesn't happen until we run citus.mitmproxy again, the conn.allow() never has a change to be used!

This isn't very important, if you think it makes things more clear feel free to leave it in, just wanted to point this out.


(1 row)

SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd never seen table_attrs before, this is nice!

It took me a few minutes to figure out what was happening though, it might be more understandable to write:

SELECT array_agg(name::text) FROM public.table_attrs where relid = 'test_table'::regclass;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd never seen table_attrs before, this is nice!

Just FYI, this is Citus' test helper function see here. I think the main motivation to add that was Postgres' ouput changes between different versions and with this views we could skip adding _0. files for such minor differences.

it might be more understandable to write:

Makes sense

(1 row)

-- kill as soon as the coordinator sends COMMIT
SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").kill()');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Careful! This doesn't test what you think it does. COMMIT is a regex which also matches "BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED". I think what you want is ^COMMIT.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use ^COMMIT, then you'll get some shards with the new metadata and some with the old metadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ops, I've missed/forgot that. I'll fix all the occurrences in the file. That's too bad for me that we've missed various cases with this bug in the test :/

(4 rows)

-- cancel as soon as the coordinator sends COMMIT
SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").cancel(' || pg_backend_pid() || ')');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, should be ^COMMIT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I change it to ^COMMIT the query isn't cancelled which... I guess that's probably okay? I suppose we should at least still have a test for it, to document that you can't cancel after that point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, during COMMIT/ROLLBACK events the interrupts are held, so it is expected. I'll add the following comment:

-- interrupts are held during COMMIT/ROLLBACK, so the command 
-- should have been applied without any issues since cancel is ignored

(localhost,57640,100803,t,2)
(4 rows)

-- we should be able to revocer the transaction and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"revocer" -> "recover"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While following along by doing it by hand I accidentally noticed that recover_prepared_transactions() probably has a bug. If I leave in the part which kills connections on ^COMMIT:

brian=# SELECT recover_prepared_transactions();
WARNING:  server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
WARNING:  connection not open
 recover_prepared_transactions 
-------------------------------
                             0
(1 row)

It returns the number of transactions which it was able to recover, but doesn't give any indication that there are still transactions left to recover!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I remember this. But, aren't the WARNINGs important indication for the necessity to re-run the recovery?

I remember during the PR for recover_prepared_transactions () we decided to warn the user about the connection errors, but, do not fail the recovery at all since we could potentially recover prepared transactions from the other workers.

Do you see any other approaches for that?

Copy link
Contributor

@lithp lithp Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not error out here? Recovery has failed! Warnings are something you can ignore, here it feels like we're suppressing an important failure. Until recover_prepared_transactions() runs all your queries on this table are going to block and you won't know why. If we want to continue and try to recover prepared transactions from the other workers, we can still try to do that before we raise the ERROR.

My complaints aren't with this PR though, I guess I should open a ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to continue and try to recover prepared transactions from the other workers, we can still try to do that before we raise the ERROR.

Yes, that would be a nice improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just note that we've got the exact same approach with deadlock detection where we warn the user if we cannot connect one of the nodes. We should probably improve both!

(localhost,57640,100807,t,2)
(8 rows)

-- we should be able to revocer the transaction and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"revocer" -> "recover"

(8 rows)

-- we should be able to revocer the transaction and
-- see that the command is rollbacked
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see that what command is rollbacked? Above all shards have two columns, and below they also have two columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  • There was an extra SELECT run_command_...() call, which I've removed
  • Expanded the comment to
-- we should be able to recover the transaction and
-- see that the command is rollbacked on all workers
-- note that in this case recover_prepared_transactions()
-- sends ROLLBACK PREPARED to the workers given that 
-- the transaction has not been commited on any placement yet

-- finally, test failing on ROLLBACK with 2CPC
-- fail just after the coordinator sends the ROLLBACK
-- so the command can be rollbacked
SELECT citus.mitmproxy('conn.onQuery(query="ROLLBACK").kill()');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this is not a particularly useful test. We're suppressing WARNING so we're not really resting any of the coordinator's behavior. And by killing connections all we're exercising is the Postgres behavior: automatically rolling back when the backend ends.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, these are all to make sure that we won't break anything in the future. We've had various bugs while doing ROLLBACK. For example, we're still exercising that Citus can send the ROLLBACK to the other workers without any issues. Or, make sure that rollback handler in Citus can gracefully handle the failures during ROLLBACK when there is 2PC.

I see this test file as the main one for exercising almost all possible failure scenarious on transaction management. So, it wouldn't hurt to have some not very useful tests as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, good points.

(1 row)

SET search_path TO 'public';
DROP SCHEMA ddl_failure CASCADE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also drop test_table :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DROP SCHEMA cascades to tables created in that schema. I think might be confused because the log level is error and we cannot see the cascade notice messages 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And another good point, this is a really nice idea, we can create whatever we want and not keep track of it.

This commit adds an extensive failure testing, which covers quite
a bit of things and their combinations:
   - 1PC vs 2PC
   - Replication factor 1 and Replication factor 2
   - Network failures and query cancellations
   - Sequential vs Parallel query execution mode
Copy link
Contributor Author

@onderkalaci onderkalaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've applied your feedback. I think the minor points we think slightly differently:

  • The existence of some tests that are not very useful
  • Log level being ERROR, which I also wish that we've added your commit to enable consistent outputs in the framework PR

I tried to reply those in the comments, feel free to discuss more


(1 row)

SELECT count(*) FROM public.table_attrs where relid = 'test_table'::regclass;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd never seen table_attrs before, this is nice!

Just FYI, this is Citus' test helper function see here. I think the main motivation to add that was Postgres' ouput changes between different versions and with this views we could skip adding _0. files for such minor differences.

it might be more understandable to write:

Makes sense

(1 row)

SET search_path TO 'public';
DROP SCHEMA ddl_failure CASCADE;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DROP SCHEMA cascades to tables created in that schema. I think might be confused because the log level is error and we cannot see the cascade notice messages 😄

DETAIL: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
SELECT citus.mitmproxy('conn.allow()');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused with this. There is one more ALTER TABLE command below. Don't we need conn.allow() before doing that?

2
(1 row)

SELECT run_command_on_placements('test_table', $$SELECT count(*) FROM public.table_attrs where relid = '%s'::regclass$$) ORDER BY 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I converted them to $$SELECT array_agg(name::text ORDER BY name::text) FROM public.table_attrs where relid = '%s'::regclass;$$ as well for ease of read.


-- but now kill just after the worker sends response to
-- ROLLBACK command, so we'll have lots of warnings but the command
-- should have been rollbacked both on the distributed table and the placements
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep both them in case we introduce a subtle bug that leads to crashes or does something wrong at some point?

(4 rows)

-- cancel as soon as the coordinator sends COMMIT
SELECT citus.mitmproxy('conn.onQuery(query="COMMIT").cancel(' || pg_backend_pid() || ')');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, during COMMIT/ROLLBACK events the interrupts are held, so it is expected. I'll add the following comment:

-- interrupts are held during COMMIT/ROLLBACK, so the command 
-- should have been applied without any issues since cancel is ignored


(1 row)

ALTER TABLE test_table ADD COLUMN new_column INT;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense


BEGIN;
ALTER TABLE test_table DROP COLUMN new_column;
ROLLBACK;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense these tests would end up with consistent test outputs.

(4 rows)

-- killing on command complete of COMMIT PREPARE, we should see that the command succeeds
-- and all the workers committed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I understand your point of having prepared transactions with consistent output better. It sounds a little bit late to have that. But, we could open an issue to track it in the future?

(localhost,57640,100803,t,2)
(4 rows)

-- we should be able to revocer the transaction and
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I remember this. But, aren't the WARNINGs important indication for the necessity to re-run the recovery?

I remember during the PR for recover_prepared_transactions () we decided to warn the user about the connection errors, but, do not fail the recovery at all since we could potentially recover prepared transactions from the other workers.

Do you see any other approaches for that?

@onderkalaci onderkalaci force-pushed the ddl_failure_testing branch from b84f481 to a446e71 Compare July 12, 2018 10:09
@onderkalaci onderkalaci changed the base branch from mitmproxy-failure-testing to master July 12, 2018 10:09
@onderkalaci onderkalaci merged commit 5433295 into master Jul 13, 2018
@mtuncer
Copy link
Member

mtuncer commented Jul 13, 2018

Some views are not cleaned up

\d
List of relations
Schema | Name | Type | Owner
--------+-----------------+------+----------
public | table_attrs | view | postgres
public | table_checks | view | postgres
public | table_desc | view | postgres
public | table_fkey_cols | view | postgres
public | table_fkeys | view | postgres
(5 rows)

@onderkalaci
Copy link
Contributor Author

Some views are not cleaned up

Those are added via multi_test_helpers.sql at the beginning of the regression tests.

Another note from above:

Just FYI, this is Citus' test helper function see here. I think the main motivation to add that was Postgres' ouput changes between different versions and with this views we could skip adding _0. files for such minor differences.

@marcocitus marcocitus deleted the ddl_failure_testing branch July 20, 2018 05:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants