Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tests for 1PC COPY on append and hash-distributed tables #2244

Merged
merged 2 commits into from
Jul 31, 2018

Conversation

lithp
Copy link
Contributor

@lithp lithp commented Jun 29, 2018

No description provided.

@lithp lithp requested a review from onderkalaci June 29, 2018 02:08
@lithp lithp force-pushed the mitmproxy-failure-testing branch from 3121431 to 6c01440 Compare June 29, 2018 02:13
@lithp lithp force-pushed the failure-1pc-copy branch from dea22c4 to c723121 Compare July 2, 2018 04:10
@codecov
Copy link

codecov bot commented Jul 2, 2018

Codecov Report

Merging #2244 into master will decrease coverage by 0.09%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master    #2244     +/-   ##
=========================================
- Coverage   93.78%   93.68%   -0.1%     
=========================================
  Files         110      106      -4     
  Lines       27576    26638    -938     
=========================================
- Hits        25861    24956    -905     
+ Misses       1715     1682     -33

Copy link
Contributor

@onderkalaci onderkalaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one general open question for adding the failure testing is the following:

Do we need to add tests for all possible scenarios for all tests (e.g., #2212) including
* Replication factor 1 vs 2
* Parallel vs Sequential mode (if applicable)
* 1PC vs 2PC
* Cancel tests for each step as we do for failures

If we add all the tests, we'd end up with a huge duplication since we'd be exercising the same code-paths over and over and over again. On the other hand, we're going add these tests only once and we'd be able forget about almost all network failure scenarios if we have all those tests.

What would be your thoughts on this? I'm also curious what @marcocitus would think about that.

SELECT * FROM citus.dump_network_traffic();
conn | source | message
------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 | coordinator | [initial message]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that this network dump is super useful and have it during development or adding new tests is very useful.

But, travis sometimes fail due to place of + 0 | coordinator | ['Terminate()'].

Some random idea to not get into this state in general: Shall we (or could we) filter only the packets that we're going to test from dump_network_traffic ()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, travis sometimes fail due to place of + 0 | coordinator | ['Terminate()'].

This is a bug that should be fixed. I guess it's a timing issue. mitmdump should ignore any packets from connections which existed before the last time the buffer was cleared. I can fix it tomorrow!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the packet trace for now.

CONTEXT: while executing command on localhost:57640
ERROR: failure on connection marked as essential: localhost:57640
SELECT * FROM pg_dist_shard s, pg_dist_shard_placement p
WHERE (s.shardid = p.shardid) AND s.logicalrelid = 'copy_test'::regclass;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add ORDER BY to all queries selecting from pg_dist* to be on the safe side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, pushed a new commit where we always sort by placementid.

@metdos
Copy link
Contributor

metdos commented Jul 6, 2018

I think one general open question for adding the failure testing is the following:

Do we need to add tests for all possible scenarios for all tests (e.g., #2212) including

  • Replication factor 1 vs 2
  • Parallel vs Sequential mode (if applicable)
  • 1PC vs 2PC
  • Cancel tests for each step as we do for failures

If we add all the tests, we'd end up with a huge duplication since we'd be exercising the same code-paths over and over and over again. On the other hand, we're going add these tests only once and we'd be able forget about almost all network failure scenarios if we have all those tests.

What would be your thoughts on this? I'm also curious what @marcocitus would think about that.

Hey @marcocitus, any ideas on this? We will start to write automated tests next week, so if you have some comments, please let us know.

@marcocitus
Copy link
Member

we'd be able forget about almost all network failure scenarios

At some point adding a lot more failure tests using the current framework will have diminishing returns, while there are other areas that are not yet covered. In particular network timeouts and network failures behind a pgbouncer (e.g. BEGIN results in a query_wait_timeout error).

More on topic: The way 1PC vs. 2PC works in COPY is not different than in other parts of the code. It might make sense to dedicate a test file specifically to commit/abort failures in 1PC and 2PC, while not doing commit-time fault-injection in other tests.

Some code paths with different failure handling logic:

  • COPY
  • DDL & multi-shard DML with parallel and sequential execution
  • Bare DDL (vacuum, create index concurrently)
  • Multi-shard SELECT
  • Single-shard SELECT
  • Single-shard DML outside a transaction block
  • Single-shard DML in a transaction block
  • Multi-statement transactions that span across multiple shards
  • DML on reference table
  • Commit/abort/prepare
  • Begin
  • Savepoints

Ideally, we would test all of them with replication vs. no replication, and test cancellation. I don't think there are other dimensions we need to consider.

@lithp lithp force-pushed the mitmproxy-failure-testing branch from 6c01440 to 3e309e3 Compare July 6, 2018 18:51
@metdos
Copy link
Contributor

metdos commented Jul 8, 2018

@lithp, could you rebase this to the master and make ready to merge?

@metdos
Copy link
Contributor

metdos commented Jul 8, 2018

Are these tests checked in #2262?

@metdos
Copy link
Contributor

metdos commented Jul 10, 2018

Just noting that other tests are blocked on this for Travis, so it would be good if prioritize this over others :)

(4 rows)

-- the COPY data fails
SELECT citus.mitmproxy('conn.onCopyData().killall()');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add one more step

SELECT citus.mitmproxy('conn.onCommandComplete(command="COPY 4").kill()');

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a lot of value in this test because, from the coordinator's POV, it looks exactly the same as the conn.onCopyData().killall() test, but I've just pushed a commit which adds it.

@sumedhpathak
Copy link
Contributor

@lithp Could you follow-up on the action-items and close this PR out?

@lithp lithp force-pushed the failure-1pc-copy branch from c723121 to bbbe9ed Compare July 26, 2018 22:28
@lithp lithp changed the base branch from mitmproxy-failure-testing to master July 26, 2018 22:28
@lithp
Copy link
Contributor Author

lithp commented Jul 26, 2018

@onderkalaci I've rebased on master and addressed, I think, all your comments.

Copy link
Contributor

@onderkalaci onderkalaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are pretty comprehensive, and all looks good, after considering two minor comments :shipit:

(Also, I think you could consider merging the ephemeral port PR first and update the test outputs here).


-- ==== okay, run some tests where there's only one active shard ====
COPY copy_test FROM PROGRAM 'echo 0, 0 && echo 1, 1 && echo 2, 4 && echo 3, 9' WITH CSV;
SELECT * FROM copy_test;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Shall we add ORDER BY? Though the data comes from a single shard, there is a possibility of ending up with a different order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added!

CONTEXT: while executing command on localhost:57640
WARNING: connection not open
CONTEXT: while executing command on localhost:57640
ERROR: failure on connection marked as essential: localhost:57640
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty minor: this test took some time for me to digest. Since we've not printed placements just after the successful COPY operation 2 lines before, I thought oh, we're erroring out but also creating the placements successfully, what's happening 😄

So, could be useful to print the placements just before starting the test as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, added!

@lithp lithp force-pushed the failure-1pc-copy branch 3 times, most recently from 81a9493 to c23bd5c Compare July 31, 2018 21:27
@lithp lithp force-pushed the failure-1pc-copy branch from c23bd5c to ea485ab Compare July 31, 2018 21:57
@lithp lithp merged commit 82fa85f into master Jul 31, 2018
@marcocitus marcocitus deleted the failure-1pc-copy branch October 21, 2019 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants