Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always throw errors on failure on critical connection in router executor #2215

Merged
merged 4 commits into from
Jun 14, 2018

Conversation

marcocitus
Copy link
Member

@marcocitus marcocitus commented Jun 12, 2018

DESCRIPTION: Fixes a bug that could cause transactions to incorrectly proceed after failure

Fixes #2214

Citus currently doesn't throw an error when a failure occurs in the router executor on a connection that was marked as critical. This can cause transactions to incorrectly proceed after a failure on a connection that's supposed to complete a 2PC due to commands earlier in the transaction. The commit handler is not ready for that situation and panics.

This PR fixes it by checking whether the connection is critical when an error response is returned, or throwing an error immediately if a connection failure occurs.

As a side-effect, this also caused some error messages in queries with CTEs to become more meaningful.

I've also defensively moved the call to CheckTransactionHealth out of CoordinatedRemoteTransactionsCommit because that created a code path where we could throw an error in the commit handler.

@marcocitus marcocitus force-pushed the fix_ref_table_failure branch 4 times, most recently from 86d37ea to ce6dbe8 Compare June 12, 2018 18:47
@codecov
Copy link

codecov bot commented Jun 12, 2018

Codecov Report

Merging #2215 into master will increase coverage by 0.07%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #2215      +/-   ##
==========================================
+ Coverage   93.64%   93.71%   +0.07%     
==========================================
  Files         103      102       -1     
  Lines       26385    26286      -99     
==========================================
- Hits        24707    24634      -73     
+ Misses       1678     1652      -26

@onderkalaci
Copy link
Contributor

Just FYI: This seems like a good candidate to add a test using the new failure testing framework. We have not checked in the failure testing framework, but, after we merge this PR, it should be easy to add a test via a new PR as #2212

@pykello pykello self-requested a review June 13, 2018 15:13
Copy link
Contributor

@pykello pykello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked different code paths which called StoreQueryResult() and ConsumeQueryResult() with failOnError = false to see if raising errors in these cases can be problematic.

I couldn't find any logical problems, but I think this hurts the readability a bit, since reader expects no errors to be thrown when this flag is false, but errors can be thrown.


HandleRemoteTransactionConnectionError(connection, raiseErrors);
HandleRemoteTransactionConnectionError(connection, raiseIfTransactionIsCritical);
return false;
}

singleRowMode = PQsetSingleRowMode(connection->pgConn);
if (singleRowMode == 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For later: this should be an Assert(). Based on the documentation, PQsetSingleRowMode() should always return 1 if called immediately after PQsendQuery(), otherwise it is a programming error and returns 0.

@@ -1581,7 +1585,8 @@ StoreQueryResult(CitusScanState *scanState, MultiConnection *connection,
category = ERRCODE_TO_CATEGORY(ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION);
isConstraintViolation = SqlStateMatchesCategory(sqlStateString, category);

if (isConstraintViolation || failOnError)
if (isConstraintViolation || failOnError ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

failOnError is misnamed here, since there are cases that this can be false but we still fail. We should at least update the comments for StoreQueryResult() and ConsumeQueryResult().

@marcocitus marcocitus force-pushed the fix_ref_table_failure branch from e095ea0 to f5065b6 Compare June 14, 2018 20:31
@marcocitus marcocitus force-pushed the fix_ref_table_failure branch from f5065b6 to 0bbe778 Compare June 14, 2018 20:35
@marcocitus marcocitus merged commit 4686d49 into master Jun 14, 2018
@marcocitus marcocitus deleted the fix_ref_table_failure branch June 14, 2018 20:48
@marcocitus
Copy link
Member Author

Note: I did some renaming based on Hadi's feedback, but only the first 3 commits should be backported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reference table mid-transaction failure may cause shardstate 3 and PANIC
4 participants