Fix query failure race to preserve original failure cause #20582

Merged
findepi merged 3 commits into trinodb:master from findepi/rgroups
Feb 6, 2024

Conversation

findepi
Member

@findepi findepi commented Feb 5, 2024

Before the change there was a race around
QueryStateMachine.transitionToFailed:

  • Thread A: failure occurs, calls transitionToFailed with a specific
    exception (the original failure cause)
  • Thread A: calls cleanupQueryQuietly, which calls listeners, which
    e.g. unregisters a query from LanguageFunctionManager
  • Thread B is within StatementAnalyzer and calls
    LanguageFunctionManager.getQueryFunctions, which fails with
    IllegalStateException
  • Thread B catches the exception and fails the query; calls
    transitionToFailed
  • Thread B sets failureCause to IllegalStateException caught above
  • Thread A invokes failureCause.compareAndSet and sees non-empty
    reference; the original query failure cause is lost

Fixes #20551
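
The interleaving above can be illustrated with a minimal, self-contained sketch. It is not Trino's `QueryStateMachine`: the class `FailureCauseSketch`, its fields, and the listener list are hypothetical stand-ins, and the ordering shown (record the cause before any cleanup listener runs) is only one plausible way to keep the first exception, assumed here for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for a query state machine (not Trino code).
class FailureCauseSketch
{
    private final AtomicReference<Throwable> failureCause = new AtomicReference<>();
    private final AtomicBoolean done = new AtomicBoolean();
    private final List<Runnable> cleanupListeners = new CopyOnWriteArrayList<>();

    void addCleanupListener(Runnable listener)
    {
        cleanupListeners.add(listener);
    }

    boolean transitionToFailed(Throwable throwable)
    {
        // Record the cause first: only the first exception to arrive is kept.
        // Assumption: failing is the only terminal transition in this sketch.
        failureCause.compareAndSet(null, throwable);

        // Only the caller that flips the state runs the cleanup.
        if (!done.compareAndSet(false, true)) {
            return false;
        }

        // Cleanup may provoke secondary failures (e.g. an IllegalStateException
        // in another thread); by now the original cause is already stored.
        cleanupListeners.forEach(Runnable::run);
        return true;
    }

    Optional<Throwable> getFailureCause()
    {
        return Optional.ofNullable(failureCause.get());
    }
}
```

If a cleanup listener itself calls `transitionToFailed` with a secondary exception, which imitates Thread B reacting to the unregistration, `getFailureCause()` in this sketch still returns the exception passed by the first caller.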

Prep for adding more parameterization
@findepi findepi force-pushed the findepi/rgroups branch 2 times, most recently from e523ac9 to 9a9bff2 Compare February 5, 2024 16:39
@findepi
Member Author

findepi commented Feb 5, 2024

Added a unit test.
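
For readers unfamiliar with this style of test, here is a rough sketch of how two competing transitions can be raced against each other. It is not the test added in this PR: `RaceTestSketch`, the `State` enum, and `countViolations` are hypothetical names, and a plain `AtomicReference` stands in for the real state machine.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

import static java.util.concurrent.TimeUnit.SECONDS;

// Hypothetical race harness (not the test from this PR).
class RaceTestSketch
{
    enum State { RUNNING, CANCELED, FAILED }

    // Counts rounds in which the number of winning transitions was not exactly one.
    static int countViolations(int rounds)
    {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            int violations = 0;
            for (int i = 0; i < rounds; i++) {
                AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
                // The barrier releases both threads together to provoke the race.
                CyclicBarrier barrier = new CyclicBarrier(2);
                Future<Boolean> cancellingThread = executor.submit(() -> {
                    barrier.await(10, SECONDS);
                    return state.compareAndSet(State.RUNNING, State.CANCELED);
                });
                Future<Boolean> anotherThread = executor.submit(() -> {
                    barrier.await(10, SECONDS);
                    return state.compareAndSet(State.RUNNING, State.FAILED);
                });
                boolean canceled = cancellingThread.get(10, SECONDS);
                boolean failed = anotherThread.get(10, SECONDS);
                // Exactly one of the two transitions should win in every round.
                if (canceled == failed) {
                    violations++;
                }
            }
            return violations;
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            executor.shutdownNow();
        }
    }
}
```

Repeating the round many times matters because a single run rarely hits the problematic interleaving; the bounded `get(10, SECONDS)` calls keep a deadlock from hanging the test.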

cancellingThread.get(10, SECONDS);
anotherThread.get(10, SECONDS);

// TODO queryStateMachine.getFinalQueryInfo() does not exist for cancelled queries, but may be created by anotherThread due to a race
Member Author

For some reason, transitionToFailed and transitionToCanceled have different code paths, and only the former sets finalQueryInfo. That's probably an oversight in 8a89c29?
cc @JamesRTaylor

Member Author

fixed in another commit

Member

I wonder if #20231 is related

Member Author

Good question. I don't know.

@@ -1123,12 +1133,16 @@ public boolean transitionToFailed(Throwable throwable)

QueryState oldState = queryState.trySet(FAILED);
if (oldState.isDone()) {
QUERY_STATE_LOG.debug(throwable, "Failure after query %s finished", queryId);
if (log) {
Member

nit: separate commit

Member Author

is it not a separate commit already?

boolean canceled = queryState.setIf(FAILED, currentState -> !currentState.isDone());
if (canceled) {
session.getTransactionId().flatMap(transactionManager::getTransactionInfoIfExist).ifPresent(transaction -> {
Member

Where is it done now?

Member Author

in the common method called from here

Member Author

Unless I missed something. Please check.

@sopel39
Member

sopel39 commented Feb 6, 2024

mind automation


Before commit 8a89c29, the only
difference between `transitionToCanceled` and `transitionToFailed`
seemed to be logging the failure through `QUERY_STATE_LOG`.

After that commit, only `transitionToFailed` sets `finalQueryInfo`.
Unify the handling to ensure `finalQueryInfo` is set for
`transitionToCanceled` as well.
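
The unification described in this commit message could look roughly like the sketch below, where both public transitions delegate to one private method that differs only in whether it logs. This is an assumption-laden illustration rather than the merged code: `TerminalTransitionSketch`, `failWith`, the string-valued `finalQueryInfo`, and the cancellation message are all hypothetical.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in (not Trino code).
class TerminalTransitionSketch
{
    enum State
    {
        RUNNING, FAILED;

        boolean isDone()
        {
            return this != RUNNING;
        }
    }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final AtomicReference<Throwable> failureCause = new AtomicReference<>();
    // A String stands in for the real final query info object.
    private final AtomicReference<String> finalQueryInfo = new AtomicReference<>();

    boolean transitionToFailed(Throwable throwable)
    {
        return failWith(throwable, true);
    }

    boolean transitionToCanceled()
    {
        // Cancellation ends in FAILED too; it only skips the failure log.
        return failWith(new RuntimeException("Query was canceled"), false);
    }

    // Common path: both transitions end up setting finalQueryInfo.
    private boolean failWith(Throwable throwable, boolean log)
    {
        failureCause.compareAndSet(null, throwable);
        State oldState = state.getAndUpdate(current -> current.isDone() ? current : State.FAILED);
        if (oldState.isDone()) {
            return false;
        }
        if (log) {
            System.err.println("Query failed: " + throwable);
        }
        finalQueryInfo.compareAndSet(null, "FAILED: " + failureCause.get().getMessage());
        return true;
    }

    Optional<String> getFinalQueryInfo()
    {
        return Optional.ofNullable(finalQueryInfo.get());
    }
}
```

Routing both transitions through a single method means a later change to the terminal bookkeeping cannot be applied to one path and missed on the other, which is the kind of drift the commit message describes.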
@findepi
Member Author

findepi commented Feb 6, 2024

prev CI failed #19799

@findepi findepi merged commit 1357f67 into trinodb:master Feb 6, 2024
93 checks passed
@findepi findepi deleted the findepi/rgroups branch February 6, 2024 12:12
@github-actions github-actions bot added this to the 439 milestone Feb 6, 2024
Development

Successfully merging this pull request may close these issues.

Query may fail with QUERY_QUEUE_FULL when resource group admission fails
3 participants