Ensure operators are always closed #13721

Merged
merged 1 commit into from
Aug 22, 2022

arhimondr
Contributor

Description

Due to a race condition in Driver#tryWithLock there was a chance that an
operator might end up not being properly closed upon completion.

  • Driver#process executes an operator under the Driver#exclusiveLock
  • An operator throws an exception and triggers a task failure
  • A TaskStateMachine listener closes all drivers calling Driver#close
  • Driver#close is not able to acquire the Driver#exclusiveLock and
    assumes the driver will be terminated by the lock owner
  • The lock owner throws an exception and never runs post
    execution code in Driver#tryWithLock that was expected to close
    the operators
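
The sequence above can be reproduced in a stripped-down model. The following is only a sketch, not the actual Trino code: the class DriverSketch, the beforeUnlock hook, the two tryWithLock variants, and the ALIVE and DESTROYED state names are all hypothetical, chosen to mirror the steps in the list. The hook runs a "close" call on a second thread at exactly the point where the lock owner has checked for destruction but has not yet released the lock.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical, simplified model of the race; not the real Driver class.
class DriverSketch {
    enum State { ALIVE, NEED_DESTRUCTION, DESTROYED }

    private final ReentrantLock exclusiveLock = new ReentrantLock();
    final AtomicReference<State> state = new AtomicReference<>(State.ALIVE);
    // Illustration-only hook: runs after destroyIfNecessary() but before unlock().
    Runnable beforeUnlock = () -> {};

    void close() {
        state.compareAndSet(State.ALIVE, State.NEED_DESTRUCTION);
        if (exclusiveLock.tryLock()) {
            try { destroyIfNecessary(); }
            finally { exclusiveLock.unlock(); }
        }
        // Otherwise: assume the current lock owner will destroy the driver.
    }

    private void destroyIfNecessary() {
        state.compareAndSet(State.NEED_DESTRUCTION, State.DESTROYED);
    }

    private void releaseLock() {
        try {
            destroyIfNecessary();
            beforeUnlock.run();
        }
        finally { exclusiveLock.unlock(); }
    }

    // Post-execution code: pick up a destruction request that arrived while locked.
    private void destroyIfRequestedWhileLocked() {
        while (state.get() == State.NEED_DESTRUCTION && exclusiveLock.tryLock()) {
            try { destroyIfNecessary(); }
            finally { exclusiveLock.unlock(); }
        }
    }

    // Racy shape: a throwing task skips the post-execution code entirely.
    <T> Optional<T> tryWithLockRacy(Supplier<T> task) {
        if (!exclusiveLock.tryLock()) return Optional.empty();
        Optional<T> result;
        try { result = Optional.of(task.get()); }
        finally { releaseLock(); }
        destroyIfRequestedWhileLocked();
        return result;
    }

    // Fixed shape: remember the failure, always run the post-execution code, then rethrow.
    <T> Optional<T> tryWithLockFixed(Supplier<T> task) {
        if (!exclusiveLock.tryLock()) return Optional.empty();
        T result = null;
        RuntimeException failure = null;
        try { result = task.get(); }
        catch (RuntimeException e) { failure = e; }
        finally { releaseLock(); }
        destroyIfRequestedWhileLocked();
        if (failure != null) throw failure;
        return Optional.of(result);
    }

    // Runs a failing task while another thread calls close() just before unlock.
    static State simulate(boolean fixed) {
        DriverSketch driver = new DriverSketch();
        driver.beforeUnlock = () -> {
            Thread closer = new Thread(driver::close);
            closer.start();
            try { closer.join(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        };
        Supplier<Void> failingTask = () -> { throw new IllegalStateException("operator failed"); };
        try {
            if (fixed) driver.tryWithLockFixed(failingTask);
            else driver.tryWithLockRacy(failingTask);
        }
        catch (IllegalStateException expected) {
            // The task failure still propagates in both variants.
        }
        return driver.state.get();
    }

    public static void main(String[] args) {
        System.out.println("racy:  " + simulate(false));
        System.out.println("fixed: " + simulate(true));
    }
}
```

In this model the racy variant leaves the state at NEED_DESTRUCTION, while the fixed variant reaches DESTROYED, because the post-execution code runs even when the task throws.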

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Core engine

How would you describe this change to a non-technical end user or system administrator?

N/A

Related issues, pull requests, and links

#11275

Documentation

(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(X) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@losipiuk
Member

I do not understand why in the previous code destroyIfNecessary could not have been called. Can you elaborate?

Member

@sopel39 sopel39 left a comment

How did you find the issue? Is it a regression? Would it be possible to write a test?

@arhimondr
Contributor Author

@losipiuk

Yeah, it's a tricky one. I tried my best to explain it with words in the description, but I can see how it can still be difficult to understand. Let me include some pictures to better display the race.

Let's assume the thread executing a task that throws has already run destroyIfNecessary, but hasn't yet released the lock:

image

At that very moment, some other thread calls Driver#close:

image
image

But the first thread throws, so it won't execute the post-execution code:

image

As a result, the Driver has been transitioned to NEED_DESTRUCTION but is never actually destroyed.

@arhimondr
Contributor Author

@sopel39 The issue was discovered while debugging flakiness from #11275.

I was trying to write a test, but it became very cumbersome very fast, as it requires rather tricky interactions.

To write a test I would need to stop one thread after it executes destroyIfNecessary in the finally section, right before it releases the lock, and run close from a different thread at the same time. It would likely require modifying the tryWithLock code to introduce a test-only dependency (such as a future or a latch) that stops a thread before it releases the lock in the finally section.
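
For illustration only, here is roughly what such a latch-based handshake could look like in isolation. This is a sketch under assumptions, not a proposed test for Driver: the class LatchedUnlockSketch and the latch names are hypothetical, and a bare ReentrantLock stands in for Driver#exclusiveLock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of parking a lock owner inside its finally block.
class LatchedUnlockSketch {
    // Returns whether a second thread could take the lock while the owner
    // was paused right before unlock().
    static boolean lockAvailableWhileOwnerInFinally() {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch reachedFinally = new CountDownLatch(1); // owner -> test thread
        CountDownLatch mayUnlock = new CountDownLatch(1);      // test thread -> owner

        Thread owner = new Thread(() -> {
            lock.lock();
            try {
                // The failing task would run here.
            }
            finally {
                try {
                    reachedFinally.countDown();
                    mayUnlock.await(); // test-only pause right before releasing the lock
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                finally {
                    lock.unlock();
                }
            }
        });
        owner.start();

        try {
            reachedFinally.await();
            boolean acquired = lock.tryLock(); // what a concurrent close would attempt
            if (acquired) {
                lock.unlock();
            }
            mayUnlock.countDown();
            owner.join();
            return acquired;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while coordinating threads", e);
        }
    }
}
```

The two latches make the interleaving deterministic, but they also show the cost mentioned above: the pause point has to live inside the production finally block.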

Member

@losipiuk losipiuk left a comment

Thanks @arhimondr for the images. This is super clear now. And indeed the fix looks valid.

@arhimondr arhimondr force-pushed the fix-operator-close-race branch from f3a62ff to 4401c1f Compare August 19, 2022 15:21
throw new RuntimeException(failure);
}

verify(result != null, "result is null");
Member

Not necessary. The Optional.of() construction will fail if it's null.

Contributor Author

I added this more as documentation, so the intention is clearly stated. Otherwise, when some task starts returning null, somebody may interpret the Optional.of as a mistake and switch it to Optional.ofNullable instead of fixing the actual problem.
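
To make the difference concrete, here is a small dependency-free sketch. The class OptionalNullSketch and its wrap method are hypothetical, and the local verify helper is only a stand-in for Guava's Verify.verify (it throws IllegalStateException here, whereas Guava throws VerifyException).

```java
import java.util.Optional;

// Hypothetical sketch comparing an explicit check with Optional.of / Optional.ofNullable.
class OptionalNullSketch {
    // Stand-in for Guava's Verify.verify, to keep the example dependency-free.
    static void verify(boolean condition, String message) {
        if (!condition) {
            throw new IllegalStateException(message);
        }
    }

    // States the intent explicitly: a null result is a bug in the task.
    static <T> Optional<T> wrap(T result) {
        verify(result != null, "result is null");
        return Optional.of(result);
    }

    public static void main(String[] args) {
        try {
            wrap(null);
        }
        catch (IllegalStateException e) {
            System.out.println("explicit check: " + e.getMessage());
        }
        try {
            Optional.of((Object) null); // also fails, but with a NullPointerException
        }
        catch (NullPointerException e) {
            System.out.println("Optional.of: NullPointerException");
        }
        // Silently turns the bug into an empty Optional.
        System.out.println("Optional.ofNullable present: " + Optional.ofNullable(null).isPresent());
    }
}
```

Both Optional.of and the explicit check reject null, but only the explicit check says why; Optional.ofNullable would hide the problem as an empty result.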

@arhimondr arhimondr force-pushed the fix-operator-close-race branch from 4401c1f to 20037c9 Compare August 19, 2022 17:08
@arhimondr arhimondr merged commit e64998e into trinodb:master Aug 22, 2022
@arhimondr arhimondr deleted the fix-operator-close-race branch August 22, 2022 13:26
@github-actions github-actions bot added this to the 394 milestone Aug 22, 2022