Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid self-deadlock in the translog #29520

Merged
merged 3 commits into from
Apr 15, 2018

Conversation

jasontedor
Copy link
Member

@jasontedor jasontedor commented Apr 15, 2018

Today when reading an operation from the current generation fails tragically we attempt to close the translog. However, by invoking close before releasing the read lock we end up in self-deadlock because closing tries to acquire the write lock and the read lock can not be upgraded to a write lock. To avoid this, we move the close invocation outside of the try-with-resources that acquired the read lock. As an extra guard against this, we document the problem and add an assertion that we are not trying to invoke close while holding the read lock.

Closes #29509, relates #29401

Today when reading an operation from the current generation fails
tragically we attempt to close the translog. However, by invoking close
before releasing the read lock we end up in self-deadlock because
closing tries to acquire the write lock and the read lock can not be
upgraded to a write lock. To avoid this, we move the close invocation
outside of the try-with-resources that acquired the read lock. As an
extra guard against this, we document the problem and add an assertion
that we are not trying to invoke close while holding the read lock.
@jasontedor jasontedor added >bug review v7.0.0 v6.3.0 :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Apr 15, 2018
@jasontedor jasontedor requested review from s1monw and bleskes April 15, 2018 17:16
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

private void closeOnTragicEvent(final Exception ex) {
// we can not hold a read lock here because closing will attempt to obtain a write lock and that would result in self-deadlock
assert readLock.isHeldByCurrentThread() == false : Thread.currentThread().getName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @jasontedor. Can you add a note to the PR description that this was introduced with #29401, i.e. affects code that was never released?

private void closeOnTragicEvent(final Exception ex) {
// we can not hold a read lock here because closing will attempt to obtain a write lock and that would result in self-deadlock
assert readLock.isHeldByCurrentThread() == false : Thread.currentThread().getName();
Objects.requireNonNull(ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we never expect a null value to be passed as parameter to this internal method, I'm not a fan of sprinkling this check here. This is defensive programming at its worst.

@jasontedor jasontedor merged commit 00fd73a into elastic:master Apr 15, 2018
jasontedor added a commit that referenced this pull request Apr 15, 2018
Today when reading an operation from the current generation fails
tragically we attempt to close the translog. However, by invoking close
before releasing the read lock we end up in self-deadlock because
closing tries to acquire the write lock and the read lock can not be
upgraded to a write lock. To avoid this, we move the close invocation
outside of the try-with-resources that acquired the read lock. As an
extra guard against this, we document the problem and add an assertion
that we are not trying to invoke close while holding the read lock.
@jasontedor jasontedor deleted the translog-self-deadlock branch April 16, 2018 12:16
jasontedor added a commit that referenced this pull request Apr 16, 2018
* 6.x:
  [TEST] REST client request without leading '/' (#29471)
  Prevent accidental changes of default values (#29528)
  [Docs] Add definitions to glossary  (#29127)
  Avoid self-deadlock in the translog (#29520)
  Minor cleanup in NodeInfo.groovy
  Lazy configure build tasks that require older JDKs (#29519)
  Fix usage of NodeInfo#nodeVersion in build
  Control max size/count of warning headers (#28427) (#29516)
  Simplify snapshot check in root build file
  Make NodeInfo#nodeVersion strongly-typed as Version (#29515)
  Enable license header exclusions (#29379)
  Use proper Java version for BWC builds (#29493)
  Mute TranslogTests#testFatalIOExceptionsWhileWritingConcurrently
jasontedor added a commit that referenced this pull request Apr 16, 2018
* master:
  [TEST] REST client request without leading '/' (#29471)
  Using ObjectParser in UpdateRequest (#29293)
  Prevent accidental changes of default values (#29528)
  [Docs] Add definitions to glossary  (#29127)
  Avoid self-deadlock in the translog (#29520)
  Minor cleanup in NodeInfo.groovy
  Lazy configure build tasks that require older JDKs (#29519)
  Simplify snapshot check in root build file
  Make NodeInfo#nodeVersion strongly-typed as Version (#29515)
  Enable license header exclusions (#29379)
  Use proper Java version for BWC builds (#29493)
  Mute TranslogTests#testFatalIOExceptionsWhileWritingConcurrently
  Enable skipping fetching latest for BWC builds (#29497)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 17, 2018
* master: (96 commits)
  TEST: Mute testEnsureWeReconnect
  Mute full cluster restart test recovery test
  REST high-level client: add support for Indices Update Settings API [take 2] (elastic#29327)
  Plugins: Fix native controller confirmation for non-meta plugin (elastic#29434)
  Remove PipelineExecutionService#executeIndexRequest (elastic#29537)
  Fix overflow error in parsing of long geohashes (elastic#29418)
  Remove unused index.ttl.disable_purge setting (elastic#29527)
  FullClusterRestartIT.testRecovery should wait for all initializing shards
  Build: Fail if any libs depend on non-core libs (elastic#29336)
  [TEST] REST client request without leading '/' (elastic#29471)
  Using ObjectParser in UpdateRequest (elastic#29293)
  Prevent accidental changes of default values (elastic#29528)
  [Docs] Add definitions to glossary  (elastic#29127)
  Avoid self-deadlock in the translog (elastic#29520)
  Minor cleanup in NodeInfo.groovy
  Lazy configure build tasks that require older JDKs (elastic#29519)
  Simplify snapshot check in root build file
  Make NodeInfo#nodeVersion strongly-typed as Version (elastic#29515)
  Enable license header exclusions (elastic#29379)
  Use proper Java version for BWC builds (elastic#29493)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.3.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] TranslogTests#testFatalIOExceptionsWhileWritingConcurrently times out
5 participants