
Close translog writer if exception on write channel #29401

Merged

Conversation

@jasontedor (Member) commented Apr 5, 2018

Today we close the translog writer tragically if we experience any I/O exception on a write. These tragic closes lead to us closing the translog and failing the engine. Yet, there is one case that is missed: when we touch the write channel during a read (checking whether reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well, which will cascade up into us finally failing the engine.

Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions take a snapshot of the translog, which syncs the translog to disk. If an I/O exception occurs during the sync, we already close the writer tragically, and once we have synced we never read past the position that was synced while taking the snapshot.

Closes #29390
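The failure mode described above can be sketched with a minimal, hypothetical writer (not the actual Elasticsearch classes; all names here are illustrative): an I/O exception observed on the write channel, even while servicing a read-path bounds check, should mark the writer as tragically closed.

```java
import java.io.IOException;

// A minimal sketch (not the actual Elasticsearch classes) of the behavior this
// change introduces: an I/O exception seen on the write channel, even while
// servicing a read, marks the writer as tragically closed.
class SketchTranslogWriter {
    private Exception tragedy; // the first fatal exception, if any

    // Simulates a read that must consult the write channel to check whether
    // the requested position has been flushed; channelFails stands in for an
    // I/O failure while touching that channel.
    byte[] read(final long position, final boolean channelFails) throws IOException {
        try {
            if (channelFails) {
                throw new IOException("simulated failure on the write channel");
            }
            return new byte[0]; // pretend the read succeeded
        } catch (final IOException e) {
            closeWithTragicEvent(e); // the fix: tragic close on the read path too
            throw e;
        }
    }

    void closeWithTragicEvent(final Exception e) {
        if (tragedy == null) {
            tragedy = e; // remember the first tragedy
        } else if (tragedy != e) {
            tragedy.addSuppressed(e); // later failures ride along as suppressed
        }
    }

    boolean isTragicallyClosed() {
        return tragedy != null;
    }
}
```

Before this change, the read path was the one caller of the write channel that skipped the `closeWithTragicEvent` step, leaving the writer open after a fatal I/O error.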

@jasontedor added labels: review, v7.0.0, v6.3.0, :Distributed Indexing/Engine (anything around managing Lucene and the translog in an open shard) — Apr 5, 2018
@jasontedor jasontedor requested review from s1monw, ywelsch and dnhatn April 5, 2018 17:53
@elasticmachine (Collaborator) commented:
Pinging @elastic/es-distributed

@jasontedor (Member, Author) commented:

This addresses the test failure in #29390 precisely because the test asserts that the translog is closed after a tragic event occurs on the writer; due to the missed handling of an I/O exception on the write channel in the read method, the translog was not being closed by the random readOperation that was added to TranslogThread#run.

@dnhatn (Member) left a comment:
LGTM.

@ywelsch (Contributor) left a comment:

I've left one ask (and one optional ask)

return current.read(location);
} catch (final IOException e) {
closeOnTragicEvent(e);
throw e;
@ywelsch (Contributor) commented on this diff:

The other calls to closeOnTragicEvent are all of the form

try {
    closeOnTragicEvent(ex);
} catch (final Exception inner) {
    ex.addSuppressed(inner);
}
throw ex;

which, incidentally, does not make sense when you look at closeOnTragicEvent, which already takes care of exceptions that happen during closing. Can you unify the code in this class?
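A unified version along the lines ywelsch suggests could look like the following hypothetical helper (the class, interface, and method names here are illustrative sketches, not the refactoring that actually landed in Elasticsearch):

```java
// Hypothetical helper unifying the tragic-close call sites: close in response
// to the original exception, attach any close-time failure as suppressed, and
// hand the original back so the caller can rethrow it as the primary cause.
final class TragicCloseHelper {
    private TragicCloseHelper() {}

    // Functional stand-in for Translog#closeOnTragicEvent / writer close.
    interface TragicCloser {
        void closeOnTragicEvent(Exception e) throws Exception;
    }

    static <E extends Exception> E withTragicClose(final E original, final TragicCloser closer) {
        try {
            closer.closeOnTragicEvent(original);
        } catch (final Exception inner) {
            original.addSuppressed(inner); // never mask the original failure
        }
        return original;
    }
}
```

A call site would then read `throw TragicCloseHelper.withTragicClose(e, this::closeOnTragicEvent);` instead of repeating the try/catch block at each of the call sites.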

@jasontedor (Member, Author) replied:

Okay, I would like to do this in an immediate follow-up to this one. For now I pushed: f9a88eb

}
}
} catch (final IOException e) {
closeWithTragicEvent(e);
throw e;
@ywelsch (Contributor) commented on this diff:

Here, I think you should adopt the pattern we use throughout the rest of the class, i.e.,

try {
    closeWithTragicEvent(e);
} catch (final Exception inner) {
    e.addSuppressed(inner);
}
throw e;

so that we get the original cause.
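Why this pattern preserves the original cause can be shown in isolation (a standalone illustration with made-up messages, not code from this PR): if the close itself throws and that exception propagates, the original I/O failure is lost; attaching it with addSuppressed keeps the original as the primary exception.

```java
import java.io.IOException;

// Standalone illustration: if closing fails and we let that failure propagate,
// the original I/O exception is lost; attaching the close failure with
// addSuppressed keeps the original as the primary cause.
class SuppressedCauseDemo {
    static IOException simulateWriteFailure() {
        final IOException e = new IOException("original write failure");
        try {
            // stand-in for closeWithTragicEvent(e) itself throwing
            throw new IllegalStateException("failure while closing");
        } catch (final Exception inner) {
            e.addSuppressed(inner);
        }
        return e; // the caller rethrows this; the original message survives
    }
}
```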

@jasontedor (Member, Author) replied:

Bah! Thrown off by the fact that the method this is in declares throws IOException. I will address.

@jasontedor (Member, Author) replied:

I pushed 8ccaaf3.

@jasontedor (Member, Author) commented:

@ywelsch I pushed. I want to refactor the exception handling for TranslogWriter#closeWithTragicEvent and Translog#closeOnTragicEvent immediately after this PR. Can you take another look?

@ywelsch (Contributor) left a comment:

LGTM

@jasontedor jasontedor merged commit cb3295b into elastic:master Apr 6, 2018
@jasontedor jasontedor deleted the translog-writer-tragic-event-on-read branch April 6, 2018 14:37
jasontedor added a commit that referenced this pull request on Apr 6, 2018 (with the same commit message as the PR description).
@jpountz jpountz added the >bug label Jun 13, 2018
Labels: >bug, :Distributed Indexing/Engine, v6.3.0, v7.0.0-beta1

Successfully merging this pull request may close these issues.

[CI] TranslogTests#testFatalIOExceptionsWhileWritingConcurrently failure
6 participants