-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Close translog writer if exception on write channel #29401
Close translog writer if exception on write channel #29401
Conversation
Today we close the translog write tragically if we experience any I/O exception on a write. These tragic closes lead to use closing the translog and failing the engine. Yet, there is one case that is missed which is when we touch the write channel during a read (checking if reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well which will cascade up into us finally failing the engine. Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions will take a snapshot of the translog which syncs the translog to disk. If an I/O exception occurs during the sync we already close the writer tragically and once we have synced we do not ever read past the position that was synced while taking the snapshot.
Pinging @elastic/es-distributed |
This addresses the test failure in #29390 exactly because the test is asserting that the translog is closed after a tragic event occurred on the writer but because of the missed handling of an I/O exception on the write channel in the read method, the translog will not be closed by the random readOperation that was added to TranslogThread#run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left one ask (and one optional ask)
return current.read(location); | ||
} catch (final IOException e) { | ||
closeOnTragicEvent(e); | ||
throw e; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other calls to closeOnTragicEvent
are all of the form
try {
closeOnTragicEvent(ex);
} catch (final Exception inner) {
ex.addSuppressed(inner);
}
throw ex;
which does not make any sense btw when you look at closeOnTragicEvent
which already takes care of exceptions that happen during closing. Can you unify the code in this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I would like to do this in an immediate follow-up to this one. For now I pushed: f9a88eb
} | ||
} | ||
} catch (final IOException e) { | ||
closeWithTragicEvent(e); | ||
throw e; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I think you should adopt the pattern we use throughout the rest of the class, i.e.,
try {
closeWithTragicEvent(e);
} catch (Exception inner) {
ex.addSuppressed(inner);
}
throw e;
so that we get the original cause.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bah! Thrown off by the fact that the method this is in declares throws IOException
. I will address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed 8ccaaf3.
@ywelsch I pushed. I want to refactor the exception handling for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Today we close the translog write tragically if we experience any I/O exception on a write. These tragic closes lead to use closing the translog and failing the engine. Yet, there is one case that is missed which is when we touch the write channel during a read (checking if reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well which will cascade up into us finally failing the engine. Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions will take a snapshot of the translog which syncs the translog to disk. If an I/O exception occurs during the sync we already close the writer tragically and once we have synced we do not ever read past the position that was synced while taking the snapshot.
Today we close the translog write tragically if we experience any I/O exception on a write. These tragic closes lead to use closing the translog and failing the engine. Yet, there is one case that is missed which is when we touch the write channel during a read (checking if reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well which will cascade up into us finally failing the engine.
Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions will take a snapshot of the translog which syncs the translog to disk. If an I/O exception occurs during the sync we already close the writer tragically and once we have synced we do not ever read past the position that was synced while taking the snapshot.
Closes #29390