
Close translog writer if exception on write channel #29401

Merged

Conversation

@jasontedor (Member) commented Apr 5, 2018

Today we close the translog writer tragically if we experience any I/O exception on a write. These tragic closes lead to us closing the translog and failing the engine. Yet, there is one case that is missed: when we touch the write channel during a read (checking whether reading from the writer would put us past what has been flushed). This commit addresses this by closing the writer tragically if we encounter an I/O exception on the write channel while reading. This becomes interesting when we consider that this method is invoked from the engine through the translog as part of getting a document from the translog. This means we have to consider closing the translog here as well, which will cascade up into us finally failing the engine.

Note that there is no semantic change to, for example, primary/replica resync and recovery. These actions take a snapshot of the translog, which syncs the translog to disk. If an I/O exception occurs during the sync, we already close the writer tragically, and once we have synced we never read past the position that was synced while taking the snapshot.

Closes #29390
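The failure mode described above can be sketched with a minimal, hypothetical writer (not the actual Elasticsearch classes; all names here are illustrative): an I/O exception observed on the write channel, even while servicing a read-path bounds check, should mark the writer as tragically closed.

```java
import java.io.IOException;

// A minimal sketch (not the actual Elasticsearch classes) of the behavior this
// change introduces: an I/O exception seen on the write channel, even while
// servicing a read, marks the writer as tragically closed.
class SketchTranslogWriter {
    private Exception tragedy; // the first fatal exception, if any

    // Simulates a read that must consult the write channel to check whether
    // the requested position has been flushed; channelFails stands in for an
    // I/O failure while touching that channel.
    byte[] read(final long position, final boolean channelFails) throws IOException {
        try {
            if (channelFails) {
                throw new IOException("simulated failure on the write channel");
            }
            return new byte[0]; // pretend the read succeeded
        } catch (final IOException e) {
            closeWithTragicEvent(e); // the fix: tragic close on the read path too
            throw e;
        }
    }

    void closeWithTragicEvent(final Exception e) {
        if (tragedy == null) {
            tragedy = e; // remember the first tragedy
        } else if (tragedy != e) {
            tragedy.addSuppressed(e); // later failures ride along as suppressed
        }
    }

    boolean isTragicallyClosed() {
        return tragedy != null;
    }
}
```

Before this change, the read path was the one caller of the write channel that skipped the `closeWithTragicEvent` step, leaving the writer open after a fatal I/O error.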

@jasontedor added labels: review, v7.0.0, v6.3.0, :Distributed Indexing/Engine (anything around managing Lucene and the translog in an open shard) — Apr 5, 2018
@jasontedor jasontedor requested review from s1monw, ywelsch and dnhatn April 5, 2018 17:53
@elasticmachine (Collaborator) commented:
Pinging @elastic/es-distributed

@jasontedor (Member, Author) commented:

This addresses the test failure in #29390 precisely because the test asserts that the translog is closed after a tragic event occurs on the writer; due to the missed handling of an I/O exception on the write channel in the read method, the translog was not being closed by the random readOperation that was added to TranslogThread#run.

@dnhatn (Member) left a comment:
LGTM.

@ywelsch (Contributor) left a comment:

I've left one ask (and one optional ask)

return current.read(location);
} catch (final IOException e) {
closeOnTragicEvent(e);
throw e;
@ywelsch (Contributor) commented on this diff:

The other calls to closeOnTragicEvent are all of the form

try {
    closeOnTragicEvent(ex);
} catch (final Exception inner) {
    ex.addSuppressed(inner);
}
throw ex;

which, incidentally, does not make sense when you look at closeOnTragicEvent, which already takes care of exceptions that happen during closing. Can you unify the code in this class?
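A unified version along the lines ywelsch suggests could look like the following hypothetical helper (the class, interface, and method names here are illustrative sketches, not the refactoring that actually landed in Elasticsearch):

```java
// Hypothetical helper unifying the tragic-close call sites: close in response
// to the original exception, attach any close-time failure as suppressed, and
// hand the original back so the caller can rethrow it as the primary cause.
final class TragicCloseHelper {
    private TragicCloseHelper() {}

    // Functional stand-in for Translog#closeOnTragicEvent / writer close.
    interface TragicCloser {
        void closeOnTragicEvent(Exception e) throws Exception;
    }

    static <E extends Exception> E withTragicClose(final E original, final TragicCloser closer) {
        try {
            closer.closeOnTragicEvent(original);
        } catch (final Exception inner) {
            original.addSuppressed(inner); // never mask the original failure
        }
        return original;
    }
}
```

A call site would then read `throw TragicCloseHelper.withTragicClose(e, this::closeOnTragicEvent);` instead of repeating the try/catch block at each of the call sites.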

@jasontedor (Member, Author) replied:

Okay, I would like to do this in an immediate follow-up to this one. For now I pushed: f9a88eb

}
}
} catch (final IOException e) {
closeWithTragicEvent(e);
throw e;
@ywelsch (Contributor) commented on this diff:

Here, I think you should adopt the pattern we use throughout the rest of the class, i.e.,

try {
    closeWithTragicEvent(e);
} catch (final Exception inner) {
    e.addSuppressed(inner);
}
throw e;

so that we get the original cause.
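Why this pattern preserves the original cause can be shown in isolation (a standalone illustration with made-up messages, not code from this PR): if the close itself throws and that exception propagates, the original I/O failure is lost; attaching it with addSuppressed keeps the original as the primary exception.

```java
import java.io.IOException;

// Standalone illustration: if closing fails and we let that failure propagate,
// the original I/O exception is lost; attaching the close failure with
// addSuppressed keeps the original as the primary cause.
class SuppressedCauseDemo {
    static IOException simulateWriteFailure() {
        final IOException e = new IOException("original write failure");
        try {
            // stand-in for closeWithTragicEvent(e) itself throwing
            throw new IllegalStateException("failure while closing");
        } catch (final Exception inner) {
            e.addSuppressed(inner);
        }
        return e; // the caller rethrows this; the original message survives
    }
}
```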

@jasontedor (Member, Author) replied:

Bah! Thrown off by the fact that the method this is in declares throws IOException. I will address.

@jasontedor (Member, Author) replied:

I pushed 8ccaaf3.

@jasontedor (Member, Author) commented:

@ywelsch I pushed. I want to refactor the exception handling for TranslogWriter#closeWithTragicEvent and Translog#closeOnTragicEvent immediately after this PR. Can you take another look?

@ywelsch (Contributor) left a comment:

LGTM

@jasontedor jasontedor merged commit cb3295b into elastic:master Apr 6, 2018
@jasontedor jasontedor deleted the translog-writer-tragic-event-on-read branch April 6, 2018 14:37
jasontedor added a commit that referenced this pull request on Apr 6, 2018 (with the same commit message as the PR description).
@jpountz jpountz added the >bug label Jun 13, 2018
Labels: >bug, :Distributed Indexing/Engine, v6.3.0, v7.0.0-beta1

Successfully merging this pull request may close these issues.

[CI] TranslogTests#testFatalIOExceptionsWhileWritingConcurrently failure
6 participants