Enhance BufferedChecksumIndexInput error when a state file is empty #29358

gmoskovicz · 2018-04-03T19:58:30Z

Elasticsearch version (bin/elasticsearch --version): Any version

Plugins installed: [none]

JVM version (java -version): 1.8

OS version (uname -a if on a Unix-like system): Any

Provide logs (if relevant):

If a state file is empty after some issue (hardware issue, NFS issue and others) a node will fail to start and we will see something like the following in the logs:

[2018-04-03T18:12:25,227][ERROR][o.e.g.GatewayMetaState   ] [xxxx] failed to read local state, exiting...
org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:20, legacy:false, file:/xxx/elasticsearch/xxx/xxx/nodes/0/indices/indexname/_state/state-20.st]
	at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:150) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:334) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.util.IndexFolderUpgrader.upgrade(IndexFolderUpgrader.java:90) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.util.IndexFolderUpgrader.upgradeIndicesIfNeeded(IndexFolderUpgrader.java:128) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:87) [elasticsearch-5.4.3.jar:5.4.3]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [?:1.8.0_161]
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [?:1.8.0_161]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423) [?:1.8.0_161]
	at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:49) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:191) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:183) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:818) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:183) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:176) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:161) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:96) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:43) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.node.Node.<init>(Node.java:491) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.node.Node.<init>(Node.java:242) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:232) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:232) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) [elasticsearch-5.4.3.jar:5.4.3]
Caused by: java.io.IOException: failed to read [id:20, legacy:false, file:/xxx/elasticsearch/xxx/xxx/nodes/0/indices/indexname/_state/state-20.st]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:327) ~[elasticsearch-5.4.3.jar:5.4.3]
	... 37 more
Caused by: java.lang.IllegalStateException: class org.apache.lucene.store.BufferedChecksumIndexInput cannot seek backwards (pos=-16 getFilePointer()=0)
	at org.apache.lucene.store.ChecksumIndexInput.seek(ChecksumIndexInput.java:50) ~[lucene-core-6.5.1.jar:6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - jimczi - 2017-04-21 12:17:15]
	at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:519) ~[lucene-core-6.5.1.jar:6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - jimczi - 2017-04-21 12:17:15]
	at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:189) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:322) ~[elasticsearch-5.4.3.jar:5.4.3]
	... 37 more

Discussing it with @jpountz , (pos=-16 getFilePointer()=0) seems to indicate that the file is empty, but so far we return a IOException without explaining that a state file is empty.

So my question here is:

Does it make sense to add some extra content to the log to explicitly mention that the file is empty?
Perhaps we can mention in the logs that when this happens the shards get corrupted and you need to recreate it (or remove it)?
Do we see any other option than [2]? For example starting the node and have an option to disregard that shard (possibly delete it?- yes doesn't sounds like a very good idea, but worth discussing it).

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-04-04T06:14:26Z

Pinging @elastic/es-core-infra

jpountz · 2018-04-04T08:15:36Z

I wonder that this might have improved thanks to LUCENE-7831, which is in Elasticsearch 5.6+ and 6.0+. This issue improved the error message (it would contain misplaced codec footer (file truncated?): length=0 but footerLength==16 now) but more importantly, made it a CorruptIndexException rather than an IllegalStateException. I believe this would make Elasticsearch fail the shard, wouldn't it?

ywelsch · 2018-04-04T11:22:53Z

@jpountz This is on startup where we read the index metadata (to possibly upgrade the index metadata). While a CorruptIndexException will better convey the issue, the overall behavior will still be the same (the node failing to start up). This is a good thing IMO, as it makes the user aware of file-system corruption. Ignoring this (while logging a warning) would be dangerous, as it could mean that the index just magically disappears from the cluster after a restart.

Does it make sense to add some extra content to the log to explicitly mention that the file is empty?

How will this help? Are we taking different actions based on whether the file is empty or corrupted otherwise?

Perhaps we can mention in the logs that when this happens the shards get corrupted and you need to recreate it (or remove it)?

In the above message it was not the shard that got corrupted, but the index metadata. If that index metadata is not available on any other node, the whole index will effectively be gone. I'm not sure what kind of generic recommendation we could give there. It's probably more on a case-by-case basis (restore from snapshot, reindex from primary source, ...).

Do we see any other option than [2]? For example starting the node and have an option to disregard that shard (possibly delete it?- yes doesn't sounds like a very good idea, but worth discussing it).

File system corruptions are serious, and the system should not try to automatically compensate for them, as this will hurt users even more further down the line.

jpountz · 2018-04-04T11:53:50Z

@ywelsch Thanks for clarifying! I see we are already doing the right thing at write time by writing to a temp file before renaming, so if I read you correctly there is nothing more to do since the error message was already improved via LUCENE-7831?

ywelsch · 2018-04-04T11:58:01Z

@jpountz yes, maybe worth adding a test case in ES?

jpountz · 2018-04-04T12:39:16Z

OK, making it a test adoptme.

jasontedor · 2018-04-04T12:51:04Z

@ywelsch Sorry, what is the test? That the exception message is that which comes from upstream? And the test breaks if upstream changes the message and we merely change the assertion? I am not following what the proposal is here.

ywelsch · 2018-04-04T13:02:36Z

@jasontedor That MetaDataStateFormat.read() throws a CorruptStateException if the file is truncated in any way (in particular as well when it is 0 bytes). We test this already in MetaDataStateFormatTests.testLoadState but the method that does the corruption (corruptFile) does not seem to generate a 0 length file as corruption possibility.

DaveCTurner · 2018-10-24T22:28:12Z

FAO @andrershov

elasticmachine · 2020-12-14T20:58:23Z

Pinging @elastic/es-distributed (Team:Distributed)

The metadata index is small and important and only read at startup. Today we rely on Lucene to spot if any of its components is corrupt, but Lucene does not necesssarily verify all checksums in order to catch a corruption. With this commit we run `CheckIndex` on the metadata index first, and fail on startup if a corruption is detected. Closes elastic#29358

DaveCTurner · 2021-05-19T13:37:29Z

We now keep metadata on disk in a Lucene index rather than in files whose checksum we verify ourselves. This mostly gives us more useful error messages on corruptions than the one in the OP, but I believe it isn't completely watertight since Lucene doesn't guarantee to verify every checksum since that would be desperately inefficient on large indices. However it should be fine on small things like the metadata index so I opened #73239 to explicitly check the checksums on this index first, which I think clears up this issue. At least, we now just expose the messages that Lucene reports so any remaining lack of clarity would need to be addressed in Lucene rather than in Elasticsearch.

The metadata index is small and important and only read at startup. Today we rely on Lucene to spot if any of its components is corrupt, but Lucene does not necesssarily verify all checksums in order to catch a corruption. With this commit we run `CheckIndex` on the metadata index first, and fail on startup if a corruption is detected. Closes #29358

The metadata index is small and important and only read at startup. Today we rely on Lucene to spot if any of its components is corrupt, but Lucene does not necesssarily verify all checksums in order to catch a corruption. With this commit we run `CheckIndex` on the metadata index first, and fail on startup if a corruption is detected. Closes elastic#29358

The metadata index is small and important and only read at startup. Today we rely on Lucene to spot if any of its components is corrupt, but Lucene does not necesssarily verify all checksums in order to catch a corruption. With this commit we run `CheckIndex` on the metadata index first, and fail on startup if a corruption is detected. Closes #29358

danielmitterdorfer added >enhancement :Core/Infra/Core Core issues without another label labels Apr 4, 2018

jpountz added >test Issues or PRs that are addressing/adding tests help wanted adoptme and removed >enhancement labels Apr 4, 2018

rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020

rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020

jimczi removed the needs:triage Requires assignment of a team area label label Jan 12, 2021

DaveCTurner mentioned this issue May 19, 2021

Run CheckIndex on metadata index before loading #73239

Merged

DaveCTurner closed this as completed in #73239 Jun 16, 2021

DaveCTurner mentioned this issue Jun 16, 2021

Run CheckIndex on metadata index before loading (#73239) #74173

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance BufferedChecksumIndexInput error when a state file is empty #29358

Enhance BufferedChecksumIndexInput error when a state file is empty #29358

gmoskovicz commented Apr 3, 2018

elasticmachine commented Apr 4, 2018

jpountz commented Apr 4, 2018 •

edited

Loading

ywelsch commented Apr 4, 2018

jpountz commented Apr 4, 2018

ywelsch commented Apr 4, 2018

jpountz commented Apr 4, 2018

jasontedor commented Apr 4, 2018

ywelsch commented Apr 4, 2018

DaveCTurner commented Oct 24, 2018

elasticmachine commented Dec 14, 2020

DaveCTurner commented May 19, 2021

Enhance BufferedChecksumIndexInput error when a state file is empty #29358

Enhance BufferedChecksumIndexInput error when a state file is empty #29358

Comments

gmoskovicz commented Apr 3, 2018

elasticmachine commented Apr 4, 2018

jpountz commented Apr 4, 2018 • edited Loading

ywelsch commented Apr 4, 2018

jpountz commented Apr 4, 2018

ywelsch commented Apr 4, 2018

jpountz commented Apr 4, 2018

jasontedor commented Apr 4, 2018

ywelsch commented Apr 4, 2018

DaveCTurner commented Oct 24, 2018

elasticmachine commented Dec 14, 2020

DaveCTurner commented May 19, 2021

jpountz commented Apr 4, 2018 •

edited

Loading