Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance BufferedChecksumIndexInput error when a state file is empty #29358

Closed
gmoskovicz opened this issue Apr 3, 2018 · 11 comments · Fixed by #73239
Closed

Enhance BufferedChecksumIndexInput error when a state file is empty #29358

gmoskovicz opened this issue Apr 3, 2018 · 11 comments · Fixed by #73239
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. help wanted adoptme Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests

Comments

@gmoskovicz
Copy link
Contributor

Elasticsearch version (bin/elasticsearch --version): Any version

Plugins installed: [none]

JVM version (java -version): 1.8

OS version (uname -a if on a Unix-like system): Any

Provide logs (if relevant):

If a state file is empty after some issue (hardware issue, NFS issue and others) a node will fail to start and we will see something like the following in the logs:

[2018-04-03T18:12:25,227][ERROR][o.e.g.GatewayMetaState   ] [xxxx] failed to read local state, exiting...
org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:20, legacy:false, file:/xxx/elasticsearch/xxx/xxx/nodes/0/indices/indexname/_state/state-20.st]
	at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:150) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:334) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.util.IndexFolderUpgrader.upgrade(IndexFolderUpgrader.java:90) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.util.IndexFolderUpgrader.upgradeIndicesIfNeeded(IndexFolderUpgrader.java:128) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:87) [elasticsearch-5.4.3.jar:5.4.3]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) [?:1.8.0_161]
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) [?:1.8.0_161]
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423) [?:1.8.0_161]
	at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:49) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:191) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:183) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:818) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:183) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:176) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:161) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:96) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:43) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.node.Node.<init>(Node.java:491) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.node.Node.<init>(Node.java:242) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:232) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:232) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.cli.Command.main(Command.java:88) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) [elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) [elasticsearch-5.4.3.jar:5.4.3]
Caused by: java.io.IOException: failed to read [id:20, legacy:false, file:/xxx/elasticsearch/xxx/xxx/nodes/0/indices/indexname/_state/state-20.st]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:327) ~[elasticsearch-5.4.3.jar:5.4.3]
	... 37 more
Caused by: java.lang.IllegalStateException: class org.apache.lucene.store.BufferedChecksumIndexInput cannot seek backwards (pos=-16 getFilePointer()=0)
	at org.apache.lucene.store.ChecksumIndexInput.seek(ChecksumIndexInput.java:50) ~[lucene-core-6.5.1.jar:6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - jimczi - 2017-04-21 12:17:15]
	at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:519) ~[lucene-core-6.5.1.jar:6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - jimczi - 2017-04-21 12:17:15]
	at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:189) ~[elasticsearch-5.4.3.jar:5.4.3]
	at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:322) ~[elasticsearch-5.4.3.jar:5.4.3]
	... 37 more

Discussing it with @jpountz , (pos=-16 getFilePointer()=0) seems to indicate that the file is empty, but so far we return a IOException without explaining that a state file is empty.

So my question here is:

  1. Does it make sense to add some extra content to the log to explicitly mention that the file is empty?
  2. Perhaps we can mention in the logs that when this happens the shards get corrupted and you need to recreate it (or remove it)?
  3. Do we see any other option than [2]? For example starting the node and have an option to disregard that shard (possibly delete it?- yes doesn't sounds like a very good idea, but worth discussing it).
@danielmitterdorfer danielmitterdorfer added >enhancement :Core/Infra/Core Core issues without another label labels Apr 4, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@jpountz
Copy link
Contributor

jpountz commented Apr 4, 2018

I wonder that this might have improved thanks to LUCENE-7831, which is in Elasticsearch 5.6+ and 6.0+. This issue improved the error message (it would contain misplaced codec footer (file truncated?): length=0 but footerLength==16 now) but more importantly, made it a CorruptIndexException rather than an IllegalStateException. I believe this would make Elasticsearch fail the shard, wouldn't it?

@ywelsch
Copy link
Contributor

ywelsch commented Apr 4, 2018

@jpountz This is on startup where we read the index metadata (to possibly upgrade the index metadata). While a CorruptIndexException will better convey the issue, the overall behavior will still be the same (the node failing to start up). This is a good thing IMO, as it makes the user aware of file-system corruption. Ignoring this (while logging a warning) would be dangerous, as it could mean that the index just magically disappears from the cluster after a restart.

Does it make sense to add some extra content to the log to explicitly mention that the file is empty?

How will this help? Are we taking different actions based on whether the file is empty or corrupted otherwise?

Perhaps we can mention in the logs that when this happens the shards get corrupted and you need to recreate it (or remove it)?

In the above message it was not the shard that got corrupted, but the index metadata. If that index metadata is not available on any other node, the whole index will effectively be gone. I'm not sure what kind of generic recommendation we could give there. It's probably more on a case-by-case basis (restore from snapshot, reindex from primary source, ...).

Do we see any other option than [2]? For example starting the node and have an option to disregard that shard (possibly delete it?- yes doesn't sounds like a very good idea, but worth discussing it).

File system corruptions are serious, and the system should not try to automatically compensate for them, as this will hurt users even more further down the line.

@jpountz
Copy link
Contributor

jpountz commented Apr 4, 2018

@ywelsch Thanks for clarifying! I see we are already doing the right thing at write time by writing to a temp file before renaming, so if I read you correctly there is nothing more to do since the error message was already improved via LUCENE-7831?

@ywelsch
Copy link
Contributor

ywelsch commented Apr 4, 2018

@jpountz yes, maybe worth adding a test case in ES?

@jpountz
Copy link
Contributor

jpountz commented Apr 4, 2018

OK, making it a test adoptme.

@jpountz jpountz added >test Issues or PRs that are addressing/adding tests help wanted adoptme and removed >enhancement labels Apr 4, 2018
@jasontedor
Copy link
Member

@ywelsch Sorry, what is the test? That the exception message is that which comes from upstream? And the test breaks if upstream changes the message and we merely change the assertion? I am not following what the proposal is here.

@ywelsch
Copy link
Contributor

ywelsch commented Apr 4, 2018

@jasontedor That MetaDataStateFormat.read() throws a CorruptStateException if the file is truncated in any way (in particular as well when it is 0 bytes). We test this already in MetaDataStateFormatTests.testLoadState but the method that does the corruption (corruptFile) does not seem to generate a 0 length file as corruption possibility.

@DaveCTurner
Copy link
Contributor

FAO @andrershov

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@jaymode jaymode added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. and removed :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team labels Dec 14, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@jimczi jimczi removed the needs:triage Requires assignment of a team area label label Jan 12, 2021
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 19, 2021
The metadata index is small and important and only read at startup.
Today we rely on Lucene to spot if any of its components is corrupt, but
Lucene does not necesssarily verify all checksums in order to catch a
corruption. With this commit we run `CheckIndex` on the metadata index
first, and fail on startup if a corruption is detected.

Closes elastic#29358
@DaveCTurner
Copy link
Contributor

We now keep metadata on disk in a Lucene index rather than in files whose checksum we verify ourselves. This mostly gives us more useful error messages on corruptions than the one in the OP, but I believe it isn't completely watertight since Lucene doesn't guarantee to verify every checksum since that would be desperately inefficient on large indices. However it should be fine on small things like the metadata index so I opened #73239 to explicitly check the checksums on this index first, which I think clears up this issue. At least, we now just expose the messages that Lucene reports so any remaining lack of clarity would need to be addressed in Lucene rather than in Elasticsearch.

DaveCTurner added a commit that referenced this issue Jun 16, 2021
The metadata index is small and important and only read at startup.
Today we rely on Lucene to spot if any of its components is corrupt, but
Lucene does not necesssarily verify all checksums in order to catch a
corruption. With this commit we run `CheckIndex` on the metadata index
first, and fail on startup if a corruption is detected.

Closes #29358
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 16, 2021
The metadata index is small and important and only read at startup.
Today we rely on Lucene to spot if any of its components is corrupt, but
Lucene does not necesssarily verify all checksums in order to catch a
corruption. With this commit we run `CheckIndex` on the metadata index
first, and fail on startup if a corruption is detected.

Closes elastic#29358
DaveCTurner added a commit that referenced this issue Jun 16, 2021
The metadata index is small and important and only read at startup.
Today we rely on Lucene to spot if any of its components is corrupt, but
Lucene does not necesssarily verify all checksums in order to catch a
corruption. With this commit we run `CheckIndex` on the metadata index
first, and fail on startup if a corruption is detected.

Closes #29358
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. help wanted adoptme Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants