Run CheckIndex on metadata index before loading #73239

DaveCTurner · 2021-05-19T13:29:42Z

The metadata index is small and important and only read at startup.
Today we rely on Lucene to spot if any of its components is corrupt, but
Lucene does not necessarily verify all checksums in order to catch a
corruption. With this commit we run CheckIndex on the metadata index
first, and fail on startup if a corruption is detected.

Closes #29358


elasticmachine · 2021-05-19T13:29:45Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner · 2021-05-19T13:31:20Z

server/src/main/java/org/elasticsearch/gateway/PersistedClusterStateService.java

-                        "] in [" + dataPath + "] but expected [" + nodeId + "]");
+                        if (isClean == false) {
+                            if (logger.isErrorEnabled()) {
+                                outputStream.bytes().utf8ToString().lines().forEach(l -> logger.error("checkIndex: {}", l));


This isn't great: we materialise all the bytes, then convert them to a string, and then split them into lines. A streaming implementation is definitely possible but doesn't seem worth the effort here.
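
For illustration, the streaming alternative mentioned above could look roughly like the sketch below. `LineForwardingOutputStream` is a hypothetical helper, not part of this PR or of Elasticsearch: it buffers bytes only until the next newline and then hands the completed line to a consumer (for example a logger call), so the full output is never held in memory at once.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Hypothetical helper: forwards each completed line to a consumer as soon as it is written. */
final class LineForwardingOutputStream extends OutputStream {
    // Holds only the bytes of the line currently being written.
    private final ByteArrayOutputStream currentLine = new ByteArrayOutputStream();
    private final Consumer<String> lineConsumer;

    LineForwardingOutputStream(Consumer<String> lineConsumer) {
        this.lineConsumer = lineConsumer;
    }

    @Override
    public void write(int b) {
        // Splitting on the raw byte 0x0A is safe for UTF-8: no multi-byte sequence contains it.
        if ((b & 0xFF) == '\n') {
            emit();
        } else {
            currentLine.write(b);
        }
    }

    @Override
    public void close() {
        // Forward a trailing line that was not newline-terminated.
        if (currentLine.size() > 0) {
            emit();
        }
    }

    private void emit() {
        String line = currentLine.toString(StandardCharsets.UTF_8);
        currentLine.reset();
        // Drop the carriage return of a Windows-style line ending.
        if (line.endsWith("\r")) {
            line = line.substring(0, line.length() - 1);
        }
        lineConsumer.accept(line);
    }
}
```

Wrapping this in a `PrintStream` and passing something like `l -> logger.error("checkIndex: {}", l)` as the consumer would log lines as they are produced. One caveat: the diff above only logs when `isClean == false`, which is not known until the check finishes, so a streaming variant would have to log unconditionally or buffer anyway.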


henningandersen

This looks good. I wonder if we want a tool that reads the state without check index and rewrites it fully? Perhaps an option to the "unsafe bootstrap master" tool?

henningandersen · 2021-06-07T09:10:54Z

server/src/main/java/org/elasticsearch/gateway/PersistedClusterStateService.java

+                            if (logger.isErrorEnabled()) {
+                                outputStream.bytes().utf8ToString().lines().forEach(l -> logger.error("checkIndex: {}", l));
+                            }
+                            throw new IllegalStateException("metadata index at [" + dataPath +


We often refer to this as either cluster state or global state. I wonder if we could call it global state metadata index, just to be sure this message is interpreted correctly by operators?

DaveCTurner · Jun 7, 2021

Cluster state includes the metadata contained in this index but also includes ephemeral things like the routing table; conversely the metadata in this index comprises global metadata and index metadata. "Metadata index" is more correct than either, but I tried a few ideas and settled on "the index containing the cluster metadata", see 95378d2.

henningandersen · 2021-06-07T09:11:55Z

server/src/main/java/org/elasticsearch/gateway/PersistedClusterStateService.java

+                    onDiskState = loadOnDiskState(dataPath, directoryReader);
+
+                    if (nodeId.equals(onDiskState.nodeId) == false) {
+                        throw new IllegalStateException("unexpected node ID in metadata, found [" + onDiskState.nodeId +


Same comment on metadata as above, perhaps global state metadata?

DaveCTurner · Jun 7, 2021

You had to be doing something pretty weird to hit this message anyway, and doubly so today since the node IDs that we're comparing both come from the user-data from the latest commit of this index. Harmonised with the other message in 9e89fed.

henningandersen · 2021-06-07T09:12:12Z

server/src/main/java/org/elasticsearch/gateway/PersistedClusterStateService.java

-                        "] in [" + dataPath + "] but expected [" + nodeId + "]");
+                        if (isClean == false) {
+                            if (logger.isErrorEnabled()) {
+                                outputStream.bytes().utf8ToString().lines().forEach(l -> logger.error("checkIndex: {}", l));



DaveCTurner · 2021-06-08T08:23:47Z

> I wonder if we want a tool that reads the state without check index and rewrites it fully?

I have my doubts that we could implement such a thing robustly enough to be useful. I ran the corruption test 2000 times and saw only 44 cases (<2.5%) in which the state was readable after corruption anyway. Of course this test makes a pretty small state so the chances of hitting something vital with a one-byte error are maximised, but even so many of the exceptions seen are failures to parse a document in the index. Conversely in reality a corruption likely hits multiple bytes, perhaps even a whole disk sector at once, so I reckon we'd not be able to read the state in almost every case.

With a larger state there's a better chance that there are multiple segments, so CheckIndex#exorciseIndex might yield something useful, but I still think the probability of success is very low.

We could also perhaps extract the global metadata from an otherwise-broken index with slightly higher probability since we always write it first so it should be near the start of the stored fields of whichever segment it's in, and then we could fall back on dangling indices to recover the index metadata. This all seems overly heroic tho, I'd rather not go down this path.
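
To make the one-byte corruption scenario concrete, here is a minimal sketch of why verifying a checksum over the whole file catches such an error, whereas a reader that skips verification may not notice it. This is a toy model using the JDK's `CRC32`; the class and method names are hypothetical, and it is not Lucene's actual file format or the corruption test from this PR.

```java
import java.util.zip.CRC32;

/** Illustrative only: a toy stored-checksum check, not Lucene's on-disk format. */
final class ChecksumDemo {
    /** Computes the CRC32 of the given bytes. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns a copy of the data with a single bit flipped at the given position. */
    static byte[] corruptOneByte(byte[] data, int position) {
        byte[] copy = data.clone();
        copy[position] ^= 0x01;
        return copy;
    }

    /** Verification passes only if the recomputed checksum matches the stored one. */
    static boolean verify(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }
}
```

Computing `checksum` once over the intact bytes and then calling `verify` on the output of `corruptOneByte` returns `false` for every position, since CRC32 detects all single-bit errors. The check only helps if it is actually run over every file, which is the gap that running `CheckIndex` up front is meant to close.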

henningandersen

LGTM.


DaveCTurner added the >enhancement, :Distributed Indexing/Distributed, v8.0.0 and v7.14.0 labels May 19, 2021

elasticmachine added the Team:Distributed (Obsolete) label May 19, 2021

DaveCTurner commented May 19, 2021


DaveCTurner mentioned this pull request May 19, 2021

Enhance BufferedChecksumIndexInput error when a state file is empty #29358

Closed

DaveCTurner requested a review from henningandersen May 19, 2021 14:29

DaveCTurner added 3 commits May 28, 2021 07:03

Merge branch 'master' into 2021-05-19-check-metadata-index-before-loading

02b05af

Mention that the cause is an external force

268fbd5

fix test

e619f05

henningandersen reviewed Jun 7, 2021


DaveCTurner added 3 commits June 7, 2021 12:29

Merge branch 'master' into 2021-05-19-check-metadata-index-before-loading

9fb364c

Expand 'metadata index'

95378d2

Reword node ID mismatch too

9e89fed

DaveCTurner requested a review from henningandersen June 8, 2021 08:23

henningandersen approved these changes Jun 15, 2021


Merge branch 'master' into 2021-05-19-check-metadata-index-before-loading

e28fdf4

DaveCTurner merged commit a81fbb9 into elastic:master Jun 16, 2021

DaveCTurner deleted the 2021-05-19-check-metadata-index-before-loading branch June 16, 2021 09:27

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
