-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail shard if IndexShard#storeStats runs into an IOException #32241
Fail shard if IndexShard#storeStats runs into an IOException #32241
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I left some minor comments.
@@ -917,6 +917,13 @@ public StoreStats storeStats() { | |||
try { | |||
return store.stats(); | |||
} catch (IOException e) { | |||
failShard("Failing shard because of exception during storeState " + e.getMessage(), e); | |||
//TODO should close method be called inside failShard? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing. Currently it ends up being closed by IndicesClusterStateService#failAndRemoveShard
. I agree with you that it's better that the shard will close itself (probably in IndexShard.ShardEventListener#onFailedEngine
) but that's something for another PR. Can you remove the close method call here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleskes do you mean that in real application (not in unit tests), IndiciesClusterStateService will realize that shard has failed and will close the shard? That's why close call would no longer be needed.
null, new InternalEngineFactory(), () -> {}, EMPTY_EVENT_LISTENER); | ||
|
||
recoverShardFromStore(shard); | ||
boolean corruptIndexException = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: make this final and set it within the relevant if close? that way, you can't change it again by mistake.
exceptionToThrow.set(() -> new IOException("Test IOException")); | ||
} | ||
ElasticsearchException e = expectThrows(ElasticsearchException.class, shard::storeStats); | ||
assertThat(shard.state(), equalTo(IndexShardState.CLOSED)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can register a shard failure event listener and go with that for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleskes You mean, that I can just check if shard failure event listener has triggered instead of checking shard state? If I remove shard closing from implementation, then yes, actually shard state would be opened. It was in original proposed patch by you, that's why I've included logic to close the shard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleskes You mean, that I can just check if shard failure event listener has triggered instead of checking shard state?
yes
It was in original proposed patch by you, that's why I've included logic to close the shard.
I told you I wasn't sure of it's quality :)
boolean corruptIndexException = true; | ||
|
||
if (randomBoolean()) { | ||
exceptionToThrow.set(() -> new CorruptIndexException("Test CorruptIndexException", "Test resource")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for checking this.
* @param globalCheckpointSyncer callback for syncing global checkpoints | ||
* @param indexEventListener index even listener | ||
* @param listeners an optional set of listeners to add to the shard | ||
* @param routing shard routing to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting
final DirectoryService directoryService = new DirectoryService(shardId, indexSettings) { | ||
@Override | ||
public Directory newDirectory() throws IOException { | ||
return newFSDirectory(shardPath.resolveIndex()); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the extra line?
@andrershov could you please add labels to this PR (area label, change type label and version label(s) )? |
|
||
//TODO Should it be invoked when closing the shard? | ||
try { | ||
directory.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bleskes what do you think about explicitly closing the directory here? I've added this line, because w/o it test fails and reports unclosed resources. But is not it a shard that is responsible for directory lifecycle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's closed by the store, which is, in turn, closed in IndexService#closeShard
. That method is called by the failure listener. This part of why I didn't want you to start changing the closing logic as it will have a lot of little side effects (but I agree it's has become a mess over time).
w.r.t to this test - the easiest is to wrap the store you create in a try with resources block, so it will be always closed.
I've fixed formatting and nit things. @bleskes need to get your reply to 3 questions and I'll fix the rest. |
@bleskes Fixed all the issues: 1) Removed close call for shard 2) Replaced CLOSE state check with checking that failure handler has triggered 3) Replaced explicit directory close call to try with resources for store |
Pinging @elastic/es-distributed |
@bleskes should it be included in 7.0 only? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@andrershov I think this is fine to into 6.4 as well |
@@ -104,7 +104,7 @@ public void testRestoreSnapshotWithExistingFiles() throws IOException { | |||
shardRouting, | |||
shard.shardPath(), | |||
shard.indexSettings().getIndexMetaData(), | |||
null, | |||
null, null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is probably better on a separate line like the rest of the parameters.
@@ -2020,7 +2100,7 @@ public IndexSearcher wrap(IndexSearcher searcher) throws EngineException { | |||
ShardRoutingHelper.initWithSameId(shard.routingEntry(), RecoverySource.StoreRecoverySource.EXISTING_STORE_INSTANCE), | |||
shard.shardPath(), | |||
shard.indexSettings().getIndexMetaData(), | |||
wrapper, | |||
null, wrapper, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is probably better on a separate line like the rest of the parameters.
@@ -917,6 +917,7 @@ public StoreStats storeStats() { | |||
try { | |||
return store.stats(); | |||
} catch (IOException e) { | |||
failShard("Failing shard because of exception during storeState " + e.getMessage(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should say storeStats
, not storeState
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing, looking at the implementation of failShard
, we don't need to append e.getMessage()
here.
@@ -356,7 +364,7 @@ protected IndexShard reinitShard(IndexShard current, ShardRouting routing, Index | |||
routing, | |||
current.shardPath(), | |||
current.indexSettings().getIndexMetaData(), | |||
null, | |||
null, null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is probably better on a separate line like the rest of the parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few nits, and caught one error in the shard failure message.
@jasontedor fixed the issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @andrershov.
@andrershov Please wait for all CI checks to complete before merging. This is a step we take to reduce the likelihood that we break the build for the rest of the team. 😄 |
* 6.x: Security: revert to old way of merging automata (#32254) Fix a test bug in RangeQueryBuilderTests introduced in the field aliases backport. Introduce Application Privileges with support for Kibana RBAC (#32309) Undo a debugging change that snuck in during the field aliases merge. [test] port linux package packaging tests (#31943) Painless: Update More Methods to New Naming Scheme (#32305) Tribe: Add error with secure settings copied to tribe (#32298) Add V_6_3_3 version constant Add ERR to ranking evaluation documentation (#32314) [DOCS] Added link to 6.3.2 RNs [DOCS] Updates 6.3.2 release notes with PRs from ml-cpp repo (#32334) [Kerberos] Add Kerberos authentication support (#32263) [ML] Extract persistent task methods from MlMetadata (#32319) Backport - Add Snapshots Status API to High Level Rest Client (#32295) Make release notes ignore the `>test-failure` label. (#31309) [DOCS] Adds release highlights for search for 6.4 (#32095) Allow Integ Tests to run in a FIPS-140 JVM (#32316) Add support for field aliases to 6.x. (#32184) Register ERR metric with NamedXContentRegistry (#32320) fixes broken build for third-party-tests (#32315) Relates #31918 / Closes infra/issues/6085 [DOCS] Rollup Caps API incorrectly mentions GET Jobs API (#32280) Rest HL client: Add put watch action (#32026) (#32191) Add WeightedAvg metric aggregation (#31037) Consistent encoder names (#29492) Switch monitoring to new style Requests (#32255) specify subdirs of lib, bin, modules in package (#32253) Rename ranking evaluation `quality_level` to `metric_score` (#32168) Add new permission for JDK11 to load JAAS libraries (#32132) Switch x-pack:core to new style Requests (#32252) Watcher: Store username on watch execution (#31873) Silence SSL reload test that fails on JDK 11 Painless: Clean up add methods in PainlessLookup (#32258) CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction (#32185) Fail shard if IndexShard#storeStats runs into an IOException (#32241) Fix `range` queries on `_type` field for singe type indices (#31756) (#32161) AwaitsFix RecoveryIT#testHistoryUUIDIsGenerated Add new fields to monitoring template for Beats state (#32085) (#32273) [TEST] improve REST high-level client naming conventions check (#32244) Check that client methods match API defined in the REST spec (#31825)
* master: Security: revert to old way of merging automata (#32254) Networking: Fix test leaking buffer (#32296) Undo a debugging change that snuck in during the field aliases merge. Painless: Update More Methods to New Naming Scheme (#32305) [TEST] Fix assumeFalse -> assumeTrue in SSLReloadIntegTests Ingest: Support integer and long hex values in convert (#32213) Introduce fips_mode setting and associated checks (#32326) Add V_6_3_3 version constant [DOCS] Removed extraneous callout number. Rest HL client: Add put license action (#32214) Add ERR to ranking evaluation documentation (#32314) Introduce Application Privileges with support for Kibana RBAC (#32309) Build: Shadow x-pack:protocol into x-pack:plugin:core (#32240) [Kerberos] Add Kerberos authentication support (#32263) [ML] Extract persistent task methods from MlMetadata (#32319) Add Restore Snapshot High Level REST API Register ERR metric with NamedXContentRegistry (#32320) fixes broken build for third-party-tests (#32315) Allow Integ Tests to run in a FIPS-140 JVM (#31989) [DOCS] Rollup Caps API incorrectly mentions GET Jobs API (#32280) awaitsfix testRandomClusterStateUpdates [TEST] add version skip to weighted_avg tests Consistent encoder names (#29492) Add WeightedAvg metric aggregation (#31037) Switch monitoring to new style Requests (#32255) Rename ranking evaluation `quality_level` to `metric_score` (#32168) Fix a test bug around nested aggregations and field aliases. (#32287) Add new permission for JDK11 to load JAAS libraries (#32132) Silence SSL reload test that fails on JDK 11 [test] package pre-install java check (#32259) specify subdirs of lib, bin, modules in package (#32253) Switch x-pack:core to new style Requests (#32252) awaitsfix SSLConfigurationReloaderTests Painless: Clean up add methods in PainlessLookup (#32258) Fail shard if IndexShard#storeStats runs into an IOException (#32241) AwaitsFix RecoveryIT#testHistoryUUIDIsGenerated Remove unnecessary warning supressions (#32250) CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction (#32185) Add new fields to monitoring template for Beats state (#32085)
Fix for #29008.
There are two places that I'm not sure about, I've marked them TODO.
Based on previous pull request #29078, which can be closed if this one goes through.