Archive unknown or invalid settings on updates #28888
Conversation
Today we can end up in a situation where the cluster state contains unknown or invalid settings. This can happen easily during a rolling upgrade. For example, consider two nodes that are on a version that considers the setting foo.bar to be known and valid. Assume one of these nodes is restarted on a higher version that considers foo.bar to now be either unknown or invalid, and then the second node is restarted too. Now, both nodes will be on a version that considers foo.bar to be unknown or invalid, yet this setting will still be contained in the cluster state. This means that if a cluster settings update is applied and we validate the settings update with the existing settings, then validation will fail. In such a state, the offending setting cannot even be removed. This commit helps out with this situation by archiving any settings that are unknown or invalid at the time that a settings update is applied. This allows the settings update to go through, and the archived settings can be removed at a later time.
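The archiving idea described above can be sketched roughly as follows. `ArchiveSketch`, `archiveUnknownOrInvalid`, and the validity predicate are illustrative names for this sketch, not the actual Elasticsearch API (the real logic lives in the settings infrastructure and uses full `Setting` validation); the `archived.` prefix itself is from the pull request.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Minimal sketch (not the real Elasticsearch code): settings the current
// version no longer recognizes are kept, but renamed under an "archived."
// prefix, so the rest of the update validates cleanly and the archived
// values remain inspectable and removable later.
public class ArchiveSketch {
    static final String ARCHIVED_SETTINGS_PREFIX = "archived.";

    static Map<String, String> archiveUnknownOrInvalid(
            Map<String, String> settings, Predicate<String> isKnownAndValid) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (isKnownAndValid.test(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            } else {
                // keep the value so operators can inspect and delete it later
                result.put(ARCHIVED_SETTINGS_PREFIX + e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> in = new LinkedHashMap<>();
        in.put("cluster.routing.allocation.enable", "all");
        in.put("foo.bar", "baz"); // unknown on the new version
        Map<String, String> out =
                archiveUnknownOrInvalid(in, k -> !k.equals("foo.bar"));
        System.out.println(out); // foo.bar survives as archived.foo.bar
    }
}
```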
I left some minors. newlines are out of control here.
@@ -48,15 +54,35 @@ synchronized Settings getPersistentUpdate() {
        return persistentUpdates.build();
    }

    synchronized ClusterState updateSettings(final ClusterState currentState, Settings transientToApply, Settings persistentToApply) {
    synchronized ClusterState updateSettings(
get your newlines under control my friend
     * - validate the incoming settings update combined with the existing known and valid settings
     * - merge in the archived unknown or invalid settings
     */
    final Tuple<Settings, Settings> partitionedTransientSettings =
can we have 2 methods Settings getValidSettings(Settings)
and Settings getInvalidSetting(Settings)
that would make it simpler to see what is valid and not instead of a tuple?
I pushed a change. I kept the tuple but clarified its usage. I hope this helps.
            final Map.Entry<String, String> e,
            final IllegalArgumentException ex,
            final Logger logger) {
        logger.warn(
man newlines?!
I thought about painting it yellow. 😛
@s1monw I pushed some changes addressing your feedback; would you take another look?
test this please
* master:
  [TEST] AwaitsFix QueryRescorerIT.testRescoreAfterCollapse
  Decouple XContentType from StreamInput/Output (elastic#28927)
  Remove BytesRef usage from XContentParser and its subclasses (elastic#28792)
  [DOCS] Correct typo in configuration (elastic#28903)
  Fix incorrect datemath example (elastic#28904)
  Add a usage example of the JLH score (elastic#28905)
  Wrap stream passed to createParser in try-with-resources (elastic#28897)
  Rescore collapsed documents (elastic#28521)
  Fix (simple)_query_string to ignore removed terms (elastic#28871)
  [Docs] Fix typo in composite aggregation (elastic#28891)
  Try if tombstone is eligable for pruning before locking on it's key (elastic#28767)
            (e, ex) -> logInvalidSetting(settingsType, e, ex, logger));
    return Tuple.tuple(
            Settings.builder()
                    .put(settingsWithUnknownOrInvalidArchived.filter(k -> k.startsWith(ARCHIVED_SETTINGS_PREFIX) == false))
I find this distinction a bit confusing - why do we keep the "old" invalid settings but remove the new ones? In the next iteration we will just be including the newly archived settings in the previous ones. I'm inclined to say - either remove all archived settings (both new and old) when validating, or keep them all. WDYT?
@bleskes Consider a settings update to remove archived persistent settings:
{
"persistent": {
"archived.*": null
}
}
If there are existing persistent archived settings in the cluster state, these should be removed by this update. The strategy that we use to handle unknown or invalid settings is to add them at the end with the archived. prefix. If we do the same for existing archived settings, the application of the above settings update will not remove them and they will remain in the cluster state. Does that help?
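The wildcard-removal semantics being argued for here can be sketched as below. `ArchivedRemovalSketch` and `applyUpdate` are hypothetical names for illustration; the real code applies updates through the Settings builder, but the observable behavior is the same: `"archived.*": null` removes only keys under the `archived.` prefix, never other invalid settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not the actual Elasticsearch code): a null value in a
// settings update means "remove keys matching this pattern". Because already
// archived settings are set aside rather than re-archived, a wildcard delete
// like "archived.*" can actually remove them.
public class ArchivedRemovalSketch {

    static Map<String, String> applyUpdate(Map<String, String> existing, String removePattern) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        if (removePattern.endsWith("*")) {
            String keyPrefix = removePattern.substring(0, removePattern.length() - 1);
            result.keySet().removeIf(k -> k.startsWith(keyPrefix));
        } else {
            result.remove(removePattern);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new LinkedHashMap<>();
        existing.put("archived.foo", "1");
        existing.put("archived.bar", "2");
        existing.put("invalid.setting", "3");
        // the update {"persistent": {"archived.*": null}} removes only the archived keys
        Map<String, String> after = applyUpdate(existing, "archived.*");
        System.out.println(after); // invalid.setting remains
    }
}
```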
I think I understand you to say that you want the invalid settings not to be removed by the incoming request you gave. If so, nothing we do here will be "clean" (for example, after you called that command there may be archived settings in the cluster state, which is not what people expect). Not sure it's worth all the complexity. If you prefer it this way (I presume you do), I'm fine with keeping as is but please write an explicit test to demonstrate this behavior.
for example, after you called that command there may be archived settings in the cluster state, which is not what people expect
I disagree with this. If I list my cluster settings and I see archived.foo, archived.bar, and invalid.setting, I do not expect PUT archived.*: null (lazy) to remove invalid.setting. Hence the choice that I made here.
I pushed a test in 3298664.
        final Settings toApply = Settings.builder().put("dynamic.setting", "value").build();
        final boolean applyTransient = randomBoolean();
        final ClusterState clusterStateAfterUpdate;
        if (applyTransient) {
It took me quite a while to understand what's going on - on the one hand the persistent vs transient question is handled by parameters to this method (please rename the function parameter if we keep it ;)) and on the other hand we have a boolean here that deals with it directly, which may be inconsistent. On top of it all, the REST API and the transport layer allow changing both at once - something we don't test. I think the functional programming got a bit out of hand here, probably due to multiple iterations. Can we have a straightforward test that works with both persistent and transient (randomly combining them)?
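The shape of the randomized test being asked for might look like the sketch below. All names here are illustrative (the real test uses the Elasticsearch test framework's randomBoolean() and real Settings builders); the point is only that each setting independently lands in the persistent scope, the transient scope, both, or neither within a single update.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Sketch of the randomized test shape: spread updates across persistent and
// transient scopes in the same request instead of picking one scope per run.
public class CombinedScopesTestSketch {
    public static void main(String[] args) {
        Random random = new Random();
        Map<String, String> persistent = new LinkedHashMap<>();
        Map<String, String> transientSettings = new LinkedHashMap<>();
        for (String key : new String[] {"dynamic.a", "dynamic.b", "dynamic.c"}) {
            // each setting independently goes to either scope, both, or neither
            if (random.nextBoolean()) persistent.put(key, "value");
            if (random.nextBoolean()) transientSettings.put(key, "value");
        }
        System.out.println("persistent=" + persistent + " transient=" + transientSettings);
    }
}
```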
I pushed d2bc7fd.
* master: (28 commits)
  Maybe die before failing engine (elastic#28973)
  Remove special handling for _all in nodes info
  Remove Booleans use from XContent and ToXContent (elastic#28768)
  Update Gradle Testing Docs (elastic#28970)
  Make primary-replica resync failures less lenient (elastic#28534)
  Remove temporary file 10_basic.yml~
  Use different pipeline id in test. (pipelines do not get removed between tests extending from ESIntegTestCase)
  Use fixture to test the repository-gcs plugin (elastic#28788)
  Use String.join() to describe a list of tasks (elastic#28941)
  Fixed incorrect test try-catch statement
  Plugins: Consolidate plugin and module loading code (elastic#28815)
  percolator: Take `matchAllDocs` and `verified` of the sub result into account when analyzing a function_score query.
  Build: Remove rest tests on archive distribution projects (elastic#28952)
  Remove FastStringReader in favor of vanilla StringReader (elastic#28944)
  Remove FastCharArrayReader and FastCharArrayWriter (elastic#28951)
  Continue registering pipelines after one pipeline parse failure. (elastic#28752)
  Build: Fix ability to ignore when no tests are run (elastic#28930)
  [rest-api-spec] update doc link for /_rank_eval
  Switch XContentBuilder from BytesStreamOutput to ByteArrayOutputStream (elastic#28945)
  Factor UnknownNamedObjectException into its own class (elastic#28931)
  ...
Thx @jasontedor for the extra iteration. I left optional comments.
        for (final Setting<String> invalidSetting : invalidSettings) {
            if (existingPersistentSettings.keys().contains(invalidSetting.getKey())) {
                assertThat(
                        clusterStateAfterUpdate.metaData().persistentSettings().keySet(),
check that it doesn't exist in the non archived form? alternatively check the total count of settings?
I pushed e85ad01.
        final Settings.Builder transientToApply = Settings.builder();
        for (final Setting<String> dynamicSetting : dynamicSettings) {
            if (randomBoolean()) {
                persistentToApply.put(dynamicSetting.getKey(), "value");
should we skip some and see they don't change?
I pushed e85ad01.
* master:
  Add search slowlog level to docs (elastic#29040)
  Add docs for error file configuration (elastic#29032)
  Archive unknown or invalid settings on updates (elastic#28888)
Relates #28609