Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RAC] Change index bootstrapping strategy #113389

Merged

Conversation

Kerry350
Copy link
Contributor

@Kerry350 Kerry350 commented Sep 29, 2021

Summary

This PR implements #108941. These changes aim to support additive mappings changes only.

Note:

The changes here do not currently reflect these two ACs:

  • composed of the component templates matching the current version
  • with the current version in the _meta object

As I don't think we need them yet, and I also want to speak to Felix about them.

High level overview

(This may be helpful for those unfamiliar with the inner workings of this code)

  • The index bootstrapping can be roughly split into two phases, the "index level resources" and the "namespaces level resources".
    • The index level resources include the component templates and the ILM policy (if provided), these are not namespaced. This installation happens via installIndexLevelResources() and is invoked when initializeIndex() is called via the RuleDataService. Solutions, for example, will do this on plugin setup. You can think of these as the "building blocks" that will create the namespace level resources.
    • The namespace level resources are installed / updated when a namespace's writer is first accessed, and therefore initialised (please note that namespaces aren't used by any solution yet, but we default to default, so there is always a namespace). If there is no writer cached for that namespace, initializeWriter() will attempt to installAndUpdateNamespaceLevelResources().
      • installAndUpdateNamespaceLevelResources() will take our "building blocks" from before and install / update the namespaced index template, create a concrete write index (if needed), and attempt to update the mappings of any current concrete write indices (via first simulating the index template via simulateIndexTemplate(), and then attempting to put those mappings).
      • If for any reason installAndUpdateNamespaceLevelResources() cannot be completed then writing is completely disabled.

Testing notes

Putting aside the more obvious concerns (e.g. templates are installed correctly), the most important thing is that a developer accidentally performing a non-additive change does not break the overall alerting framework (as in, executor runs aren't broken).

Dev tools is likely to be your best friend here. Below are some useful snippets that you might want to check whilst testing (change logs to any other registration context):

  • Check which alerts indices exist: GET _cat/indices/.alerts-observability*
  • Get an index's template: GET /_index_template/.alerts-observability.logs.alerts-default*
  • Check on aliases: GET /_cat/aliases
  • Force a rollover: POST .alerts-observability.logs.alerts-default/_rollover (to check the ILM policy works as expected)
  • See mappings for alerts indices: GET .alerts-observability*/_mapping
  • Check a component template: GET /_component_template/.alerts-observability.logs.alerts-mappings (or any value from composed_of when checking the index template)

Things to check:

  • Are the resources installed correctly with a blank slate? Does writing alerts documents work?

  • If you make an additive change, for example to a component template, do things work as expected? (The index template should update, as should the mappings for any existing concrete write indices). You can introduce these types of additive changes here as an example: https://github.com/elastic/kibana/blob/master/x-pack/plugins/infra/server/services/rules/rule_data_client.ts#L31 (feel free to try other registration contexts / solutions)

  • If you make a non-additive change, for example changing the type of a field, do things work as expected? (There should be a very clearly logged error, writing should be disabled, and executor runs should be unaffected). Log output should look roughly like this in that scenario:

Screenshot 2021-09-28 at 14 11 59

@Kerry350 Kerry350 self-assigned this Sep 29, 2021
@Kerry350 Kerry350 added Feature:RAC label obsolete release_note:skip Skip the PR/issue when compiling release notes Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.16.0 v8.0.0 labels Sep 29, 2021
@jasonrhodes
Copy link
Member

Absolutley brilliant overview, @Kerry350 — thank you!

@mgiota
Copy link
Contributor

mgiota commented Sep 30, 2021

@Kerry350 Thanks a lot for the great summary!

I did some testing locally and here are my findings so far:

  • I loaded some existing test alerts data from the archive to my local cluster and did some changes to the existing alerts without any issues (change workflow status , add to a new case)
  • I created a new Inventory Rule and without doing any mapping change to the infra component template I got following errors in my console
server    log   [08:54:07.395] [error][plugins][ruleRegistry] Failed to PUT mapping for alias .alerts-observability.metrics.alerts-default: illegal_argument_exception: [illegal_argument_exception] Reason: mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]
server    log   [08:54:08.397] [info][plugins][ruleRegistry] Installing namespace-level resources and creating concrete index for .alerts-observability.metrics.alerts-default
server    log   [08:54:08.418] [error][plugins][ruleRegistry] Failed to PUT mapping for alias .alerts-observability.metrics.alerts-default: illegal_argument_exception: [illegal_argument_exception] Reason: mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]
server    log   [08:54:10.422] [info][plugins][ruleRegistry] Installing namespace-level resources and creating concrete index for .alerts-observability.metrics.alerts-default
server    log   [08:54:10.443] [error][plugins][ruleRegistry] Failed to PUT mapping for alias .alerts-observability.metrics.alerts-default: illegal_argument_exception: [illegal_argument_exception] Reason: mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]
server    log   [08:54:10.444] [error][plugins][ruleRegistry] There has been a catastrophic error trying to install namespace level resources. Writing can no longer continue, and has been disabled for the following registration context: observability.metrics. 
This may have been due to a non-additive change to the mappings, removal and type changes are not permitted. Full error: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]"},"status":400}

No new alerts appeared in the Alerts table as expected, but the Inventory Rule appeared as Active in the Rule and Connector page. Shouldn't the rule itself change state in the case of a catastrophic error or it should stay active?

Changes through Index Management
I must say though that before creating the new inventory rule type I was trying to mess around a bit with component templates through Index Management. I edited .alerts-observability.apm.alerts-mappings and changed an existing field type from object to flattened and then back to object. I have 2 questions here:

  • Why did changes to apm mappings affected metrics?
  • The fact that I was able to change field types through Index Mangement made me question our approach of non additive changes. Shall we disable the Edit option in the UI for .alerts* templates or do I miss something here?

I will restart my ES and do more testing. I will start from a clean state, I will create some new rules before messing around with component templates to verify that I get alerts and then I will start messing around with making more changes.

@mgiota
Copy link
Contributor

mgiota commented Sep 30, 2021

@Kerry350 Is there an easy way I can bring some apm data to my local cluster? I would like to create some apm rules and then try to change the apm component template https://github.com/elastic/kibana/blob/master/x-pack/plugins/apm/server/plugin.ts#L121.

@Kerry350
Copy link
Contributor Author

Kerry350 commented Sep 30, 2021

@mgiota Thank you for the testing ♥️

I created a new Inventory Rule and without doing any mapping change to the infra component template I got following errors in my console

Hmm, it's interesting this would happen with no changes.

[kibana.alert.rule.from] cannot be changed from type [date] to [keyword]"}] - technically it's doing exactly what we want it to, but I'd need to see why that date to keyword change was happening for you if you didn't change any of the templates etc.

No new alerts appeared in the Alerts table as expected, but the Inventory Rule appeared as Active in the Rule and Connector page. Shouldn't the rule itself change state in the case of a catastrophic error or it should stay active?

It should stay active, this is the result we want given these "catastrophic" scenarios. We don't want RAC / alerts as data etc to interfere with the overall alerting framework. Rule execution should carry on as normal, and actions and so on should still be dispatched. The active state on the Rules and Connectors page is determined by the alerting framework side of things.

For us to reach this stage the assumption is (or at least we assume) that some sort of bad mapping has made it past both PR review and QA and into the release. It should, hopefully, never happen, but if it does we definitely don't want RAC to stop peoples potentially critical alerts.

I must say though that before creating the new inventory rule type I was trying to mess around a bit with component templates through Index Management.

Ah, okay, then that might have resulted in the above state. Although not if you only touched APM resources 🤔

Are you 100% sure you didn't touch the .alerts-technical-mappings template too?

Why did changes to apm mappings affected metrics?

It shouldn't. I'll try fiddling with the Index Management page (I hadn't changed things there).

The fact that I was able to change field types through Index Mangement made me question our approach of non additive changes. Shall we disable the Edit option in the UI for .alerts* templates or do I miss something here?

This is a good question. As I say I hadn't considered people coming to that page and changing the component templates etc 😬

@jasonrhodes Would be interested in your thoughts on this bit?

@Kerry350
Copy link
Contributor Author

Is there an easy way I can bring some apm data to my local cluster? I would like to create some apm rules and then try to change the apm component template

Are you using the cluster scraping script from Chris?

If so you can add it to your config. I have the following:

"indices": [
    { "name": "filebeat", "pattern": "filebeat-*" },
    { "name": "metricbeat", "pattern": "metricbeat-*" },
    { "name": "apm", "pattern": "apm-*" },
    { "name": "uptime", "pattern": "heartbeat-*" }
   ]

@mgiota
Copy link
Contributor

mgiota commented Sep 30, 2021

I tested again and here are my findings. In a nutshell I identified that root of the problem was that I loaded existing alerts from the archive before creating a new rule.

Scenario 1

  • Restart ES
  • Run Kibana
  • Load existing alerts from the archive
  • Start metricbeat
  • Create a new inventory rule
  • I got following errors
server    log   [11:20:19.588] [info][plugins][ruleRegistry] Installing namespace-level resources and creating concrete index for .alerts-observability.metrics.alerts-default
server    log   [11:20:19.724] [error][plugins][ruleRegistry] Failed to PUT mapping for alias .alerts-observability.metrics.alerts-default: illegal_argument_exception: [illegal_argument_exception] Reason: mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]
server    log   [11:20:20.726] [info][plugins][ruleRegistry] Installing namespace-level resources and creating concrete index for .alerts-observability.metrics.alerts-default
server    log   [11:20:20.764] [error][plugins][ruleRegistry] Failed to PUT mapping for alias .alerts-observability.metrics.alerts-default: illegal_argument_exception: [illegal_argument_exception] Reason: mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]
server    log   [11:20:22.766] [info][plugins][ruleRegistry] Installing namespace-level resources and creating concrete index for .alerts-observability.metrics.alerts-default
server    log   [11:20:22.795] [error][plugins][ruleRegistry] Failed to PUT mapping for alias .alerts-observability.metrics.alerts-default: illegal_argument_exception: [illegal_argument_exception] Reason: mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]
server    log   [11:20:22.795] [error][plugins][ruleRegistry] There has been a catastrophic error trying to install namespace level resources. Writing can no longer continue, and has been disabled for the following registration context: observability.metrics. 
          This may have been due to a non-additive change to the mappings, removal and type changes are not permitted. Full error: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]"},"status":400}
server    log   [11:20:22.796] [error][plugins][ruleRegistry] The writer for the Rule Data Client for the observability.metrics registration context was not initialized properly, bulk() cannot continue. 
              Full error: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [kibana.alert.rule.from] cannot be changed from type [date] to [keyword]"},"status":400}
server    log   [11:20:25.564] [info][actions][plugins] Server log: test_inventory_again - Panagiotas-MBP is in a state of ALERT;;Reason:;CPU usage is greater than a threshold of 7 (current value is 26.4%);

Scenario 2

  • Restart ES
  • Run Kibana
  • Start metricbeat
  • Create a new inventory rule
  • Alert appears in the alert table

The only difference between 2 scenarios is that in the first scenario I loaded some existing alerts before creating a new rule. @Kerry350 Do you think it is the test that leads to false alarm or is it indeed a problem? To me it looks like there was a non additive change to the kibana.alert.rule.from field since you generated the archive data and this non-additive change shouldn't happen in real life, because we will prevent it (we will disable write and print above error in the console while development).

But what if a non-additive change happens through Index Management UI? @jasonrhodes I would like to hear your thoughts on this.

@Kerry350
Copy link
Contributor Author

Kerry350 commented Sep 30, 2021

@mgiota Ah, okay, yes. I should have paid more attention to I loaded some existing test alerts data from the archive. When you use the ES Archiver it installs the index templates, as well as the documents themselves. The archive indeed currently has date for from in the mappings.json. Whereas the technical template has this as keyword: https://github.com/elastic/kibana/blob/master/x-pack/plugins/rule_registry/common/assets/field_maps/technical_rule_field_map.ts#L118. It does mean the data has changed since archived (I'll add it to my todo list to regenerate it...again 😅). Generally speaking I wouldn't use archive data for something that isn't tests for this reason (unless something extreme like lack of internet connection). It's good to have data that reflects master so to speak as this project is moving so quickly.

Change in question: 820b1ca

We can ignore this as an issue, it'll be fixed by having new archive data 👍

Moving forward we will also think about ways to have solid functional tests for these additive only mappings changes.

@Kerry350
Copy link
Contributor Author

(The good news is it caught the problem which is what we want)

@mgiota
Copy link
Contributor

mgiota commented Sep 30, 2021

@Kerry350 Exactly! That's what I was thinking right now. I will proceed with a bit more testing (do a few additive changes and verify everything works as expected).

Changes through Index Management is something we should think about more I guess.

@mgiota
Copy link
Contributor

mgiota commented Sep 30, 2021

Are you using the cluster scraping script from Chris?

Yep! I will try what you suggested!

@Kerry350
Copy link
Contributor Author

Update: We've just discussed the Index Management UI query in a meeting. The conclusion is that this is a far reaching problem that affects many different things, and there isn't really anything we can do about it. It isn't ideal, but we at least know it's a vector for problems now.

@mgiota
Copy link
Contributor

mgiota commented Oct 1, 2021

@Kerry350 I did a few additive changes to inventory and apm rule types and I verify that everything works as expected. Good job :-)

@mgiota
Copy link
Contributor

mgiota commented Oct 1, 2021

I have one question regarding Index Management UI and hidden indices. The .internal.alerts* indices are hidden indices (I need to toggle the Include hidden indices to see them). On the other hand the .alerts-observability*` component templates appear by default on the list. Isn't there the concept of hidden component templates as well?

@Kerry350
Copy link
Contributor Author

Kerry350 commented Oct 1, 2021

Thanks for the thorough testing, much appreciated 🙏

On the other hand the .alerts-observability*` component templates appear by default on the list. Isn't there the concept of hidden component templates as well?

That's a good question. My understanding is "no there isn't", but I'll look into that as I'm not 100%.

@Kerry350
Copy link
Contributor Author

Kerry350 commented Oct 4, 2021

@elasticmachine merge upstream

@Kerry350
Copy link
Contributor Author

Kerry350 commented Oct 11, 2021

@weltenwort Thank you for reviewing 🙏

⚠️ For the most part your suggestions should be implemented now. However, what's pushed here currently is broken (due to a question I have).

The backing indices are updated multiple times if multiple namespaces exist for the same registration context (I think).

This should be specific to a namespace now, I am still verifying this. Edit: Looks good 👍

The writer ignores the write flag for namespace-level initialization.

This is changed now.

Only the mappings of the backing indices are updated, but not the settings and metadata.

I am unsure about this one. And this is what's currently broken. I changed this to perform a:

const simulatedSettings = get(simulatedIndexTemplate, ['body', 'template', 'settings']);
await clusterClient.indices.putSettings({
        index,
        body: simulatedSettings,
});

(same as the mappings)

But this will result in the error:

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Can't update non dynamic settings [[index.number_of_shards]] for open indices [[.internal.alerts-observability.logs.alerts-default-000001/3tM4YRC9R9SbsgrTDkIXEw]]"}],"type":"illegal_argument_exception","reason":"Can't update non dynamic settings [[index.number_of_shards]] for open indices [[.internal.alerts-observability.logs.alerts-default-000001/3tM4YRC9R9SbsgrTDkIXEw]]"},"status":400}

The error makes sense, the technical component template provides the setting index.number_of_shards, and that particular setting is classified as static (https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#index-modules-settings), which means we can only update it on index creation time or when the index is closed.

Therefore, do we want to just limit settings updates to the dynamic settings? If so, is there some easy and reliable way to separate what is static and dynamic?

Last question is with regards to metadata, how will we update this for the backing indices? I understand updating this within the actual index template, but not for existing indices. We have the putMappings and putSettings APIs to update the current indices, am I missing something that allows updating the meta? Technically the _meta under mappings is updated when we do putMappings, but there's the top level _meta too, was this in reference to that? (This is possibly just a silly question).

@weltenwort
Copy link
Member

Only the mappings of the backing indices are updated, but not the settings and metadata.

But this will result in the error [...]

Therefore, do we want to just limit settings updates to the dynamic settings? If so, is there some easy and reliable way to separate what is static and dynamic?

Good point! I only see two options right now:

  1. Maintain an allow- or deny list to filter the settings by.
  2. Don't update the settings for now.

Updates of static index settings need to be treated like non-additive mapping updates, which is forbidden until we have a properly specified update process. So option 2 doesn't sound too bad right now. @jasonrhodes what do you think? Can we defer updating the settings and make it part of the requirements of the update process that supports rollovers/conflicting mapping updates?

Last question is with regards to metadata, how will we update this for the backing indices? I understand updating this within the actual index template, but not for existing indices. We have the putMappings and putSettings APIs to update the current indices, am I missing something that allows updating the meta? Technically the _meta under mappings is updated when we do putMappings, but there's the top level _meta too, was this in reference to that? (This is possibly just a silly question).

You're right, the _meta in mappings should be sufficient. I might have overlooked that and will double-check whether it's updated already.

@weltenwort
Copy link
Member

You're right, the _meta in mappings should be sufficient. I might have overlooked that and will double-check whether it's updated already.

The index _meta is updated as part of the mappings. Sorry for the confusion.

@jasonrhodes
Copy link
Member

So option 2 doesn't sound too bad right now. @jasonrhodes what do you think? Can we defer updating the settings and make it part of the requirements of the update process that supports rollovers/conflicting mapping updates?

Seems ok to me, so long as we can figure out how to clearly communicate this to anyone else working on these registry-connected templates. The only other solution I could think of is catching the error and inspecting it, and removing any settings updates that failed due to being static, but that sounds like a vector for a lot of confusing scenarios, potentially. Go with whichever you both think makes sense here but please think about how we can document this kind of thing for future engineers who work on rules?

@Kerry350
Copy link
Contributor Author

Kerry350 commented Oct 13, 2021

Seems ok to me, so long as we can figure out how to clearly communicate this to anyone else working on these registry-connected templates. The only other solution I could think of is catching the error and inspecting it, and removing any settings updates that failed due to being static, but that sounds like a vector for a lot of confusing scenarios, potentially.

Outside of this PR we (Kerry / Jason / Felix ) decided to go with option 2 for now. And this will be communicated via a code comment for now, as there isn't anywhere else more obvious to mention it. We may consider an "allow list" down the road, but this should be decided on almost in the same way as a schema.

@Kerry350
Copy link
Contributor Author

This should be good to go now, the three core issues that were brought up have been fixed (bar the settings issue, which is deferred).

@Kerry350 Kerry350 requested a review from weltenwort October 13, 2021 13:46
Copy link
Member

@weltenwort weltenwort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the comment below about deprecating the unsafe.indexUpgrade.enabled setting, I couldn't find a way to break it (within the known limitations). 👏 Nicely done!

Comment on lines 22 to 24
indexUpgrade: schema.object({
enabled: schema.boolean({ defaultValue: false }),
}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, maybe we should leave this in the schema and only flag it as "unused" as to not break Kibana startup?

  deprecations: ({ deprecate, unused }) => [
    deprecate('enabled', '8.0.0'),
    unused('unsafe.indexUpgrade.enabled'),
  ],

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

Screenshot 2021-10-13 at 16 58 47

@Kerry350
Copy link
Contributor Author

@weltenwort Thanks for the reviews 🙏

Will merge once we're greeeeen.

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
ruleRegistry 108 113 +5
Unknown metric groups

API count

id before after diff
ruleRegistry 131 136 +5

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @Kerry350

@Kerry350 Kerry350 added the auto-backport Deprecated - use backport:version if exact versions are needed label Oct 13, 2021
@Kerry350 Kerry350 merged commit b96f544 into elastic:master Oct 13, 2021
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 13, 2021
* Change index bootstrapping to cater for non-additive changes only
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed Feature:RAC label obsolete release_note:skip Skip the PR/issue when compiling release notes Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.16.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants