
Support large saved object indices consuming 10s of GBs #147852

Open
rudolf opened this issue Dec 20, 2022 · 4 comments
Labels
Feature:Migrations, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@rudolf
Contributor

rudolf commented Dec 20, 2022

While #144035 will reduce the upgrade downtime of clusters with millions of saved objects, large indices holding GBs of data introduce challenges of their own. The general Elasticsearch guidance is to keep shards between 10GB and 50GB in size, whereas the saved object indices always use a single primary shard. We currently only have a handful of customers with .kibana indices larger than 10GB, but this number is likely to grow.

There are a few options to mitigate this problem:

  1. Config option to specify .kibana primary shards #156306
  2. Have all saved object indices use a high shard count by default.
    This consumes unnecessary shards for small clusters but improves scalability for larger clusters.
  3. Re-shard the indices of large clusters.
    Because this requires a reindex, it will cause downtime. Given that the reason for re-sharding is a large cluster, such downtime would be significant.
  4. Use a rollover index with an ILM size policy (see the sketch after this list).
    • This would require significant changes to the saved objects repository. Update operations would first need to search for the _id of the document being updated to locate the index in which it resides, and the update would then have to be made directly against that index.
    • Would updateByQuery operations continue to work?
    • We would need to change the migration algorithm to use a mappings template instead of creating indices with explicit mappings (or perform the "rollover" manually from inside Kibana).
    • Deletes might cause unbalanced shards.
  5. Ask Elasticsearch for a zero-downtime resharding API
  6. Perform a zero-downtime reindex during upgrade (powered by Elasticsearch or on the Kibana side)
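To make the repository-side cost of option (4) concrete, here is a minimal sketch, assuming an 8.x `@elastic/elasticsearch` client and a hypothetical `.kibana_objects` write alias (both the alias and the function are illustrative, not part of any existing implementation). Because writes through a rollover alias only reach the current write index, updating an existing document first requires locating the concrete backing index that holds it:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical write alias for a rolled-over saved objects index.
const SAVED_OBJECTS_ALIAS = '.kibana_objects';

// Updating a document behind a rollover alias: the document may live in any
// backing index, so we first resolve its _index and then update it directly.
async function updateSavedObject(id: string, doc: Record<string, unknown>) {
  // 1. Search across all backing indices for the document's _id.
  const result = await client.search({
    index: SAVED_OBJECTS_ALIAS,
    query: { ids: { values: [id] } },
    _source: false,
  });

  const hit = result.hits.hits[0];
  if (!hit) {
    throw new Error(`Saved object ${id} not found`);
  }

  // 2. Issue the update directly against the backing index that holds it.
  return client.update({
    index: hit._index,
    id,
    doc,
  });
}
```

The extra search round-trip before every update (and the equivalent resolution step for bulk updates) is the kind of change to the saved objects repository this option implies.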
rudolf added the Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc) and Feature:Migrations labels on Dec 20, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@pgayvallet
Contributor

Have all saved object indices use a high shard count by default.

First, I'd like to understand what reasonable maximum size we would expect to cover 99.9% of our customers' usage. If we're talking about 100GB, my gut feeling is that increasing the shard count to 2 or 3 by default could be a very acceptable and pragmatic compromise?
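As a point of reference (a sketch, not something from the discussion above): the primary shard count can only be set when an index is created, so a higher default would have to be applied wherever the migration creates the target index. Using the 8.x `@elastic/elasticsearch` client, with the index name and counts as placeholders:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// number_of_shards is fixed at index creation time; changing it afterwards
// requires a split/shrink or a full reindex. Name and counts are placeholders.
await client.indices.create({
  index: '.kibana_example_001',
  settings: {
    number_of_shards: 3,
    number_of_replicas: 1,
  },
});
```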

Also, do we have any guess about the per-type distribution for large saved object usages? Because if it's not just one type taking 90% of the total size, splitting our indices per group of types (as we're currently discussing) would also help here.

Re-shard the indices of large clusters.

What about environments where downtime may not be acceptable? I feel like this single statement makes this option a no-go, wdyt?

Use a rollover index with an ILM size policy

I agree that if nothing else works, it may be something we would have to look at. The implications are so significant though, at various levels of the SOR and migration systems, that we would need a very strong reason to consider this option worth it imho.
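For context, the Elasticsearch side of the rollover option is comparatively small; the significant changes are the SOR and migration ones discussed above. A size-based rollover policy is only a few lines, sketched here with the 8.x `@elastic/elasticsearch` client (policy name and threshold are placeholders, not a proposal):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Roll the write index over once its primary shard exceeds a size threshold.
// Policy name and threshold are placeholders.
await client.ilm.putLifecycle({
  name: 'saved-objects-size-rollover',
  policy: {
    phases: {
      hot: {
        actions: {
          rollover: { max_primary_shard_size: '50gb' },
        },
      },
    },
  },
});
```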

@jasonrhodes
Member

what reasonable maximum size we would expect to cover 99.9% of our customers' usage

I'm nervous about this approach, mainly because it sounds so similar to the "100K saved objects migration takes 10 minutes" capacity-guessing problem that led us to want to restrict migrations. My question with this kind of thing would always be: "what happens if a customer has 10x the upper limit of our expectations?" Is there a simple workaround for that scenario?

@rudolf
Contributor Author

rudolf commented Jan 19, 2023

I agree with Pierre that (1) would at least buy us some time.

Manually resharding the index would always be a last-resort workaround (for 10x or 100x the expected data size), but the downtime might be unacceptable for some users.

But (1)-(3) aren't really good long-term options. I've added (4) and (5), which are options we're exploring with the Elasticsearch team.
