-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snapshot creations have huge heap footprint after abrupt full-cluster restart #89952
Closed
Tracked by
#77466
Labels
>bug
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
Comments
DaveCTurner
added
>bug
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
labels
Sep 9, 2022
elasticsearchmachine
added
the
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
label
Sep 9, 2022
Pinging @elastic/es-distributed (Team:Distributed) |
97 tasks
original-brownbear
added a commit
to original-brownbear/elasticsearch
that referenced
this issue
Nov 22, 2022
Low effort imporovement to elastic#89952 mostly. We should be turning the index identifier lookup into an immutable map when parsing these right away. We do this conversion/copy for both the adding and removing of snapshots from the `IndexMetadataGenerations` later on anyways so this doesn't add any CPU cost overall. What it does however is save a massive amount of heap for single index snapshots (where the overhead of hash map over the immutable map is the greatest) when first parsing this structure from the repo and potentially having it duplicated on heap many times over due to elastic#89952.
original-brownbear
added a commit
that referenced
this issue
Nov 23, 2022
Low effort imporovement to #89952 mostly. We should be turning the index identifier lookup into an immutable map when parsing these right away. We do this conversion/copy for both the adding and removing of snapshots from the `IndexMetadataGenerations` later on anyways so this doesn't add any CPU cost overall. What it does however is save a massive amount of heap for single index snapshots (where the overhead of hash map over the immutable map is the greatest) when first parsing this structure from the repo and potentially having it duplicated on heap many times over due to #89952.
original-brownbear
added a commit
to original-brownbear/elasticsearch
that referenced
this issue
Nov 23, 2022
…ng repository instance This makes use of the new deduplicator infrastructure to move to more efficient deduplication mechanics. The existing solution hardly ever deduplicated because it would only deduplicate after the repository entered a consistent state. The adjusted solution is much simpler, in that it simply deduplicates such that only a single loading of `RepositoryData` will ever happen at a time, fixing memory issues from massively concurrent loading of the repo data as described in elastic#89952. closes elastic#89952
original-brownbear
added a commit
that referenced
this issue
Nov 23, 2022
…ng repository instance (#91851) This makes use of the new deduplicator infrastructure to move to more efficient deduplication mechanics. The existing solution hardly ever deduplicated because it would only deduplicate after the repository entered a consistent state. The adjusted solution is much simpler, in that it simply deduplicates such that only a single loading of `RepositoryData` will ever happen at a time, fixing memory issues from massively concurrent loading of the repo data as described in #89952. closes #89952
original-brownbear
added a commit
to original-brownbear/elasticsearch
that referenced
this issue
Nov 23, 2022
…ng repository instance (elastic#91851) This makes use of the new deduplicator infrastructure to move to more efficient deduplication mechanics. The existing solution hardly ever deduplicated because it would only deduplicate after the repository entered a consistent state. The adjusted solution is much simpler, in that it simply deduplicates such that only a single loading of `RepositoryData` will ever happen at a time, fixing memory issues from massively concurrent loading of the repo data as described in elastic#89952. closes elastic#89952
elasticsearchmachine
pushed a commit
that referenced
this issue
Jan 3, 2023
…-caching repository instance (#91851) (#91866) * Simplify and optimize deduplication of RepositoryData for a non-caching repository instance (#91851) This makes use of the new deduplicator infrastructure to move to more efficient deduplication mechanics. The existing solution hardly ever deduplicated because it would only deduplicate after the repository entered a consistent state. The adjusted solution is much simpler, in that it simply deduplicates such that only a single loading of `RepositoryData` will ever happen at a time, fixing memory issues from massively concurrent loading of the repo data as described in #89952. closes #89952 * fix compile
original-brownbear
added a commit
to original-brownbear/elasticsearch
that referenced
this issue
Jan 4, 2023
…-caching repository instance (elastic#91851) (elastic#91866) * Simplify and optimize deduplication of RepositoryData for a non-caching repository instance (elastic#91851) This makes use of the new deduplicator infrastructure to move to more efficient deduplication mechanics. The existing solution hardly ever deduplicated because it would only deduplicate after the repository entered a consistent state. The adjusted solution is much simpler, in that it simply deduplicates such that only a single loading of `RepositoryData` will ever happen at a time, fixing memory issues from massively concurrent loading of the repo data as described in elastic#89952. closes elastic#89952 * fix compile
elasticsearchmachine
pushed a commit
that referenced
this issue
Jan 4, 2023
…-caching repository instance (#91851) (#91866) (#92661) * Simplify and optimize deduplication of RepositoryData for a non-caching repository instance (#91851) This makes use of the new deduplicator infrastructure to move to more efficient deduplication mechanics. The existing solution hardly ever deduplicated because it would only deduplicate after the repository entered a consistent state. The adjusted solution is much simpler, in that it simply deduplicates such that only a single loading of `RepositoryData` will ever happen at a time, fixing memory issues from massively concurrent loading of the repo data as described in #89952. closes #89952 * fix compile
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
>bug
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
If the cluster shuts down while updating the root repository data blob then it will set
BlobStoreRepository#uncleanStart
on startup, which causes Elasticsearch to skip the caching ofRepositoryData
in favour of reading the blob afresh from the repository each time it's needed.If on startup ILM finds indices waiting to move to the searchable snapshot phase then it will attempt to create snapshots of each such index. Each create-snapshot task holds a reference to the
RepositoryData
it captured when the task was submitted.The trouble is that each
RepositoryData
instance could be tens of MBs in size and whileuncleanStart
is set there is no sharing between these instances. In the case of this I saw,RepositoryData
was ~58MiB and there were 17 create-snapshot tasks in the queue, so these tasks alone consumed almost 1GiB of heap. There were also 6snapshot_meta
threads all busy loading more copies ofRepositoryData
with a total of 530MiB of local state.Relates #77466
Workaround
Clearing the
uncleanStart
flag should restore the caching (and hence sharing) ofRepositoryData
again:The text was updated successfully, but these errors were encountered: