-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Track the count of failed invocations since last successful policy snapshot #88398
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Hi @jbaiera, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks good to me, but I think we should make it non-null and treat the default missing value as 0 invocations, what do you think?
...gin/core/src/main/java/org/elasticsearch/xpack/core/slm/SnapshotLifecyclePolicyMetadata.java
Outdated
Show resolved
Hide resolved
@elasticmachine run elasticsearch-ci/docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* upstream/master: (38 commits) Simplify map copying (elastic#88432) Make DiffableUtils.diff implementation agnostic (elastic#88403) Ingest: Start separating Metadata from IngestSourceAndMetadata (elastic#88401) Move runtime fields base scripts out of scripting fields api package. (elastic#88488) Enable TRACE Logging for test and increase timeout (elastic#88477) Mute ReactiveStorageIT#testScaleDuringSplitOrClone (elastic#88480) Track the count of failed invocations since last successful policy snapshot (elastic#88398) Avoid noisy exceptions on data nodes when aborting snapshots (elastic#88476) Fix ReactiveStorageDeciderServiceTests testNodeSizeForDataBelowLowWatermark (elastic#88452) INFO logging of snapshot restore and completion (elastic#88257) unmute test (elastic#88454) Updatable API keys - noop check (elastic#88346) Corrected an incomplete sentence. (elastic#86542) Use consistent shard map type in IndexService (elastic#88465) Stop registering TestGeoShapeFieldMapperPlugin in ESIntegTestCase (elastic#88460) TSDB: RollupShardIndexer logging improvements (elastic#88416) Audit API key ID when create or grant API keys (elastic#88456) Bound random negative size test in SearchSourceBuilderTests#testNegativeSizeErrors (elastic#88457) Updatable API keys - logging audit trail event (elastic#88276) Polish reworked LoggedExec task (elastic#88424) ... # Conflicts: # x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/RollupShardIndexer.java
When an automated snapshot fails, the last failure for a policy is captured and stored in the cluster state. Similarly, we store the last successful snapshot invocation as well. We do not track how many invocations have passed between a successful snapshot and the most recent failure. These stats would be helpful for reporting on SLM policy health.
Instead of a fixed delay, snapshot lifecycle policies are scheduled using a cron expression which can produce variable execution times between snapshot attempts. This makes it difficult to select a window of time where continuous snapshot failure becomes indicative of a problem instead of a transient issue. By including the count of failed invocations since last success we can provide health reporting logic that allows for some transient failures while remaining agnostic of variable execution times that cron can produce.