Large ILM Task Batches are Executed too Slowly #82708

Closed
Tracked by #77466
original-brownbear opened this issue Jan 18, 2022 · 1 comment · Fixed by #85405
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team

Comments

@original-brownbear
Member

In many-shards benchmarking we see a number of warnings about large ILM task batches being executed too slowly. When a large ILM batch hits, it in fact temporarily breaks other cluster operations (namely index-auto-create) during benchmarking. The warning looks something like this:

[2022-01-18T09:39:06,714][WARN ][o.e.c.s.MasterService    ] [elasticsearch-0] took [2m/122081ms] to compute cluster state update for [ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-3-182-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@a906e483], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-21-422-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@e130fd1b], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-6-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@62ce7de], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-20-340-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@c3de0651], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-18-3-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@25c11a0c], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-13-391-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@b9a62620], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-5-255-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@fd367a62], ilm-move-to-step {policy [auditbeat-quantitative], index 
[.ds-auditbeatquantitative-18-490-2022.01.18-000001], currentStep [{"phase":"new","action":"complete","name":"complete"}], nextStep [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}]}[org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask@18b804b6], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-173-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@9affdaef], ilm-move-to-step {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-3-554-2022.01.18-000001], currentStep [{"phase":"new","action":"complete","name":"complete"}], nextStep [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}]}[org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask@21601c28], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-2-191-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@aec149a6], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-13-107-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@1ccfb1e7], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-6-337-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@a9bac761], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-7-419-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@6edb77e], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-21-331-2022.01.18-000001], 
currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@22143dfe], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-23-42-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7ab5c764], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-14-382-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7a536a08], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-22-413-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@e50322f2], ilm-move-to-step {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-23-482-2022.01.18-000001], currentStep [{"phase":"new","action":"complete","name":"complete"}], nextStep [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}]}[org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask@11bc2770], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-20-4-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7bef66f9], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-264-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@ea4956da], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-5-346-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@9b8ae320], 
ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-6-428-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@fea41b42], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-7-237-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@35ad7636], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-15-373-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@148945d3], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-22-322-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@b89a87d3], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-13-51-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@53604882], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-10-29-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@2b8a46f], ilm-set-step-info {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@67d3f578], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-355-2022.01.18-000001], currentStep 
[{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@8b8ef995], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-11-125-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7886bc1d], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-17-28-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@70aff260], ... (12890 in total, 12858 omitted) (18168 tasks in total)], which exceeds the warn threshold of [10s]

The slowness in executing these batches is almost exclusively a result of needlessly rebuilding the full cluster state over and over in each task.


Each of these tasks tends to change only the per-index metadata and otherwise leaves the cluster state as is. We should fix the batching here so that each task no longer outputs a full cluster state. Instead, the tasks should be applied to a single builder in a loop, so that only one cluster state is built instead of potentially thousands.
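
To make the difference concrete, here is a minimal, self-contained sketch of the two approaches. It is not Elasticsearch code: `State`, `Task`, `applyOneByOne`, and `applyBatched` are hypothetical stand-ins invented for illustration, and the "cluster state" is reduced to a map from index name to its current ILM step. The `builds` counter only exists to show how many full states each approach constructs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchedUpdateSketch {

    /** Hypothetical stand-in for an immutable cluster state: index name -> ILM step. */
    static final class State {
        static int builds = 0; // number of State objects constructed so far
        final Map<String, String> stepByIndex;

        /** Takes ownership of the given map; callers must not modify it afterwards. */
        State(Map<String, String> stepByIndex) {
            this.stepByIndex = Collections.unmodifiableMap(stepByIndex);
            builds++;
        }
    }

    /** Hypothetical task that only touches the metadata of a single index. */
    static final class Task {
        final String index;
        final String step;

        Task(String index, String step) {
            this.index = index;
            this.step = step;
        }
    }

    /** Slow variant: every task copies everything and emits a complete new state. */
    static State applyOneByOne(State initial, List<Task> tasks) {
        State current = initial;
        for (Task task : tasks) {
            Map<String, String> copy = new HashMap<>(current.stepByIndex); // full copy per task
            copy.put(task.index, task.step);
            current = new State(copy);
        }
        return current;
    }

    /** Batched variant: all tasks write into one builder, one state is built at the end. */
    static State applyBatched(State initial, List<Task> tasks) {
        Map<String, String> builder = new HashMap<>(initial.stepByIndex); // single copy
        for (Task task : tasks) {
            builder.put(task.index, task.step);
        }
        return new State(builder);
    }

    public static void main(String[] args) {
        Map<String, String> steps = new HashMap<>();
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            steps.put("index-" + i, "new");
            tasks.add(new Task("index-" + i, "check-rollover-ready"));
        }
        State initial = new State(steps);

        State.builds = 0;
        State slow = applyOneByOne(initial, tasks);
        int slowBuilds = State.builds;

        State.builds = 0;
        State fast = applyBatched(initial, tasks);
        int fastBuilds = State.builds;

        System.out.println("one-by-one builds: " + slowBuilds + ", batched builds: " + fastBuilds);
        System.out.println("same result: " + slow.stepByIndex.equals(fast.stepByIndex));
    }
}
```

In this toy model the one-by-one variant does work proportional to (number of tasks) x (number of indices), while the batched variant does work proportional to their sum. The real cluster state is far more expensive to rebuild than a map copy, so the sketch only illustrates the shape of the problem, not its actual cost.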

relates #77466

@original-brownbear original-brownbear added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management needs:triage Requires assignment of a team area label labels Jan 18, 2022
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jan 18, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)
