Large ILM Task Batches are Executed too Slowly #82708

original-brownbear · 2022-01-18T10:12:56Z

In many-shards benchmarking we see a number of warnings about large ILM task batches being executed too slowly. When a large ILM batch hits us, it in fact temporarily breaks other cluster operations (namely index auto-create) during benchmarking, with something like this:

[2022-01-18T09:39:06,714][WARN ][o.e.c.s.MasterService    ] [elasticsearch-0] took [2m/122081ms] to compute cluster state update for [ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-3-182-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@a906e483], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-21-422-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@e130fd1b], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-6-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@62ce7de], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-20-340-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@c3de0651], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-18-3-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@25c11a0c], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-13-391-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@b9a62620], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-5-255-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@fd367a62], ilm-move-to-step {policy [auditbeat-quantitative], index 
[.ds-auditbeatquantitative-18-490-2022.01.18-000001], currentStep [{"phase":"new","action":"complete","name":"complete"}], nextStep [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}]}[org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask@18b804b6], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-173-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@9affdaef], ilm-move-to-step {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-3-554-2022.01.18-000001], currentStep [{"phase":"new","action":"complete","name":"complete"}], nextStep [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}]}[org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask@21601c28], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-2-191-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@aec149a6], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-13-107-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@1ccfb1e7], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-6-337-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@a9bac761], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-7-419-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@6edb77e], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-21-331-2022.01.18-000001], 
currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@22143dfe], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-23-42-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7ab5c764], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-14-382-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7a536a08], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-22-413-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@e50322f2], ilm-move-to-step {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-23-482-2022.01.18-000001], currentStep [{"phase":"new","action":"complete","name":"complete"}], nextStep [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}]}[org.elasticsearch.xpack.ilm.MoveToNextStepUpdateTask@11bc2770], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-20-4-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7bef66f9], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-264-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@ea4956da], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-5-346-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@9b8ae320], 
ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-6-428-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@fea41b42], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-7-237-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@35ad7636], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-15-373-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@148945d3], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-22-322-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@b89a87d3], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-13-51-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@53604882], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-10-29-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@2b8a46f], ilm-set-step-info {policy [ilm-history-ilm-policy], index [.ds-ilm-history-5-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@67d3f578], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-4-355-2022.01.18-000001], currentStep 
[{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@8b8ef995], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-11-125-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@7886bc1d], ilm-set-step-info {policy [auditbeat-quantitative], index [.ds-auditbeatquantitative-17-28-2022.01.18-000001], currentStep [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}]}[org.elasticsearch.xpack.ilm.SetStepInfoUpdateTask@70aff260], ... (12890 in total, 12858 omitted) (18168 tasks in total)], which exceeds the warn threshold of [10s]

The slowness in executing these batches is almost exclusively a result of needlessly rebuilding the full cluster state over and over in each task.

Each of these tasks tends to change only the per-index metadata and otherwise leaves the cluster state as is. We should fix the batching here so that each task no longer outputs a full cluster state. Instead, the tasks should be applied to a single builder in a loop so that only one cluster state is built instead of potentially thousands.
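To illustrate the proposed batching change, here is a minimal, self-contained sketch in plain Java. It does not use the real Elasticsearch classes: `State`, `Builder`, `StepTask`, and both `execute...` methods are hypothetical stand-ins, and the state is reduced to a map from index name to ILM step. It only shows the difference between building one full state per task and applying all tasks to a single builder.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are assumptions for illustration, not the Elasticsearch API.
class BatchedStateUpdate {

    /** Immutable stand-in for a cluster state: index name -> current ILM step. */
    static final class State {
        static int builds = 0; // counts how many full states have been constructed
        final Map<String, String> stepByIndex;

        State(Map<String, String> steps) {
            // Every build copies the whole map, mimicking an expensive full rebuild.
            this.stepByIndex = Collections.unmodifiableMap(new HashMap<>(steps));
            builds++;
        }

        static State empty() {
            return new State(new HashMap<>());
        }
    }

    /** Mutable builder that accumulates per-index changes before one final build. */
    static final class Builder {
        final Map<String, String> steps;

        Builder(State base) {
            this.steps = new HashMap<>(base.stepByIndex);
        }

        Builder putStep(String index, String step) {
            steps.put(index, step);
            return this;
        }

        State build() {
            return new State(steps);
        }
    }

    /** Hypothetical task that only touches one index's metadata. */
    static final class StepTask {
        final String index;
        final String step;

        StepTask(String index, String step) {
            this.index = index;
            this.step = step;
        }
    }

    /** Slow pattern: every task produces a complete new state. */
    static State executeOneStatePerTask(State current, List<StepTask> tasks) {
        State state = current;
        for (StepTask task : tasks) {
            state = new Builder(state).putStep(task.index, task.step).build();
        }
        return state;
    }

    /** Proposed pattern: all tasks share one builder, and the state is built once. */
    static State executeWithSingleBuilder(State current, List<StepTask> tasks) {
        Builder builder = new Builder(current);
        for (StepTask task : tasks) {
            builder.putStep(task.index, task.step);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        State base = State.empty();
        List<StepTask> tasks = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            tasks.add(new StepTask("index-" + i, "check-rollover-ready"));
        }

        State.builds = 0;
        State perTask = executeOneStatePerTask(base, tasks);
        System.out.println("one state per task, builds: " + State.builds);

        State.builds = 0;
        State batched = executeWithSingleBuilder(base, tasks);
        System.out.println("single builder, builds: " + State.builds);

        System.out.println("same result: " + perTask.stepByIndex.equals(batched.stepByIndex));
    }
}
```

In this toy model both approaches end with the same map, but the first copies the whole map once per task while the second copies it once per batch. How closely this mirrors the real cost depends on what a full cluster state build does, which the sketch does not attempt to model.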

relates #77466

elasticmachine · 2022-01-18T10:12:58Z

Pinging @elastic/es-data-management (Team:Data Management)

original-brownbear added the >bug, :Data Management/ILM+SLM (Index and Snapshot lifecycle management), and needs:triage (Requires assignment of a team area label) labels Jan 18, 2022

elasticmachine added the Team:Data Management (Meta label for data/management team) label Jan 18, 2022

original-brownbear mentioned this issue Jan 18, 2022

Fix Large Shard Count Scalability Issues #77466

Open

97 tasks

romseygeek removed the needs:triage (Requires assignment of a team area label) label Jan 18, 2022

joegallo self-assigned this Jan 28, 2022

This was referenced Mar 21, 2022

Optimize ImmutableOpenMap.Builder #85184

Merged

Speed up ILM cluster task execution #85405

Merged

joegallo closed this as completed in #85405 Mar 29, 2022

joegallo mentioned this issue Jan 4, 2023

Allow ILM to transition to implicit cached steps #91779

Merged
