Change put mapping priority to URGENT #42105

Closed
Bukhtawar opened this issue May 12, 2019 · 11 comments
Labels
:Distributed Coordination/Cluster Coordination (cluster formation and cluster state publication, including cluster membership and fault detection), feedback_needed

Comments

@Bukhtawar
Contributor

Problem Description

Bulk index requests that change the mapping of an index can trigger put-mapping requests. These requests need to be sent to the master for processing, and while they are pending, other bulk requests are held up on the coordinating node. This is a problem if there are items with priority URGENT already queued up on the master (too many shard-started events due to fluctuating nodes), as this causes a surge in memory pressure on the coordinating nodes and slows down bulk indexing.
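For illustration, a minimal sketch (Python client; endpoint, index and field names are hypothetical) of the kind of bulk request that triggers these put-mapping tasks:

```python
# Hypothetical example: each previously unseen field in a bulk request forces a
# dynamic put-mapping task on the master; the coordinating node holds the bulk
# request (and its memory) until the master acknowledges the mapping change.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

actions = [
    {"_index": "metrics-000001", "_source": {"host": "web-1", f"metric_{i}": i}}
    for i in range(100)  # 100 new field names -> dynamic mapping updates
]
helpers.bulk(es, actions)
```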

@DaveCTurner added the :Distributed Coordination/Cluster Coordination label May 12, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

too many shard-started events due to fluctuating nodes

Fluctuating nodes sounds like the more fundamental problem here. Why is this happening?

If your cluster is unstable then I would expect performance issues such as the one you describe, and although the performance characteristics would be different with the change you propose I don't think they would necessarily be better. Starting shards in a timely fashion is quite important.

The consequence of delaying a mapping update should be a bounded amount of buffering on the coordinator; if this bound is exceeded then the coordinator should start to reject requests. Is this what you're seeing?
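As an aside, a minimal sketch of an ingest client absorbing that kind of rejection-based back-pressure, assuming the Python bulk helper and placeholder endpoint/index names; rejected chunks are retried with backoff rather than resent immediately:

```python
# Illustrative only: the bulk helper retries a chunk when the cluster responds
# with HTTP 429 (rejected), backing off between attempts instead of piling
# more work onto an already saturated cluster.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
actions = (
    {"_index": "my-index", "_source": {"message": f"doc {i}"}} for i in range(10_000)
)

helpers.bulk(
    es,
    actions,
    max_retries=5,        # retry chunks rejected with HTTP 429
    initial_backoff=2,    # seconds before the first retry
    max_backoff=60,       # cap on the exponential backoff
)
```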

@Bukhtawar
Contributor Author

Bukhtawar commented May 12, 2019

By fluctuating nodes I mean events like a cluster scale-up/scale-down or zone changes, which can also cause this, essentially slowing down throughput. As far as memory on the coordinating node is concerned, here is another problem: #35564. Changing the priority helps both cases, although each would need its own specialized solution. Do you think there are major gains in keeping the priority HIGH, other than fewer URGENT events? Also, in general, should marking a shard started during relocation (cluster health GREEN throughout) take precedence over a task that slows down ingestion?

@DaveCTurner
Contributor

By fluctuating nodes I mean events like a cluster scale-up/scale-down or zone changes, which can also cause this

I would not expect scaling the cluster up or down to swamp the master with shard-started tasks as you describe. I would expect mechanisms like cluster.routing.allocation.node_concurrent_recoveries to limit the number of such tasks happening at any one time, unless you have disabled this mechanism. Can you describe the scaling event in more detail to clarify the sequence of events that lead to this problem?

As far as memory on the coordinating node is concerned, here is another problem: #35564.

I think that issue is now mis-titled, because I expect the coordinating node not to run out of memory in such a situation in recent versions. That issue is about making failed put-mapping calls fail faster, which would cause bulk rejections sooner. That doesn't sound like what you want. I've adjusted the title of the linked issue.

@Bukhtawar
Contributor Author

Bukhtawar commented May 12, 2019

There is no notion of a cluster_concurrent_recoveries setting to limit the number of relocations across the whole cluster, which is what impacts the master, especially when the cluster has a larger number of nodes, typically around 50-100, and up to 10-15k shards.
You are right that the linked issue doesn't exactly correlate with the proposed solution (the point there is to not queue work up for the master unless needed), but it makes sense until that is sorted out, hence the reference.
There are two major concerns here with the priority of the task, due to pending tasks on the master:

  1. additional memory pressure on the coordinating node
  2. ingestion slowdown.

@DaveCTurner
Contributor

I don't understand why cluster.routing.allocation.node_concurrent_recoveries doesn't help here. With 100 nodes there should be at most 200 concurrent recoveries by default, across the cluster, and the master should be able to handle this easily. We need more information about the specific situation you're talking about here, ideally with logs or other data showing the details of the problem.
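For reference, a sketch of checking and pinning that setting through the cluster settings API (Python client; the endpoint is a placeholder and 2 is simply the default value, shown for illustration):

```python
# Illustrative only: inspect and (if needed) adjust
# cluster.routing.allocation.node_concurrent_recoveries via the settings API.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Confirm the effective value (the default is 2 per node).
print(es.cluster.get_settings(include_defaults=True, flat_settings=True))

# Pin it explicitly; raising it trades faster recovery for more load per node.
es.cluster.put_settings(
    body={"transient": {"cluster.routing.allocation.node_concurrent_recoveries": 2}}
)
```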

@Bukhtawar
Contributor Author

Typically not all 200 shards start at the same time (due to varying sizes), and once a shard does start, the master queues up a shard-started event. The tasks are processed by the master in batches, and there can be multiple such batches owing to the different times at which they get queued (say some shards start after the master has begun processing the batch of shards that already started).
Now, coming to the processing of a single batch: the complexity of a single allocation cycle (per batch) is roughly O(#shards), and here is a set of benchmarks I ran whilst working on improving that time with @ywelsch. The tp90 for a single round of allocation with 10k shards came to around a few seconds, and with 30k shards it is around 15-20s.
With back-to-back batches of URGENT shard-started events, the mapping update needs to be put on hold.

@DaveCTurner
Contributor

Ok I think I see. You seem to be saying that the master is receiving a stream of relatively few shard-started events over a longer period of time, which is starving the mapping updates of time on the master. Am I understanding this correctly?

If so, I don't think adjusting the priority of mapping updates will solve this problem. Even if mapping update tasks were top priority they'd still occasionally have to wait for a lower-priority task to finish, since there is no pre-emption mechanism. If they have equal priority then you would still see starvation occurring, because the tasks aren't executed fairly.

I also think that increasing the priority of things is the wrong way to address this kind of performance issue. The logical conclusion of that line of thinking is that everything becomes top priority, but then nothing works. Instead, I think you should look at the following areas:

  • your cluster is generating too many shard-started tasks because of fluctuating nodes. Why are they fluctuating?

  • you are using dynamic mapping updates, which will always be a scaling bottleneck since they require the involvement of the master (a sketch of avoiding them follows this list).

  • your ingest process cannot cope with back-pressure from Elasticsearch.

  • allocation is slower than you would like on clusters like yours with tens of thousands of shards. You suggested some ideas for improvements there but it seems that they haven't been enough. Would you continue working on this area?
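On the dynamic-mapping point above, a minimal sketch of the usual way to take those updates off the hot path: declare the mapping up front and set `dynamic` to `strict` (or `false`) so indexing never needs a put-mapping round trip to the master. Index, field and endpoint names here are hypothetical:

```python
# Sketch: with an explicit mapping and "dynamic": "strict", documents with
# unmapped fields are rejected instead of triggering put-mapping tasks.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.indices.create(
    index="logs-000001",
    body={
        "mappings": {
            "dynamic": "strict",  # reject documents containing unmapped fields
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "text"},
            },
        }
    },
)
```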

@Bukhtawar
Contributor Author

Ok I think I see. You seem to be saying that the master is receiving a stream of relatively few shard-started events over a longer period of time, which is starving the mapping updates of time on the master. Am I understanding this correctly?

Yes that seems correct

If so, I don't think adjusting the priority of mapping updates will solve this problem. Even if mapping update tasks were top priority they'd still occasionally have to wait for a lower-priority task to finish, since there is no pre-emption mechanism. If they have equal priority then you would still see starvation occurring, because the tasks aren't executed fairly.

I understand there is no pre-emption, but a mapping update in the midst of a stream of shard-started events would not need to be put on hold. It would be served once the ongoing cycle of shard allocation completes, instead of waiting for the storm of events to finish. There is a significant time gap between the two.

I also think that increasing the priority of things is the wrong way to address this kind of performance issue. The logical conclusion of that line of thinking is that everything becomes top priority, but then nothing works. Instead, I think you should look at the following areas:

IMHO shard relocation should have a relatively low priority compared to mapping updates, especially given that mapping updates directly impact client traffic and are relatively faster and cheaper than shard-started tasks. Anything that slows down indexing should be top priority, and bumping up the priority has value in cutting down the wait time in the queue.

Having said that, fluctuating nodes were just one scenario; AZ/hardware changes are the most common ones. Working on the remaining part of the optimization is on my radar, and I will get back on that. The optimization wouldn't cover all cases, and the worst case would still be a few seconds per iteration.

@DaveCTurner
Contributor

It would be served once the ongoing cycle of shard allocation completes, instead of waiting for the storm of events to finish

This is not true, because (as I said) the enqueued tasks aren't executed fairly.

Anything that slows down indexing should be top priority

I do not share this opinion. Starting shards promptly is vital for resilience, and I think it is correct to prioritise resilience over indexing performance as we do today. I will re-iterate that the thing that's slowing your indexing down is your use of dynamic mapping updates, which means that your indexing process needs the master's involvement. Solve that, and the problem you describe here will go away.

We've discussed this internally and concluded that adjusting the priorities as suggested here is not the right thing to do, so I'm closing this.

@Bukhtawar
Contributor Author

Starting shards promptly is vital for resilience

I think it's important for unassigned shards, but does it hold for shards that are already assigned? The point here is: why should indexing performance suffer if, say, a rebalance is underway?

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 16, 2019
Today we reroute the cluster as part of the process of starting a shard, which
runs at `URGENT` priority. In large clusters, rerouting may take some time to
complete, and this means that a mere trickle of shard-started events can cause
starvation for other, lower-priority, tasks that are pending on the master.

However, it isn't really necessary to perform a reroute when starting a shard,
as long as one occurs eventually. This commit removes the inline reroute from
the process of starting a shard and replaces it with a deferred one that runs
at `NORMAL` priority, avoiding starvation of higher-priority tasks.

This may improve some of the situations related to elastic#42738 and elastic#42105.
DaveCTurner added a commit that referenced this issue Jul 18, 2019
* Defer reroute when starting shards

* Specific test case for followup priority setting

We cannot set the priority in all InternalTestClusters because the deprecation
warning makes some tests unhappy. This commit adds a specific test instead.

* Checkstyle

* Cluster state always changed here

* Assert consistency of routing nodes

* Restrict setting only to reasonable priorities
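For completeness, a sketch of how the follow-up priority setting mentioned above might be adjusted once available. The setting name used here, `cluster.routing.allocation.shard_state.reroute.priority`, is an assumption not confirmed in this thread, and the value is only an example:

```python
# Assumed setting name (not confirmed in this thread):
# cluster.routing.allocation.shard_state.reroute.priority
# Per the commit message, it is restricted to "reasonable priorities",
# e.g. urgent, high or normal.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.cluster.put_settings(
    body={
        "persistent": {
            "cluster.routing.allocation.shard_state.reroute.priority": "urgent"
        }
    }
)
```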
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jul 18, 2019