Reindex API: Reindex task es_rejected_execution_exception search queue failure #26153
That's a potentially dangerous assumption for us to make. Elasticsearch is not in a position to assume which functions are the most important to your business (servicing public-facing searches vs running a background reindex task). I'm confused by your exhaustion of search queues. Unless you are using slicing (which it appears you are not) I would assume there shouldn't be any parallelisation of searches and hence no exhaustion of search queues caused directly by reindex. Presumably it is other search loads that are contributing to the search thread pool exhaustion.
Is it possible the long delay observed here is tied to the same problem of thread-pool exhaustion - you have a large number of other concurrent search operations ongoing?
Do you have more than one data point for your test using other settings?
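Since slicing keeps coming up: for readers unfamiliar with it, a minimal sketch of what a manually sliced reindex looks like in 5.x (hypothetical host; index names borrowed from the task output later in this thread; one request is submitted per slice id):

```sh
# Manual slicing splits the scroll into independent slices, each reindex request
# running its own scroll searches in parallel. Without a slice clause, reindex
# issues a single scroll at a time, so it adds little parallel search load itself.
curl -XPOST 'localhost:9200/_reindex?wait_for_completion=false' \
  -H 'Content-Type: application/json' -d '{
  "source": { "index": "largeindex", "slice": { "id": 0, "max": 2 } },
  "dest":   { "index": "largeindex.es5" }
}'
# A second request with "slice": { "id": 1, "max": 2 } covers the other half.
```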
This is probably worth another issue. Indeed, I had intended it to be the number of write operations per second (deletes for delete_by_query, updates for update_by_query, and indexes for reindex). And it writes the whole batch at once rather than attempting to smooth out the writes.
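For readers, a rough sketch of the arithmetic that behaviour implies (my own working under the batch-at-once assumption above, not text from the thread):

```sh
# Assumed throttling model: write the whole batch, then pad with a wait before
# the next scroll request.
#
#   target_time = batch_size / requests_per_second
#   wait_time   = target_time - time_spent_on_the_batch
#
# Example: batch_size = 10000 docs, requests_per_second = 0.5
#   target_time = 10000 / 0.5 = 20000 seconds between batch starts
#
# So a fractional requests_per_second combined with a large scroll size produces
# multi-hour pauses rather than a smooth trickle of requests, which matches the
# very long waits reported later in this thread.
```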
Reindex has had code to do that for a very long time but it seems to not be working. You can see we even count the number of retries: […]
We must not be picking up the rejection that you are seeing somehow. We test for this on every build by gumming up the thread pools and starting a reindex, ungumming them, and asserting that the reindex succeeded and counted some retries. We're obviously doing something wrong though. @andy-elastic, are you interested in looking at this or should I have a look later on?
@nik9000 yeah, I'll take a look and see if I can find out why this isn't being retried.
@berglh, to help me reproduce this: when you say […], do you mean that the higher you set […]?
No worries, I'll write this up soon.
I figured there must be some mechanism for this already, thanks for explaining it. Another thought I had with the […]
@andy-elastic There are two size directives in the Reindex API. My goal is to reindex a relatively large index, so I'm referring to the size of the scroll as specified in the […]
This setting will be limited by the […]. I found that I would hit this error relatively quickly at the […].
@markharwood All very good points. We have 10 unique Kibana instances running with multiple users. The actual search usage is pretty low and ad hoc, mostly when investigating something. We do have some dashboards that are periodically captured and displayed as images on TVs around our various IT departments; I don't believe these update more than once every 5 minutes. I also noticed this behaviour at relatively low ES utilisation. Unfortunately I don't have any metrics on average search requests per second, but I'm pretty confident that the large majority (90%+) would be coming from the Reindex API. In terms of data points, I've been spending the past week attempting to reindex after hitting the bug in Kibana 5.5.0 to weed out my field conflicts.
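For anyone reading along, the two size directives being referred to are, as I understand the 5.x Reindex API (a reader's sketch, not the original poster's wording; hypothetical host, index names from the task output below, values illustrative):

```sh
# The two "size" settings in a reindex request:
#   - the top-level "size" is the total number of documents to copy before stopping
#   - "source.size" is the scroll batch size, i.e. how many documents each scroll
#     search pulls (this is the one relevant to search-queue pressure)
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
  "size": 100000,
  "source": { "index": "largeindex", "size": 5000 },
  "dest": { "index": "largeindex.es5" }
}'
```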
When I set the […] I get the following results.
10000 Results
{
"completed": true,
"task": {
"node": "fmVI6xlZQCmhqZqVPIjfXA",
"id": 84157930,
"type": "transport",
"action": "indices:data/write/reindex",
"status": {
"total": 279063633,
"updated": 0,
"created": 26690000,
"deleted": 0,
"batches": 2669,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 2667995,
"requests_per_second": 10000,
"throttled_until_millis": 0
},
"description": "reindex from [anotherlargeindex] to [anotherlargeindex.es5]",
"start_time_in_millis": 1502431979655,
"running_time_in_nanos": 6195367033709,
"cancellable": true
},
"response": {
"took": 6195366,
"timed_out": false,
"total": 279063633,
"updated": 0,
"created": 26690000,
"deleted": 0,
"batches": 2669,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 2667995,
"requests_per_second": 10000,
"throttled_until_millis": 0,
"failures": [
{
"shard": -1,
"reason": {
"type": "es_rejected_execution_exception",
"reason": "rejected execution of org.elasticsearch.transport.TransportService$7@48752bce on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@53fe6f13[Running, pool size = 49, active threads = 49, queued tasks = 999, completed tasks = 7727372]]"
}
}
]
}
}
5000 Results
{
"_index": ".tasks",
"_type": "task",
"_id": "fmVI6xlZQCmhqZqVPIjfXA:81294668",
"_score": 1,
"_source": {
"completed": true,
"task": {
"node": "fmVI6xlZQCmhqZqVPIjfXA",
"id": 81294668,
"type": "transport",
"action": "indices:data/write/reindex",
"status": {
"total": 133317017,
"updated": 0,
"created": 133317017,
"deleted": 0,
"batches": 13332,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 26663379,
"requests_per_second": 5000,
"throttled_until_millis": 0
},
"description": "reindex from [largeindex] to [largeindex.es5]",
"start_time_in_millis": 1502414174752,
"running_time_in_nanos": 44458656303656,
"cancellable": true
},
"response": {
"took": 44458656,
"timed_out": false,
"total": 133317017,
"updated": 0,
"created": 133317017,
"deleted": 0,
"batches": 13332,
"version_conflicts": 0,
"noops": 0,
"retries": {
"bulk": 0,
"search": 0
},
"throttled_millis": 26663379,
"requests_per_second": 5000,
"throttled_until_millis": 0,
"failures": []
}
}
}
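For reference, a sketch of how a status document like the ones above is typically retrieved for a reindex started with wait_for_completion=false (the task id shown is the one from the first result; the .tasks lookup assumes the task has already completed):

```sh
# While the task is running (or shortly after), the task management API reports its status:
curl -XGET 'localhost:9200/_tasks/fmVI6xlZQCmhqZqVPIjfXA:84157930'
# Once a wait_for_completion=false reindex finishes, the full result is also persisted
# in the .tasks index and can be fetched directly:
curl -XGET 'localhost:9200/.tasks/task/fmVI6xlZQCmhqZqVPIjfXA:84157930'
```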
Proposal for disambiguation of requests_per_second as discussed in [Reindex API: Reindex task es_rejected_execution_exception search queue failure elastic#26153](elastic#26153 (comment)).
Thanks! I have #26185 on my list of things to review again today.
And my past mistakes continue to haunt me. One of them really should be called […].
I was able to reliably reproduce this by […], getting a task status similar to the one originally posted: a failure with the search queue rejection and no retries marked.
It seems like the index being created in 2.x is the determining factor here; I was not able to reproduce this with only indices created in 5.x. I'll see if I can make the reproduction steps a little simpler and rule out some other factors.
I was wrong about it only being indices created in 2.x; I'm able to reproduce it with indices created in 5.x now. I'm not sure what I was doing differently before, I must not have set the queue size low enough. It looks like this is reproducible in 5.5.1 on a single-node cluster with indices of any size. It does not reproduce on indices with only a single shard. I'll see if I can reproduce it in the test environment and find a cause.
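A sketch of the kind of single-node reproduction being described (a reader's paraphrase; the queue size, index names, and document counts are illustrative, not the exact steps used here):

```sh
# Start a single 5.5.x node with a deliberately tiny search queue so rejections are easy to hit.
# thread_pool.search.queue_size is the real node setting; 10 is just an aggressive value.
./bin/elasticsearch -E thread_pool.search.queue_size=10

# Create a multi-shard index with enough documents to need several scroll batches, then
# start a reindex and watch the task status for an es_rejected_execution_exception failure
# that reports zero search retries:
curl -XPOST 'localhost:9200/_reindex?wait_for_completion=false' \
  -H 'Content-Type: application/json' \
  -d '{ "source": { "index": "testindex" }, "dest": { "index": "testindex-copy" } }'
```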
I think I found why search requests aren't being retried here. When reindex gets the response back from a scroll, it retries if the request generates an exception. However, if the request completes with failures, it doesn't retry but still terminates the reindex when handling the scroll response. So I think what's happening here is that a scroll request is completing with failures.
That said, I'm not sure the cause of the failures @berglh is seeing is the same as what I've been reproducing here, because they look a little different. Mine consistently have a shard and node id associated, and I haven't seen any with the default […]. In my case, some shards (but not all) are failing in the search request because the queue is full. This doesn't get caught in our tests because the test index only uses one shard, so any search that fails has all of its shards failed. @nik9000, any ideas about what could cause a search failure with […]?
Not really. As for terminating the scroll when a shard fails: I don't know that you can (#26433) retry at the shard level. I'd forgotten about this when the issue came up in the first place... I think, though, that the […].
@colings86 pinged me about this issue, wanting to make sure that it is still appropriate to have this in […]. @andyb-elastic's […]. So this is officially blocked waiting on #26472.
@nik9000 right, I don't think we need any more feedback here, and we're waiting for #26472.
For more background, the reason we can't fix this while reindex uses scrolls is that it would lose some documents if we retried and continued. When a scroll fails on some shards and returns partial results, there is currently no way to rewind so that the reader is sure to get the missing documents. In the context of the reindex API, losing documents is clearly incorrect behavior, so the right thing to do is fail when this happens, even though it's unfortunately very inconvenient. When we replace reindex's use of scroll with #26472, we'll be able to retry this failure condition without losing documents.
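To illustrate the failure mode being discussed, this is roughly what a search or scroll response that "completes with failures" looks like (a reader's illustration of the standard response shape; index name, counts, and node id are made up):

```sh
# The request returns HTTP 200, so no exception is raised, but the body reports that
# only some shards answered. Retrying and continuing from here would silently skip
# the documents held by the shards that were rejected.
curl -s 'localhost:9200/largeindex/_search?scroll=5m&size=1000'
# ...
#   "_shards": {
#     "total": 5,
#     "successful": 3,
#     "failed": 2,
#     "failures": [
#       { "shard": 2, "node": "abc123", "reason": {
#           "type": "es_rejected_execution_exception",
#           "reason": "rejected execution ... queue capacity = 1000 ..." } }
#     ]
#   }
```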
@andyb-elastic is this something you are still working on?
@dakrone not actively, we're waiting on a new API from #26472. I think we can close this, as that feature isn't on the roadmap yet. Additional feedback is always welcome. In the meantime, users encountering this problem with scrolls should use […].
Environment
I have a 10-data-node, 5-master-only-node cluster, with the request handled by a data node.
The index in question has 5 primary shards and 1 replica.
Problem Description
Upon requesting a task to reindex an Elasticsearch 2.x-created index with a size of ~50 GB and a doc count of 133047546, the task completes with "completed": true even though Elasticsearch produced an error. The EsThreadPoolExecutor error reported indicates that the search queue capacity was exceeded, presumably by the scroll search requests contending for queue space.
To me, it appears that the Reindex API is stopping the task on any error. I can sort of appreciate that you don't want something to fail silently, but I believe it should be the job of the scroll client (in this case the Reindex API) to identify that the search queue has been exceeded and continue to retry.
This is kind of touched on in this GitHub issue: Reindex API: improve robustness in case of error
Increasing the Scroll size of the Reindex improves the ability of the Reindex API to make it most of the way through the process. However, on a large enough index, I continue to hit this problem.
My suggestion is that the Reindex API should retry on this soft error.
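For completeness, a sketch of the kind of request involved, showing where the scroll size and throttling knobs discussed above and below sit (index names taken from the task output; the parameter values are only illustrative):

```sh
# source.size is the scroll batch size; requests_per_second is the URL-level throttle.
curl -XPOST 'localhost:9200/_reindex?wait_for_completion=false&requests_per_second=5000' \
  -H 'Content-Type: application/json' -d '{
  "source": { "index": "largeindex", "size": 10000 },
  "dest": { "index": "largeindex.es5" }
}'
```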
Supplementary Problem
In addition to this, I have a problem with the description of the requests_per_second URI parameter in the documentation: Reindex API: URL Parameters. I interpret this instruction as the value of requests_per_second limiting the number of scroll searches or ES bulk writes as "requests" conducted per second.
What I actually experienced was that setting requests_per_second to 0.5 resulted in a wait time of ~15500 seconds for a bulk size of 10000. It seems like this setting actually restricts the number of search results per second, or is creating bucket loads of write requests for a scroll size of 10000.
I tried to use this to limit the impact of the reindex on the search queues, but not until I set this value to 5000 for a scroll size of 10000 did I start to see the kind of rate limiting that I am after. I can open another issue for this if required; I'm not sure if there is a bug in ES bulk writing or just a disambiguation problem.
Reproduction Command
Output