
Reindex API: Reindex task es_rejected_execution_exception search queue failure #26153

Closed
berglh opened this issue Aug 11, 2017 · 14 comments
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.

Comments

@berglh
Contributor

berglh commented Aug 11, 2017

Environment

I have a 10 data node, 5 master only node cluster with the request handled by a data node.
The index in question has 5 primary shards and 1 replica.

Product Version
Elasticsearch (Official Container) 5.5.1 (Official Elastic Container)
Plugins X-Pack Removed
OpenJDK 1.8.0_141
Docker 17.05.0-ce, build 89658be
Oracle Linux Server 7.3
Kernel Linux 3.10.0-514.21.2.el7.x86_64

Problem Description

Upon requesting a task to reindex an Elasticsearch 2.x-created index of ~50 GB and 133047546 documents, the task completes with a status of true even though Elasticsearch produced an error. The reported EsThreadPoolExecutor error indicates that the search queue capacity was exceeded, presumably by the Scroll search requests contending for queue space.

To me, it appears that the Reindex API stops the task on any error. I can appreciate that you don't want something to fail silently, but I believe it should be the job of the Scroll client (in this case the Reindex API) to identify the search queue has been exceeded and continue to retry.

This is kind of touched on in this GitHub issue: Reindex API: improve robustness in case of error

Increasing the Scroll size of the Reindex improves the ability of the Reindex API to make it most of the way through the process. However, on a large enough index, I continue to hit this problem.

My suggestion is that the Reindex API should retry on this soft error.

Supplementary Problem

In addition to this, I have a problem with the description of the requests_per_second URI parameter in the documentation: Reindex API: URL Parameters.

requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc) and throttles the number of requests per second that the reindex issues or it can be set to -1 to disabled throttling. The throttling is done waiting between bulk batches so that it can manipulate the scroll timeout. The wait time is the difference between the time it took the batch to complete and the time requests_per_second * requests_in_the_batch. Since the batch isn’t broken into multiple bulk requests large batch sizes will cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth". The default is -1.

I interpret this as saying that the value of requests_per_second limits the number of "requests", i.e. Scroll searches or ES bulk writes, conducted per second.

What I actually experienced was that setting requests_per_second to 0.5 resulted in a wait time of ~15500 seconds for a bulk size of 10000. It seems like this setting actually restricts the number of search results per second, or it is creating bucket loads of write requests for a Scroll size of 10000.

I tried to use this to limit the impact of the Reindex on the search queues, but not until I set this value to 5000 for a Scroll size of 10000 did I start to see the kind of rate limiting I am after. I can open another issue for this if required; I'm not sure if there is a bug in ES bulk writing or just a disambiguation problem.
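
For clarity, I have been applying the throttle as a URL parameter on the reindex request, roughly like this; a running task can apparently also be adjusted with the _rethrottle endpoint (task id taken from the output below):

POST _reindex?wait_for_completion=false&requests_per_second=0.5
{
  "source": { "index": "largeindex", "size": 10000 },
  "dest": { "index": "largeindex.es5" }
}

POST _reindex/fmVI6xlZQCmhqZqVPIjfXA:71121809/_rethrottle?requests_per_second=-1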

Reproduction Command

POST _reindex?wait_for_completion=false&wait_for_active_shards=all
{
  "source": {
    "index": "largeindex",
    "size": 10000,
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "@timestamp": {
                "gte": "2016-01-01T00:00:00.000",
                "lte": "2017-10-11T00:00:00.000",
                "time_zone": "+10:00"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "largeindex.es5"
  }
}
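
Because wait_for_completion=false only returns a task id, the status below is fetched after the fact with something like this (the task API while it runs, or the .tasks document once it has completed):

GET _tasks/fmVI6xlZQCmhqZqVPIjfXA:71121809

GET .tasks/task/fmVI6xlZQCmhqZqVPIjfXA:71121809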

Output

{
  "_index" : ".tasks",
  "_type" : "task",
  "_id" : "fmVI6xlZQCmhqZqVPIjfXA:71121809",
  "_score" : 1.0,
  "_source" : {
    "completed" : true,
    "task" : {
      "node" : "fmVI6xlZQCmhqZqVPIjfXA",
      "id" : 71121809,
      "type" : "transport",
      "action" : "indices:data/write/reindex",
      "status" : {
        "total" : 133047546,
        "updated" : 0,
        "created" : 74140000,
        "deleted" : 0,
        "batches" : 7414,
        "version_conflicts" : 0,
        "noops" : 0,
        "retries" : {
          "bulk" : 0,
          "search" : 0
        },
        "throttled_millis" : 0,
        "requests_per_second" : -1.0,
        "throttled_until_millis" : 0
      },
      "description" : "reindex from [largeindex] to [largeindex.es5]",
      "start_time_in_millis" : 1502331869750,
      "running_time_in_nanos" : 9825436625697,
      "cancellable" : true
    },
    "response" : {
      "took" : 9825436,
      "timed_out" : false,
      "total" : 133047546,
      "updated" : 0,
      "created" : 74140000,
      "deleted" : 0,
      "batches" : 7414,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0,
      "failures" : [
        {
          "shard" : -1,
          "reason" : {
            "type" : "es_rejected_execution_exception",
            "reason" : "rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@63806e47 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@229e062f[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 6373350]]"
          }
        }
      ]
    }
  }
}
@berglh changed the title from "Reindex API: Reindex task completes as true after search queue failure" to "Reindex API: Reindex task es_rejected_execution_exception search queue failure" on Aug 11, 2017
@markharwood
Contributor

I believe it should be the job of the Scroll client (in this case the Reindex API) to identify the search queue has been exceeded and continue to retry.

That's a potentially dangerous assumption for us to make. Elasticsearch is not in a position to assume which functions are the most important to your business (servicing public-facing searches vs running a background reindex task).

I'm confused by your exhaustion of search queues. Unless you are using slicing (which it appears you are not) I would assume there shouldn't be any parallelisation of searches and hence no exhaustion of search queues caused directly by reindex. Presumably it is other search loads that are contributing to the search thread pool exhaustion.
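
For reference, slicing splits a reindex into parallel scrolls, either manually per request or automatically with the slices URL parameter, roughly like this (purely as an illustration):

POST _reindex?slices=5&wait_for_completion=false
{
  "source": { "index": "largeindex" },
  "dest": { "index": "largeindex.es5" }
}

Each slice then runs its own scroll, which is where reindex itself would start competing for search queue slots.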

Supplementary Problem In addition to this...

Is it possible the long delay observed here is tied to the same problem of thread-pool exhaustion - you have a large number of other concurrent search operations ongoing?

What I experienced actually was that setting requests_per_second to 0.5 resulted in a wait time of ~15500 seconds

Do you have more than one data point for your test using other settings?

@nik9000
Member

nik9000 commented Aug 11, 2017

In addition to this, I have a problem with the description of the requests_per_second URI parameter in the documentation: Reindex API: URL Parameters.

This is probably worth another issue. Indeed, I had intended it to be the number of write operations per second (deletes for delete_by_query, updates for update_by_query, and indexes for reindex). And it writes the whole batch at once rather than attempting to smooth out the writes. So .5 would write all 10000 documents and then sleep 20000 seconds minus the amount of time that the write took. The delay you see is expected. I think that problem is one of documentation. The docs should say "index requests" or "delete requests" or "update requests".
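
As a rough sketch of that arithmetic (the batch duration here is a made-up number, just to show the shape of it):

batch_size = 10000                                    # documents written per scroll batch
requests_per_second = 0.5
target_seconds = batch_size / requests_per_second     # 20000 seconds allotted to the batch
batch_took_seconds = 45                               # hypothetical time the bulk write actually took
sleep_seconds = max(0, target_seconds - batch_took_seconds)  # ~19955 seconds before the next batch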

My suggestion is that the Reindex API should retry on this soft error.

Reindex has had code to do that for a very long time but it seems to not be working. You can see we even count the number of retries:

      "retries" : {
        "bulk" : 0,
        "search" : 0
      },

We must not be picking up the rejection that you are seeing somehow. We test for this on every build by gumming up the thread pools and starting a reindex, ungumming them, and asserting that the reindex succeeded and counted some retries. We're obviously doing something wrong though.

@andy-elastic, are you interested in looking at this or should I have a look later on?

@andyb-elastic
Contributor

@nik9000 yeah, I'll take a look and see if I can find out why this isn't being retried. Setting thread_pool.search.queue_size very low would be enough to cause the search queue to fill up, right? It looks like that's what we do in RetryTests.
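
For anyone following along, that's something like this in elasticsearch.yml (or the equivalent node setting in the test), so the search queue fills and starts rejecting almost immediately:

thread_pool.search.queue_size: 1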

@berglh to help me reproduce this, when you say

Increasing the Scroll size of the Reindex improves the ability of the Reindex API to make it most of the way through the process. However, on a large enough index, I continue to hit this problem.

do you mean that the higher you set size in the reindex request, the more likely it is to finish without this failure?

@berglh
Contributor Author

berglh commented Aug 12, 2017

@nik9000

This is probably worth another issue. Indeed, I had intended it to be the number of write operations per second (deletes for delete_by_query, updates for update_by_query, and indexes for reindex). And it writes the whole batch at once rather than attempting to smooth out the writes. So .5 would write all 10000 documents and then sleep 20000 seconds minus the amount of time that the write took. The delay you see is expected. I think that problem is one of documentation. The docs should say "index requests" or "delete requests" or "update requests".

No worries, I'll write this up soon.

We must not be picking up the rejection that you are seeing somehow.

I figured there must be some mechanism for this already, thanks for explaining it. Another thought I had about the default requests_per_second of -1 was that the Reindex API might just keep queuing up Scroll searches until it fills the queue, with the -1 value affecting the depth calculation and blowing it out to try to queue 1001 scroll requests, thus producing the error. My intuition is that if this thought had any substance it would result in a maximum of 999 scroll requests, not 1001. I have no evidence that this is the case, just a passing thought; I don't see how it should affect the queue depth anyway if that is being managed separately. Just coming up with ideas here.

@andy-elastic There are two size directives in the Reindex API. My goal is to Reindex a relatively large index, so I'm referring to the size of the Scroll as specified in the source object:

By default _reindex uses scroll batches of 1000. You can change the batch size with the size field in the source element.

This setting will be limited by index.max_result_window, which is 10000 by default if I remember correctly; I'm running the stock value for this setting.
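
If the batch size really is capped by that setting, it looks like it could be raised per index with the settings API (shown only for illustration; I haven't changed it myself):

PUT largeindex/_settings
{
  "index.max_result_window": 20000
}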

I found that I would hit this error relatively quickly at the default setting of 1000, although never at exactly the same point; I never managed to reindex a whole index with the default settings. When I increased it to the maximum Scroll size of 10000, it would get through a lot more documents before hitting the error, allowing me to reindex many of my smaller indices.

@markharwood All very good points. We have 10 unique Kibana instances running with multiple users. The actual search usage is pretty low and ad hoc, mostly when investigating something. We do have some dashboards that are periodically captured and displayed as images on some TVs around our various IT departments; I don't believe these update any more than once every 5 minutes. I also noticed this behaviour under relatively low ES utilisation. Unfortunately I don't have any metrics for average search requests per second, but I'm pretty confident that the large majority (>90%) would be coming from the Reindex API.

In terms of data points, I've been spending the past week attempting to Reindex after hitting the bug in Kibana 5.5.0 to weed out my field conflicts.

@berglh
Contributor Author

berglh commented Aug 14, 2017

When I set requests_per_second to 10000, I still get the es_rejected_execution_exception error.
When I set it to 5000, it did complete; however, it took a much longer time.

10000 Results

| Item | Result |
| --- | --- |
| Total Docs | 26690000 |
| Total Batches | 2669 |
| Task Completion Time | 6195 s |
| Time per Batch | 2.32 s |
| Overall EPS | 4308 |
| Time Throttled | 2668 s |
| Time Throttled per Batch | 1 s |
| Time Working | 3527 s |
| Time Working per Scroll | 1.32 s |
| Working EPS | 7567.3 |
| Time Throttle to Work Ratio | 0.75 |

{
  "completed": true,
  "task": {
    "node": "fmVI6xlZQCmhqZqVPIjfXA",
    "id": 84157930,
    "type": "transport",
    "action": "indices:data/write/reindex",
    "status": {
      "total": 279063633,
      "updated": 0,
      "created": 26690000,
      "deleted": 0,
      "batches": 2669,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 2667995,
      "requests_per_second": 10000,
      "throttled_until_millis": 0
    },
    "description": "reindex from [anotherlargeindex] to [anotherlargeindex.es5]",
    "start_time_in_millis": 1502431979655,
    "running_time_in_nanos": 6195367033709,
    "cancellable": true
  },
  "response": {
    "took": 6195366,
    "timed_out": false,
    "total": 279063633,
    "updated": 0,
    "created": 26690000,
    "deleted": 0,
    "batches": 2669,
    "version_conflicts": 0,
    "noops": 0,
    "retries": {
      "bulk": 0,
      "search": 0
    },
    "throttled_millis": 2667995,
    "requests_per_second": 10000,
    "throttled_until_millis": 0,
    "failures": [
      {
        "shard": -1,
        "reason": {
          "type": "es_rejected_execution_exception",
          "reason": "rejected execution of org.elasticsearch.transport.TransportService$7@48752bce on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@53fe6f13[Running, pool size = 49, active threads = 49, queued tasks = 999, completed tasks = 7727372]]"
        }
      }
    ]
  }
}

5000 Results

| Item | Result |
| --- | --- |
| Total Docs | 133317017 |
| Total Batches | 13332 |
| Task Completion Time | 44458.6 s |
| Time per Batch | 3.33 s |
| Overall EPS | 2998.7 |
| Time Throttled | 26663 s |
| Time Throttled per Batch | 1.99 s |
| Time Working | 17795 s |
| Time Working per Batch | 1.33 s |
| Working EPS | 7491.6 |
| Time Throttle to Work Ratio | 1.49 |

{
  "_index": ".tasks",
  "_type": "task",
  "_id": "fmVI6xlZQCmhqZqVPIjfXA:81294668",
  "_score": 1,
  "_source": {
    "completed": true,
    "task": {
      "node": "fmVI6xlZQCmhqZqVPIjfXA",
      "id": 81294668,
      "type": "transport",
      "action": "indices:data/write/reindex",
      "status": {
        "total": 133317017,
        "updated": 0,
        "created": 133317017,
        "deleted": 0,
        "batches": 13332,
        "version_conflicts": 0,
        "noops": 0,
        "retries": {
          "bulk": 0,
          "search": 0
        },
        "throttled_millis": 26663379,
        "requests_per_second": 5000,
        "throttled_until_millis": 0
      },
      "description": "reindex from [largeindex] to [largeindex.es5]",
      "start_time_in_millis": 1502414174752,
      "running_time_in_nanos": 44458656303656,
      "cancellable": true
    },
    "response": {
      "took": 44458656,
      "timed_out": false,
      "total": 133317017,
      "updated": 0,
      "created": 133317017,
      "deleted": 0,
      "batches": 13332,
      "version_conflicts": 0,
      "noops": 0,
      "retries": {
        "bulk": 0,
        "search": 0
      },
      "throttled_millis": 26663379,
      "requests_per_second": 5000,
      "throttled_until_millis": 0,
      "failures": []
    }
  }
}

berglh added a commit to berglh/elasticsearch that referenced this issue Aug 14, 2017
Proposal for disambiguation of requests_per_second as discussed in [Reindex API: Reindex task es_rejected_execution_exception search queue failure elastic#26153](elastic#26153 (comment)).
@nik9000
Member

nik9000 commented Aug 15, 2017

No worries, I'll write this up soon.

Thanks! I have #26185 on my list of things to review again today.

There are two size directives in the Reindex API.

And my past mistakes continue to haunt me. One of them really should be called scroll_size. It'd be so much less confusing.

@andyb-elastic
Contributor

I was able to reliably reproduce this by

  • Creating a ~40 GB index with 5 shards, 1 replica on a 2.4.5 cluster with 3 nodes
  • Upgrading all nodes to 5.5.1
  • Setting thread_pool.search.queue_size to 1
  • Reindexing to a new index with
    POST localhost:9200/_reindex?wait_for_completion=false&wait_for_active_shards=all
    {
      "source": {
        "index": "nyc_taxis",
        "size":1000
      },
      "dest": {
        "index": "nyc_taxis_dest_1000"
      }
    }
    

Getting a task status similar to the one originally posted: failure with a search queue rejection, no retries recorded.

{
  "completed" : true,
  "task" : {
    "node" : "M_P0k50JTU6bfF2oVrvCyw",
    "id" : 1040,
    "type" : "transport",
    "action" : "indices:data/write/reindex",
    "status" : {
      "total" : 63438115,
      "updated" : 0,
      "created" : 2000,
      "deleted" : 0,
      "batches" : 2,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : "reindex from [nyc_taxis] to [nyc_taxis_dest_1000]",
    "start_time_in_millis" : 1504029858572,
    "running_time_in_nanos" : 2412376075,
    "cancellable" : true
  },
  "response" : {
    "took" : 2404,
    "timed_out" : false,
    "total" : 63438115,
    "updated" : 0,
    "created" : 2000,
    "deleted" : 0,
    "batches" : 2,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries" : {
      "bulk" : 0,
      "search" : 0
    },
    "throttled_millis" : 0,
    "requests_per_second" : -1.0,
    "throttled_until_millis" : 0,
    "failures" : [
      {
        "index" : "nyc_taxis",
        "shard" : 3,
        "node" : "fiLlxgSaRN6VEezWoAvfnA",
        "reason" : {
          "type" : "es_rejected_execution_exception",
          "reason" : "rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@51feb839 on EsThreadPoolExecutor[search, queue capacity = 1, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7ca35db8[Running, pool size = 13, active threads = 1, queued tasks = 0, completed tasks = 15]]"
        }
      }
    ]
  }
}

It seems like the index having been created in 2.x is the determining factor here; I was not able to reproduce this with indices created only in 5.x. I'll see if I can make the reproduction steps a little simpler and rule out some other factors.

@andyb-elastic
Contributor

I was wrong about it only affecting indices created in 2.x; I'm able to reproduce it with indices created in 5.x now. I'm not sure what I was doing differently before; I must not have set the queue size low enough.

It looks like this is reproducible in 5.5.1 on a single-node cluster with indices of any size. It does not reproduce on indices with only a single shard.

I'll see if I can reproduce it in the test environment and find a cause

@andyb-elastic
Contributor

I think I found why search requests aren't being retried here. When it gets the response back from a scroll, it retries if the request generated an exception. However, if the request completes with failures, it doesn't retry but still terminates the reindex when handling the scroll response. So I think what's happening here is that a scroll request is completing with failures.

That said, I'm not sure whether the cause of the failures @berglh is seeing is the same as what I've been reproducing here, because they look a little different. Mine consistently have a shard and node id associated with them, and I haven't seen any with the default shard: -1 like the ones @berglh posted.

In my case, some shards (but not all) are failing the search request because the queue is full. This doesn't get caught in our tests because they only use one shard, so every search that fails has all shards failed.
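
To spell out the distinction, here's a sketch of the behaviour as I understand it (hypothetical names, not the actual Elasticsearch code):

# Sketch only: a rejected *request* is retried, but a response that
# *completed with failures* (partial per-shard failures) is not.
class ScrollResponse:
    def __init__(self, hits=None, failures=None):
        self.hits = hits or []
        self.failures = failures or []   # per-shard failures inside an otherwise OK response

def handle_scroll(exception, response):
    if exception is not None:
        # whole-request rejection: retried and counted under retries.search
        return "retry" if "rejected" in str(exception) else "fail task"
    if response.failures:
        # partial failures: not retried, the reindex task is terminated
        return "fail task"
    return "index next batch"

print(handle_scroll(None, ScrollResponse(failures=["es_rejected_execution_exception on shard 3"])))  # -> fail task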

@nik9000 any ideas about what could cause a search failure with shard: -1?

@nik9000
Member

nik9000 commented Sep 5, 2017

@nik9000 any ideas about what could cause a search failure with shard: -1?

Not really. -1 means to me that it was a failure in the scroll request itself and not in any of the shards, and we should catch that and retry it.

So, about terminating the scroll when a shard fails: I don't know that you can retry on the shard level (#26433). I'd forgotten about this when the issue first came up. I think, though, that the _jim API ought to allow for retries (#26472) when it is built.

@nik9000
Member

nik9000 commented Oct 27, 2017

@colings86 pinged me about this issue, wanting to make sure that it is still appropriate to have it in feedback_needed. The retryable shard failures don't need feedback; we're just waiting on #26472, which @jimczi is going to have a look at before too long.

@andyb-elastic's shard: -1 could use some more investigation because we're not sure how it comes about, but for now I don't think of it as part of this issue.

So this is officially blocked waiting on #26472.

@andyb-elastic
Contributor

andyb-elastic commented Oct 27, 2017

@nik9000 right, I don't think we need any more feedback here, and we're waiting for #26472

For more background: the reason we can't fix this while reindex uses scrolls is that retrying and continuing would lose some documents. When a scroll fails on some shards and returns partial results, there's currently no way to rewind so that the reader is sure to pick up the missing documents. In the context of the reindex API, losing documents is clearly incorrect behaviour, so the right thing to do is to fail when this happens, even though it's very inconvenient.

When we replace reindex's use of scroll with #26472 we'll be able to retry this failure condition without losing documents.

@lcawl added the :Distributed Indexing/CRUD label and removed the :Reindex API label on Feb 13, 2018
@dakrone
Member

dakrone commented Mar 20, 2018

@andyb-elastic is this something you are still working on?

@andyb-elastic
Contributor

@dakrone not actively; we're waiting on a new API from #26472. I think we can close this, as that feature isn't on the roadmap yet. Additional feedback is always welcome.

In the meantime, users encountering this problem with scrolls should use search_after if they can accept not having the point-in-time view that scrolls provide.
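
For example, something roughly like this (5.x syntax; the values passed to search_after are the sort values of the last hit from the previous page, and the ones below are placeholders):

GET largeindex/_search
{
  "size": 1000,
  "sort": [
    { "@timestamp": "asc" },
    { "_uid": "asc" }
  ],
  "search_after": [1451606400000, "logs#AVplaceholder"]
}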
