
[ML] Job opening fails during .ml-state creation #36271

Closed
pheyos opened this issue Dec 5, 2018 · 8 comments

pheyos (Member) commented Dec 5, 2018

Found in version

  • 7.0.0 b3663

Steps to reproduce
Perform the following steps on a newly created instance / cluster (i.e. the .ml-state index does not exist yet):

  • Create a machine learning job and start the datafeed
  • When the lookback completes and the job is being closed, open a second job
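For reference, a rough sketch of the equivalent sequence via the Java high level REST client. The job and datafeed names are placeholders, the jobs and datafeeds are assumed to have been created already, and whether the failure reproduces depends on timing:

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.ml.OpenJobRequest;
import org.elasticsearch.client.ml.StartDatafeedRequest;

// Hypothetical reproduction sketch; "job-1", "job-2" and "datafeed-1" are placeholders
// for jobs and datafeeds that already exist on a fresh cluster.
void reproduce(RestHighLevelClient client) throws Exception {
    // Open the first job and start its datafeed. Assuming the datafeed has an end time,
    // the lookback finishes, the job closes, the first state document is indexed and
    // the .ml-state index gets created from its template.
    client.machineLearning().openJob(new OpenJobRequest("job-1"), RequestOptions.DEFAULT);
    client.machineLearning().startDatafeed(new StartDatafeedRequest("datafeed-1"), RequestOptions.DEFAULT);

    // ... wait until the lookback completes and the first job is being closed ...

    // Opening a second job at this moment can hit the race: .ml-state exists but its
    // primary shards are not active yet, so the open fails with the error shown below.
    client.machineLearning().openJob(new OpenJobRequest("job-2"), RequestOptions.DEFAULT);
}
```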

Expected result

  • The second job is opened without errors

Actual result

  • When the first job is closed, the index .ml-state is created. With unlucky timing, the index is not yet green when the second job is opened, so the open fails with the message:
Could not open job because no suitable nodes were found, allocation explanation
[Not opening job [remote_ip_request_rate], because not all primary shards are active
for the following indices [.ml-state]]

Additional information

  • This happens particularly often when using the nginx recognizer module to create multiple jobs at a time
pheyos added the >bug and :ml Machine learning labels on Dec 5, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

dimitris-athanasiou (Contributor) commented:

This is an interesting problem. The .ml-state index is created from a template when the first state document is indexed. This means at least one job must have been created and run up to the point where it persists its state (in particular, until the job is closed or enough time has passed for a periodic persist to occur).

In addition, when a job is opened, we validate that the .ml-state index's primary shards are active if the index exists. We need this validation to ensure that when a job is relocated, its allocation is delayed until .ml-state (and the other indices it needs) are available. This is crucial because when nodes drop out, it is quite possible that the cluster is also missing data nodes, some of which hold shards of the ML indices. A simplified sketch of this check is shown below.
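For illustration only (a sketch of the kind of check described above, not the actual Elasticsearch source):

```java
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.routing.IndexRoutingTable;

// Illustrative sketch of the validation: collect any of the given indices that exist
// but whose primary shards are not all active yet.
static List<String> indicesWithUnavailablePrimaries(ClusterState clusterState, String... indices) {
    List<String> unavailable = new ArrayList<>();
    for (String index : indices) {
        IndexRoutingTable routingTable = clusterState.getRoutingTable().index(index);
        // An index that does not exist yet is not a problem; the check only applies to existing indices.
        if (routingTable != null && routingTable.allPrimaryShardsActive() == false) {
            unavailable.add(index);
        }
    }
    return unavailable;
}
```

If such a check returns a non-empty result for .ml-state, node selection fails with the "not all primary shards are active" message quoted in the report above.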

Having explained the above, we can now understand why the reported problem happens. The flow of the recognizer is as follows:

  • Create all jobs
  • Create all datafeeds
  • Serially open each job and start its datafeed

The last bit is key: by the time we open the 2nd, 3rd, etc. job, the previous ones are already running, so timing can give rise to the issue. One of the previous jobs starts persisting its state while another job is being opened. The .ml-state index exists in time for the validation, but it has only just been created, so its primaries are not active yet and the validation fails.

Note this specific scenario may only happen once: when the cluster is new. I will try to list options for solving this in a subsequent comment.

dimitris-athanasiou (Contributor) commented:

I am listing the solutions I could come up with here. I have a clear favourite after discussing this with some folks from the elasticsearch team, but I'll list the others too (even the bad ones).

  • Change the UI to first open all jobs and then start the datafeeds

This is a superficial solution. It would fix the problem for our QA tests and it would also reduce the chances of users hitting the problem, but it doesn't prevent the problem from happening in some other way.

  • Skip the primary-shard-active validation when a job is opened for the first time.

When the very first job is opened we don't do that check as the .ml-state index does not exist. We could remove the check entirely when jobs are opened (as opposed to jobs being reallocated).

  • Wait for yellow status for the indices we need

This seems to be the most suitable solution. The cluster health API, scoped to the indices we need, allows us to do this easily (a rough sketch follows below). Note that we cannot do this during node selection, as that must not block. But we can do it in the master operation of TransportOpenJobAction before we start the persistent task. It would still be possible for the index to become unassigned by the time we try to assign the job, but that would imply perturbations in the cluster, and it would probably be OK to fail opening the job in that case.
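A minimal sketch of the idea, assuming it runs in the master operation before the persistent task is started; the index patterns and error handling are illustrative, not the actual change:

```java
import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.rest.RestStatus;

// Hypothetical helper, not the actual TransportOpenJobAction change: wait for the ML
// indices to reach at least yellow health before the persistent task is started.
static void waitForMlIndicesAtLeastYellow(Client client, ActionListener<Void> listener) {
    ClusterHealthRequest healthRequest =
            new ClusterHealthRequest(".ml-state", ".ml-anomalies-*").waitForYellowStatus();
    client.admin().cluster().health(healthRequest, ActionListener.wrap(response -> {
        if (response.isTimedOut()) {
            // the indices did not become available in time; fail the open request
            listener.onFailure(new ElasticsearchStatusException(
                    "timed out waiting for ML indices to become available", RestStatus.REQUEST_TIMEOUT));
        } else {
            listener.onResponse(null);
        }
    }, listener::onFailure));
}
```

With something like this in place, a freshly created .ml-state would only delay the open until its primary is assigned, instead of failing it.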

What are your thoughts @elastic/ml-core ?

@dimitris-athanasiou dimitris-athanasiou self-assigned this Dec 6, 2018
benwtrent (Member) commented:

I second the "Wait for yellow status for the indices we need" option.

This is essentially what we are doing anyway, except that we are relying on the user to wait some amount of time before opening a second job.

If the status fails to reach at least yellow within some acceptable timeout, or if it degrades before the persistent task can be assigned to a node, throwing an error is acceptable (as you have stated).

Unanswered questions:

  • How long should we wait for the indices to be in the yellow status?
  • Where should this timeout value come from?

dimitris-athanasiou (Contributor) commented:

How long should we wait for the indices to be in the yellow status?

The default timeout for the health status API is 30 seconds. That should be more than enough for a healthy cluster to activate the primaries of a freshly created index.

Where should this timeout value come from?

That is a good question. We could reuse the open job request's timeout, but then we need to do more work to properly account for the timeout across the action's different steps, and the infrastructure to do this properly isn't quite there yet. We could give that a proper try, simply accumulate the timeouts, or just apply the health check's default timeout separately from the action's timeout. The first option would be the correct behaviour, but we need to weigh the effort against the benefit when making this call.
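A tiny sketch of that last option (the index name and timeout value here are illustrative only):

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.common.unit.TimeValue;

// Hypothetical sketch of the "apply the default timeout separately" option: the health
// check gets its own wait budget, independent of the open-job request's overall timeout.
static ClusterHealthRequest mlStateHealthRequest() {
    return new ClusterHealthRequest(".ml-state")        // index name for illustration only
            .waitForYellowStatus()
            .timeout(TimeValue.timeValueSeconds(30));    // matches the health API default of 30s
}
```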

benwtrent (Member) commented:

The first option would be the correct behaviour, but we need to weigh the effort against the benefit when making this call.

I agree; since the default timeout for the health status API is 30 seconds, that is probably OK.

The only thing that SLIGHTLY concerns me is the case where the user sets the timeout on the open job API to less than 30 seconds.

However, that should be rare, as the default value is high (30 min), which nudges users towards higher timeout values, where an additional 30 seconds of waiting is not a big issue.

I say let's keep it simple and not worry about accumulating the timeouts just yet. If this turns out to be painful, we can do the extra work down the line. I think the bigger concern is that this particular bug exists at all; an API not timing out exactly when you requested seems like a smaller issue to me.

benwtrent (Member) commented:

@pheyos This issue should be fixed by #37483.

Can you verify?

pheyos (Member, Author) commented Jan 31, 2019

@benwtrent I've tried in various test environments, and with your fix I was not able to reproduce this issue any more 🎉 Thanks for fixing this!
