
[ML] Job opening fails during .ml-state creation #36271

Closed
pheyos opened this issue Dec 5, 2018 · 8 comments

pheyos (Member) commented Dec 5, 2018

Found in version

  • 7.0.0 b3663

Steps to reproduce
Perform the following steps on a newly created instance / cluster (i.e. the .ml-state index does not exist yet):

  • Create a machine learning job and start the datafeed
  • When the lookback completes and the job is being closed, open a second job
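For reference, a rough sketch of the equivalent sequence via the Java high level REST client. The job and datafeed names are placeholders, the jobs and datafeeds are assumed to have been created already, and whether the failure reproduces depends on timing:

```java
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.ml.OpenJobRequest;
import org.elasticsearch.client.ml.StartDatafeedRequest;

// Hypothetical reproduction sketch; "job-1", "job-2" and "datafeed-1" are placeholders
// for jobs and datafeeds that already exist on a fresh cluster.
void reproduce(RestHighLevelClient client) throws Exception {
    // Open the first job and start its datafeed. Assuming the datafeed has an end time,
    // the lookback finishes, the job closes, the first state document is indexed and
    // the .ml-state index gets created from its template.
    client.machineLearning().openJob(new OpenJobRequest("job-1"), RequestOptions.DEFAULT);
    client.machineLearning().startDatafeed(new StartDatafeedRequest("datafeed-1"), RequestOptions.DEFAULT);

    // ... wait until the lookback completes and the first job is being closed ...

    // Opening a second job at this moment can hit the race: .ml-state exists but its
    // primary shards are not active yet, so the open fails with the error shown below.
    client.machineLearning().openJob(new OpenJobRequest("job-2"), RequestOptions.DEFAULT);
}
```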

Expected result

  • The second job is opened without errors

Actual result

  • When the first job is closed, the index .ml-state is created. With unlucky timing, the index is not yet green when the second job is opened, so the open fails with the message:
Could not open job because no suitable nodes were found, allocation explanation
[Not opening job [remote_ip_request_rate], because not all primary shards are active
for the following indices [.ml-state]]

Additional information

  • This happens particularly often when using the nginx recognizer module to create multiple jobs at a time
pheyos added the >bug and :ml Machine learning labels on Dec 5, 2018
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

dimitris-athanasiou (Contributor) commented:

This is an interesting problem. The .ml-state index is created from a template when the first state document is indexed. This means at least one job must have been created and run up to the point where it persists its state (in particular, until the job is closed or enough time has passed for a periodic persist to occur).

In addition, when a job is opened, we validate that the .ml-state index's primary shards are active if the index exists. We need this validation to ensure that when a job is relocated, its allocation is delayed until .ml-state (and the other indices it needs) are available. This is crucial because when nodes drop out, it is quite possible that the cluster is also missing data nodes, some of which hold shards of the ML indices. A simplified sketch of this check is shown below.
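For illustration only (a sketch of the kind of check described above, not the actual Elasticsearch source):

```java
import java.util.ArrayList;
import java.util.List;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.routing.IndexRoutingTable;

// Illustrative sketch of the validation: collect any of the given indices that exist
// but whose primary shards are not all active yet.
static List<String> indicesWithUnavailablePrimaries(ClusterState clusterState, String... indices) {
    List<String> unavailable = new ArrayList<>();
    for (String index : indices) {
        IndexRoutingTable routingTable = clusterState.getRoutingTable().index(index);
        // An index that does not exist yet is not a problem; the check only applies to existing indices.
        if (routingTable != null && routingTable.allPrimaryShardsActive() == false) {
            unavailable.add(index);
        }
    }
    return unavailable;
}
```

If such a check returns a non-empty result for .ml-state, node selection fails with the "not all primary shards are active" message quoted in the report above.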

Having explained the above, we can now understand why the reported problem happens. The flow of the recognizer is as follows:

  • Create all jobs
  • Create all datafeeds
  • Serially open each job and start its datafeed

The last bit is key: by the time we open the 2nd, 3rd, etc. job, the previous ones are already running, so timing can give rise to the issue. One of the previous jobs starts persisting its state while another job is being opened. The .ml-state index exists in time for the validation, but it has only just been created, so its primaries are not active yet and the validation fails.

Note this specific scenario may only happen once: when the cluster is new. I will try to list options for solving this in a subsequent comment.

dimitris-athanasiou (Contributor) commented:

I am listing the solutions I could come up with here. I have a clear favourite after discussing this with some folks from the elasticsearch team, but I'll list the others too (even the bad ones).

  • Change the UI to first open all jobs and then start the datafeeds

This is a superficial solution. It would fix the problem for our QA tests and it would also reduce the chances of users hitting the problem, but it doesn't prevent the problem from happening in some other way.

  • Skip the primary-shard-active validation when a job is opened for the first time.

When the very first job is opened we don't do that check as the .ml-state index does not exist. We could remove the check entirely when jobs are opened (as opposed to jobs being reallocated).

  • Wait for yellow status for the indices we need

This seems to be the most suitable solution. The cluster health API, scoped to the indices we need, allows us to do this easily (a rough sketch follows below). Note that we cannot do this during node selection, as that must not block. But we can do it in the master operation of TransportOpenJobAction before we start the persistent task. It would still be possible for the index to become unassigned by the time we try to assign the job, but that would imply perturbations in the cluster, and it would probably be OK to fail opening the job in that case.
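A minimal sketch of the idea, assuming it runs in the master operation before the persistent task is started; the index patterns and error handling are illustrative, not the actual change:

```java
import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.rest.RestStatus;

// Hypothetical helper, not the actual TransportOpenJobAction change: wait for the ML
// indices to reach at least yellow health before the persistent task is started.
static void waitForMlIndicesAtLeastYellow(Client client, ActionListener<Void> listener) {
    ClusterHealthRequest healthRequest =
            new ClusterHealthRequest(".ml-state", ".ml-anomalies-*").waitForYellowStatus();
    client.admin().cluster().health(healthRequest, ActionListener.wrap(response -> {
        if (response.isTimedOut()) {
            // the indices did not become available in time; fail the open request
            listener.onFailure(new ElasticsearchStatusException(
                    "timed out waiting for ML indices to become available", RestStatus.REQUEST_TIMEOUT));
        } else {
            listener.onResponse(null);
        }
    }, listener::onFailure));
}
```

With something like this in place, a freshly created .ml-state would only delay the open until its primary is assigned, instead of failing it.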

What are your thoughts @elastic/ml-core ?

@dimitris-athanasiou dimitris-athanasiou self-assigned this Dec 6, 2018
benwtrent (Member) commented:

I second the "Wait for yellow status for the indices we need" option.

This is essentially what we are doing anyway, except that we are relying on the user to wait some amount of time before opening a second job.

If the status fails to reach at least yellow within some acceptable timeout, or if it degrades before the persistent task can be assigned to a node, throwing an error is acceptable (as you have stated).

Unanswered questions:

  • How long should we wait for the indices to be in the yellow status?
  • Where should this timeout value come from?

dimitris-athanasiou (Contributor) commented:

How long should we wait for the indices to be in the yellow status?

The default timeout for the health status API is 30 seconds. That should be more than enough for a healthy cluster to activate the primaries of a freshly created index.

Where should this timeout value come from?

That is a good question. We could reuse the open job request's timeout, but then we need to do more work to properly account for the timeout across the action's different steps, and the infrastructure to do this properly isn't quite there yet. We could give that a proper try, simply accumulate the timeouts, or just apply the health check's default timeout separately from the action's timeout. The first option would be the correct behaviour, but we need to weigh the effort against the benefit when making this call.
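A tiny sketch of that last option (the index name and timeout value here are illustrative only):

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.common.unit.TimeValue;

// Hypothetical sketch of the "apply the default timeout separately" option: the health
// check gets its own wait budget, independent of the open-job request's overall timeout.
static ClusterHealthRequest mlStateHealthRequest() {
    return new ClusterHealthRequest(".ml-state")        // index name for illustration only
            .waitForYellowStatus()
            .timeout(TimeValue.timeValueSeconds(30));    // matches the health API default of 30s
}
```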

benwtrent (Member) commented:

The first option would be the correct behaviour, but we need to weigh the effort against the benefit when making this call.

I agree; since the default timeout for the health status API is 30 seconds, that is probably OK.

The only thing that SLIGHTLY concerns me is the case where the user sets the timeout on the open job API to less than 30 seconds.

However, that should be rare, as the default value is high (30 min), which nudges users towards higher timeout values, where an additional 30 seconds of waiting is not a big issue.

I say let's keep it simple and not worry about accumulating the timeouts just yet. If this turns out to be painful, we can do the extra work down the line. I think the bigger concern is that this particular bug exists at all; an API not timing out exactly when you requested seems like a smaller issue to me.

benwtrent (Member) commented:

@pheyos This issue should be fixed by #37483.

Can you verify?

pheyos (Member, Author) commented Jan 31, 2019

@benwtrent I've tried in various test environments, and with your fix I was not able to reproduce this issue any more 🎉 Thanks for fixing this!
