[ML] Starting trained models with a ridiculously large queue_capacity crashes the node #89555

Closed
dolaru opened this issue Aug 23, 2022 · 2 comments · Fixed by #89573
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team

Comments

@dolaru
Member

dolaru commented Aug 23, 2022

Elasticsearch Version

Found in 8.4.0

Installed Plugins

No response

Java Version

bundled

OS Version

Darwin Kernel Version 21.6.0

Problem Description

Although unlikely, if a user tries to start a trained model deployment with a ridiculously large queue_capacity (for example 9999999999999), the node that tries to allocate the model will immediately crash.

Furthermore, the node will continue to crash when restarted, as soon as the node attempts to allocate the trained model deployment.

Steps to Reproduce

  1. Import a trained model
  2. Start the trained model deployment with queue_capacity=9999999999999
  3. Notice that the ML node that attempted to start the trained model deployment has crashed
  4. Try restarting the node
  5. Notice that the node continues to crash as soon as it tries to allocate the trained model deployment
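
For reference, step 2 can be sketched as an HTTP request against the start trained model deployment endpoint. This is only an illustration: the host `http://localhost:9200`, the model id `my-model`, and the class and method names are hypothetical, and the snippet only builds the request; it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ReproRequest {

    // Builds (but does not send) the POST request that starts a trained model
    // deployment with the given queue_capacity query parameter.
    // baseUrl and modelId are placeholders supplied by the caller.
    static HttpRequest buildStartRequest(String baseUrl, String modelId, long queueCapacity) {
        String url = baseUrl
            + "/_ml/trained_models/" + modelId
            + "/deployment/_start?queue_capacity=" + queueCapacity;
        return HttpRequest.newBuilder(URI.create(url))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical host and model id; the capacity is the value from the report.
        HttpRequest request = buildStartRequest("http://localhost:9200", "my-model", 9999999999999L);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending this request to an affected version is what triggers the crash described above, so it should only be tried against a disposable test node.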

Logs (if relevant)

The Elasticsearch instance crashes immediately after logging the attempt to start the trained model deployment. No logs after that.

@dolaru dolaru added >bug :ml Machine learning labels Aug 23, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Aug 23, 2022
@dimitris-athanasiou
Contributor

I suspect the same issue exists for the enrich.coordinator_proxy.queue_capacity setting for the enrich plugin. Pinging @elastic/es-data-management .

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Aug 24, 2022
When starting a trained model deployment, a queue is created.
If the queue_capacity is too large, it can lead to OOM and a node
crash.

This commit adds validation that the queue_capacity cannot be more
than 1M.

Closes elastic#89555
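
The validation described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the linked pull request: the class, method, and constant names are hypothetical, the 1M limit is taken from the message above, and the lower bound of 1 is an assumption. `ArrayBlockingQueue` is used only as a stand-in for a bounded queue whose backing array is allocated up front, which is why an unchecked capacity can exhaust the heap.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueCapacityCheck {

    // Upper bound from the commit message ("cannot be more than 1M").
    // The name of the constant is hypothetical.
    static final int MAX_QUEUE_CAPACITY = 1_000_000;

    // Rejects out-of-range capacities before any queue is allocated.
    // The lower bound of 1 is an assumption made for this sketch.
    static int validateQueueCapacity(long requested) {
        if (requested < 1) {
            throw new IllegalArgumentException(
                "[queue_capacity] must be a positive integer, got " + requested);
        }
        if (requested > MAX_QUEUE_CAPACITY) {
            throw new IllegalArgumentException(
                "[queue_capacity] must be at most " + MAX_QUEUE_CAPACITY + ", got " + requested);
        }
        return (int) requested;
    }

    // Creates the bounded queue only after the capacity has been validated.
    // ArrayBlockingQueue allocates its backing array eagerly, so validating
    // first keeps an oversized request from reaching the allocation.
    static BlockingQueue<Runnable> createQueue(long requestedCapacity) {
        return new ArrayBlockingQueue<>(validateQueueCapacity(requestedCapacity));
    }

    public static void main(String[] args) {
        System.out.println(createQueue(1024).remainingCapacity());
        try {
            createQueue(9999999999999L);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing the request at validation time means the invalid deployment is never persisted, so a restarted node has nothing to retry.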
dimitris-athanasiou added a commit that referenced this issue Aug 24, 2022
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Aug 25, 2022
dimitris-athanasiou added a commit that referenced this issue Aug 25, 2022