[ML] Starting trained models with a ridiculously large queue_capacity crashes the node #89555

dolaru · 2022-08-23T16:27:49Z

Elasticsearch Version

Found in 8.4.0

Installed Plugins

No response

Java Version

bundled

OS Version

Darwin Kernel Version 21.6.0

Problem Description

Although unlikely, if a user tries to start a trained model deployment with a ridiculously large queue_capacity (for example 9999999999999), the node that tries to allocate the model will immediately crash.

Furthermore, the node will continue to crash when restarted, as soon as the node attempts to allocate the trained model deployment.

Steps to Reproduce

1. Import a trained model
2. Start the trained model deployment with queue_capacity=9999999999999
3. Notice that the ML node that attempted to start the trained model deployment has crashed
4. Try restarting the node
5. Notice that the node continues to crash as soon as it tries to allocate the trained model deployment

Logs (if relevant)

The Elasticsearch instance crashes immediately after logging the attempt to start the trained model deployment. Nothing is logged after that.

elasticsearchmachine · 2022-08-23T16:28:12Z

Pinging @elastic/ml-core (Team:ML)

dimitris-athanasiou · 2022-08-24T08:13:12Z

I suspect the same issue exists for the enrich.coordinator_proxy.queue_capacity setting of the enrich plugin. Pinging @elastic/es-data-management.

When starting a trained model deployment, a queue is created. If the queue_capacity is too large, it can lead to OOM and a node crash. This commit adds validation that the queue_capacity cannot be more than 1M. Closes elastic#89555
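For illustration only, the kind of check described above can be sketched as a small bounds validation that runs before any queue is allocated. This is not the code from the actual fix: the class name `QueueCapacityValidator`, the method `validate`, and the error messages are hypothetical, and only the 1M upper bound comes from the description above.

```java
// Hypothetical sketch of an upper-bound check on queue_capacity.
// Only the 1M limit is taken from the fix description; all names are illustrative.
final class QueueCapacityValidator {

    // Upper bound from the fix description ("cannot be more than 1M").
    static final long MAX_QUEUE_CAPACITY = 1_000_000L;

    private QueueCapacityValidator() {}

    /**
     * Returns the requested capacity as an int when it lies in [1, MAX_QUEUE_CAPACITY].
     * Throws IllegalArgumentException otherwise, so that no queue is ever sized
     * from an unchecked value.
     */
    static int validate(long requested) {
        if (requested < 1) {
            throw new IllegalArgumentException(
                "[queue_capacity] must be a positive integer, got " + requested);
        }
        if (requested > MAX_QUEUE_CAPACITY) {
            throw new IllegalArgumentException(
                "[queue_capacity] must be at most " + MAX_QUEUE_CAPACITY + ", got " + requested);
        }
        // Safe narrowing: the value is known to fit in an int at this point.
        return (int) requested;
    }
}
```

With a check like this, `QueueCapacityValidator.validate(1024)` returns 1024, while `QueueCapacityValidator.validate(9_999_999_999_999L)` is rejected with an `IllegalArgumentException` instead of reaching the allocation step.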

dolaru added >bug :ml Machine learning labels Aug 23, 2022

elasticsearchmachine added the Team:ML label (Meta label for the ML team) Aug 23, 2022

dimitris-athanasiou mentioned this issue Aug 24, 2022

[ML] Validate trained model deployment queue_capacity limit #89573

Merged

dimitris-athanasiou closed this as completed in #89573 Aug 24, 2022

dimitris-athanasiou mentioned this issue Aug 25, 2022

[8.4][ML] Validate trained model deployment queue_capacity limit #89611

Merged
