[ML] Improve audit message for hard_limit #38034
Comments
Pinging @elastic/ml-core
@grabowskit saw the same thing with the Iowa liquor store data on 7th September last year. I imagine it's a common problem when a large number of new split field values are seen for the first time in the same bucket. In order to make the message sensible we need our […]

@edsavage, at some point (it's not mega urgent) please could you do a more in-depth assessment of how we could feed back this extra statistic from the […]
Improve the `hard_limit` memory audit message by reporting how many bytes over the configured memory limit the job was at the point of the last allocation failure. Previously the model memory usage was reported; however, this was inaccurate and hence of limited use, primarily because the total memory used by the model can decrease significantly after the model's status is changed to `hard_limit` but before the model size stats are reported from autodetect to ES. While this PR contains the changes to the format of the `hard_limit` audit message, it depends on modifications to the ml-cpp backend to send additional data fields in the model size stats message. Those changes will follow in a subsequent PR. Note that this PR must be merged before the ml-cpp one, to keep CI tests happy. Relates elastic#38034
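For illustration, here is a minimal Java sketch of how such a message could be assembled from the two new stats. The names (`modelBytesMemoryLimit`, `modelBytesExceeded`) and the message text are assumptions for this sketch, not the exact Elasticsearch implementation:

```java
import java.util.Locale;

public final class HardLimitAuditMessage {

    // Build an improved hard_limit audit message from the configured limit
    // and the number of bytes it was exceeded by at the last allocation
    // failure (hypothetical names, mirroring the fields described above).
    public static String build(long modelBytesMemoryLimit, long modelBytesExceeded) {
        return String.format(Locale.ROOT,
            "Job memory status changed to hard_limit; job exceeded model memory limit %s by %s",
            bytesToMb(modelBytesMemoryLimit), bytesToMb(modelBytesExceeded));
    }

    private static String bytesToMb(long bytes) {
        return String.format(Locale.ROOT, "%.1fmb", bytes / (1024.0 * 1024.0));
    }

    public static void main(String[] args) {
        // e.g. a 6mb limit exceeded by ~0.2mb at the last allocation failure
        System.out.println(build(6L * 1024 * 1024, 200L * 1024));
    }
}
```

Reporting the overshoot relative to the configured limit avoids the staleness problem described above: unlike the raw model size, it is captured at the moment of the allocation failure itself.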
Add the current model memory limit, and the number of bytes by which it was exceeded at the point of the last allocation failure, to the model size stats. These will be used to construct a (hopefully) more informative `hard_limit` audit message. The reported memory usage is also scaled to take into account the byte limit margin, which is in play in the initial period of a job's lifetime and is used to scale down the high memory limit. This should give a more accurate representation of how close the memory usage is to the high limit. Relates elastic/elasticsearch#42086 Closes elastic/elasticsearch#38034
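As a rough illustration of the margin scaling described above (the actual implementation lives in the ml-cpp C++ code; the formula and names here are assumptions for the sketch):

```java
public final class MemoryUsageScaling {

    // During a job's initial period the effective limit is the configured
    // high limit minus a margin. Scaling the raw usage by the inverse ratio
    // makes the reported number comparable with the configured limit.
    public static long scaleUsage(long rawUsageBytes, long highLimitBytes, long byteLimitMarginBytes) {
        double scale = (double) highLimitBytes / (double) (highLimitBytes - byteLimitMarginBytes);
        return Math.round(rawUsageBytes * scale);
    }

    public static void main(String[] args) {
        long highLimit = 6L * 1024 * 1024; // 6mb configured limit
        long margin = 1L * 1024 * 1024;    // hypothetical 1mb initial margin
        long rawUsage = 4L * 1024 * 1024;  // raw usage at the allocation failure
        System.out.println(scaleUsage(rawUsage, highLimit, margin) + " bytes after scaling");
    }
}
```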
When ML jobs are hitting the `hard_limit`, an audit message is generated. This message contains details about where the limit was hit. But currently, this number is not very helpful, because it only reflects the latest `model_bytes` of the job, which does not include the memory of the next stage of the analysis.

Example:

- `model_memory_limit`: 6mb
- latest `model_bytes`: 3.8mb
- audit message: `Job memory status changed to hard_limit at 1.4mb [...]`
This could be confusing for the user.
Feature request: Adjust the audit message to better explain why the `hard_limit` is hit.