
[ML] Improve audit message for hard_limit #38034

Closed
pheyos opened this issue Jan 30, 2019 · 2 comments · Fixed by elastic/ml-cpp#486
Assignees
Labels
>enhancement :ml Machine learning

Comments

@pheyos
Member

pheyos commented Jan 30, 2019

When an ML job hits the hard_limit, an audit message is generated. This message reports the memory level at which the limit was hit, but currently that number is not very helpful, because it only reflects the latest model_bytes of the job, which does not include the memory required for the next stage of the analysis.

Example:

  • Model memory limit setting: 6mb
  • Reported model_bytes: 3.8mb
  • Job audit message: Job memory status changed to hard_limit at 1.4mb [...]

This could be confusing for the user.

Feature request: Adjust the audit message to better explain why the hard_limit was hit.

@pheyos pheyos added the :ml Machine learning label Jan 30, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@droberts195
Contributor

@grabowskit saw the same thing with the Iowa liquor store data on 7th September last year.

I imagine it's a common problem when a large number of new split field values are seen for the first time in the same bucket.

In order to make the message sensible we need our model_size_stats to include not just the current model_bytes but also something like max_predicted_model_bytes. That new field would hold the highest estimate the process had ever made of how much memory it would need for the next analysis step, even if, as a result of making that estimate, it decided not to perform that step and instead put the job into the hard_limit state.
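The bookkeeping proposed above could look something like the following minimal Python sketch. The field names (model_bytes, max_predicted_model_bytes) come straight from the comment; the class itself is purely illustrative and is not the real ml-cpp implementation:

```python
class ModelSizeStats:
    """Illustrative stand-in for the model size stats document."""

    def __init__(self):
        self.model_bytes = 0
        self.max_predicted_model_bytes = 0

    def update_usage(self, current_bytes):
        # The current model memory usage, as reported today.
        self.model_bytes = current_bytes

    def record_estimate(self, predicted_bytes):
        # Keep the highest estimate ever made of the memory needed for the
        # next analysis step, even if that step was then skipped and the
        # job was put into the hard_limit state.
        self.max_predicted_model_bytes = max(
            self.max_predicted_model_bytes, predicted_bytes
        )
```

With this in place, the audit message could report max_predicted_model_bytes rather than a model_bytes value that may be well below the configured limit.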

@edsavage at some point (it's not mega urgent) please could you do a more in-depth assessment of how we could feed this extra statistic back from the autodetect process into the model_size_stats?

edsavage added a commit to edsavage/elasticsearch that referenced this issue May 10, 2019
Improve the hard_limit memory audit message by reporting how many bytes
over the configured memory limit the job was at the point of the last
allocation failure.

Previously the model memory usage was reported, however this was
inaccurate and hence of limited use, primarily because the total
memory used by the model can decrease significantly after the model's
status is changed to hard_limit but before the model size stats are
reported from autodetect to ES.

While this PR contains the changes to the format of the hard_limit audit
message it is dependent on modifications to the ml-cpp backend to
send additional data fields in the model size stats message. These
changes will follow in a subsequent PR. It is worth noting that this PR
must be merged prior to the ml-cpp one, to keep CI tests happy.

Relates elastic#38034
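The improved message described in this commit could be sketched as follows. The function name and the exact wording are hypothetical; the point is that the message is built from the configured limit and the overshoot at the last allocation failure, not from a possibly stale model_bytes value:

```python
def format_mb(n_bytes):
    # Render a byte count in megabytes, matching the style of the
    # example in the issue description (e.g. "6.0mb").
    return f"{n_bytes / (1024 * 1024):.1f}mb"

def hard_limit_audit_message(limit_bytes, excess_bytes):
    # Report both the configured memory limit and how far the job
    # exceeded it at the point of the last allocation failure.
    return (
        f"Job memory status changed to hard_limit; "
        f"the model memory limit is {format_mb(limit_bytes)} "
        f"and the job exceeded it by {format_mb(excess_bytes)}"
    )
```

For the example in the issue description, a message of this shape would read "the model memory limit is 6.0mb and the job exceeded it by ...", which is far harder to misread than "hard_limit at 1.4mb".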
edsavage added a commit to edsavage/ml-cpp that referenced this issue May 18, 2019
Add the current model memory limit and the number of bytes in
excess of that at the point of the last allocation failure to the model
size stats. These will be used to construct a (hopefully) more
informative hard_limit audit message.

The reported memory usage is also scaled to take into account the byte
limit margin, which is in play in the initial period of a job's lifetime
and is used to scale down the high memory limit. This should give a more
accurate representation of how close the memory usage is to the high
limit.

relates elastic/elasticsearch#42086

closes elastic/elasticsearch#38034
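The scaling described in this commit might be sketched as below. Both the function name and the exact formula are assumptions based on the commit message: early in a job's lifetime the margin reduces the effective limit, so the raw usage is scaled up proportionally to reflect how close the job really is to the configured limit.

```python
def scaled_memory_usage(raw_bytes, limit_bytes, margin_bytes):
    # Effective limit while the byte limit margin is in play
    # (assumption: margin is subtracted from the configured limit).
    effective_limit = limit_bytes - margin_bytes
    if effective_limit <= 0:
        # Degenerate margin; fall back to the unscaled usage.
        return raw_bytes
    # Re-express usage against the full configured limit, so that a job
    # at the margin-reduced limit appears at the configured limit.
    return int(raw_bytes * limit_bytes / effective_limit)
```

Under this sketch, a job using 800 bytes against a 1000-byte limit with a 200-byte margin would be reported as using the full 1000 bytes, i.e. right at its limit.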
edsavage added a commit to elastic/ml-cpp that referenced this issue May 18, 2019
4 participants