[ML] Add new categorization stats to model_size_stats #51879
This change adds support for the following new model_size_stats fields:
- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- categorization_status
Pinging @elastic/ml-core (:ml)
This is labelled
Questions outside this PR:
Is there a plan to add a dead flag to the categories that are superseded? Or are dead categories just not returned via the results API?
Similarly, I wonder whether it would be worth including some simple distribution/ranking data in the categories. Knowing which categories are common and which are rare might be useful.
@@ -183,6 +183,31 @@ protected Table getTableWithHeader(RestRequest request) {
TableColumnAttributeBuilder.builder("number of bucket allocation failures", false)
.setAliases("mbaf", "modelBucketAllocationFailures")
.build());
table.addCell("model.categorization_status",
TableColumnAttributeBuilder.builder("current categorization status")
Since this is configured to be included by default, you will probably have to update the cat job YAML test.
Hmm, since only a small proportion of jobs use categorization I should probably change this one so it's not included by default.
This is a possibility. The dead categories could contain a field that says which other category killed them.
They have to be returned forever, because we could have generated anomalies that reference them. (It would be unusual, as dead categories tend to occur early on in a job's lifetime, before we start generating anomalies, but nevertheless possible.) Still, I agree it would be useful to have the information.
This has been discussed before, and the problem is how often to update the information. The running C++ process already knows both the match count and rank of every category. However, if the category documents were rewritten every time one of these changed, every input document would cause a result to be written, and the resulting indexing load would be unacceptable. So the match count and rank would need to be updated only occasionally. We would need to make sure that the way "occasionally" was implemented didn't invalidate the likely uses of these statistics.
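To make the "occasionally" trade-off concrete, here is a minimal sketch of one possible throttling rule: rewrite a category document only when its match count has grown by a fixed factor since the last write. This is an illustration only, not how ml-cpp or Elasticsearch behaves; the class `ThrottledCategoryStats`, the method `shouldRewrite` and the `growthFactor` parameter are all hypothetical, and rank changes are not handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Elasticsearch or ml-cpp.
final class ThrottledCategoryStats {
    // Match count most recently persisted for each category ID.
    private final Map<Integer, Long> lastWritten = new HashMap<>();
    // A rewrite is allowed once the count reaches lastWritten * growthFactor.
    private final double growthFactor;

    ThrottledCategoryStats(double growthFactor) {
        if (growthFactor <= 1.0) {
            throw new IllegalArgumentException("growthFactor must be > 1");
        }
        this.growthFactor = growthFactor;
    }

    /** Returns true if the category document should be rewritten now. */
    boolean shouldRewrite(int categoryId, long matchCount) {
        Long previous = lastWritten.get(categoryId);
        if (previous == null || matchCount >= previous * growthFactor) {
            lastWritten.put(categoryId, matchCount);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ThrottledCategoryStats stats = new ThrottledCategoryStats(2.0);
        int rewrites = 0;
        // Simulate 1000 input documents that all match category 1.
        for (long count = 1; count <= 1000; count++) {
            if (stats.shouldRewrite(1, count)) {
                rewrites++;
            }
        }
        System.out.println("input documents: 1000, rewrites: " + rewrites);
    }
}
```

With a rule like this the number of writes grows logarithmically with the match count, but the stored count can lag the true value by up to the chosen factor. Whether that staleness is acceptable is exactly the open question about the likely uses of these statistics.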
The new stats will be available from 7.7.0, but the BWC tests in master need disabling until the backport PR is merged. Relates elastic#51879 Relates elastic#52009
This change adds support for the following new model_size_stats fields:
- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Backport of #51879
In elastic#51146 a rudimentary check for poor categorization was added to 7.6. This change replaces that warning based on a Java-side check with a new one based on the categorization_status field that the ML C++ sets. categorization_status was added in 7.7 and above by elastic#51879, so this new warning based on more advanced conditions will also be in 7.7 and above. Closes elastic#50749
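As a rough illustration of a warning driven by categorization_status rather than by a Java-side heuristic, the sketch below maps the reported status to an optional warning message. It is not the code from the change: the class `CategorizationWarning`, the `Status` enum, the method `warningFor` and the message text are hypothetical, and the status values "ok" and "warn" are an assumption not stated in this thread.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch: names and message text are illustrative only.
final class CategorizationWarning {

    // Assumed values of categorization_status; not confirmed by this thread.
    enum Status {
        OK, WARN;

        static Status fromString(String value) {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
    }

    /** Returns a warning for the job if the reported status calls for one. */
    static Optional<String> warningFor(String jobId, String rawStatus) {
        Status status = Status.fromString(rawStatus);
        if (status == Status.WARN) {
            return Optional.of("Job [" + jobId + "] categorization status is [warn]");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(warningFor("my-job", "ok").isPresent());
        System.out.println(warningFor("my-job", "warn").orElse("no warning"));
    }
}
```

The point of the design is that the decision about what counts as poor categorization stays with the process that sets the field, and the Java side only reacts to the reported value.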
…2195) In #51146 a rudimentary check for poor categorization was added to 7.6. This change replaces that warning based on a Java-side check with a new one based on the categorization_status field that the ML C++ sets. categorization_status was added in 7.7 and above by #51879, so this new warning based on more advanced conditions will also be in 7.7 and above. Closes #50749
This change adds support for the following new model_size_stats
fields:
Relates #50749