Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Add new categorization stats to model_size_stats #51879

Merged
merged 6 commits into from
Feb 6, 2020

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Feb 4, 2020

This change adds support for the following new model_size_stats
fields:

  • categorized_doc_count
  • total_category_count
  • frequent_category_count
  • rare_category_count
  • dead_category_count
  • categorization_status

Relates #50749

This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- categorization_status
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Copy link
Contributor Author

This is labelled >non-issue because a corresponding ml-cpp PR will populate the stats and the release note will be added as part of that PR.

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Questions outside this PR:

Is there a plan to add a dead flag to the categories that are superseded? Or are dead categories just not returned via the results API?

Similarly, I wonder if including some simple distribution/ranking data in the categories is also good. Knowing common/rare categories might be useful

@@ -183,6 +183,31 @@ protected Table getTableWithHeader(RestRequest request) {
TableColumnAttributeBuilder.builder("number of bucket allocation failures", false)
.setAliases("mbaf", "modelBucketAllocationFailures")
.build());
table.addCell("model.categorization_status",
TableColumnAttributeBuilder.builder("current categorization status")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is configured to be included by default, you will probably have to update the cat job yaml test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, since only a small proportion of jobs use categorization I should probably change this one so it's not included by default.

@droberts195
Copy link
Contributor Author

Is there a plan to add a dead flag to the categories that are superseded?

This is a possibility. The dead categories could contain a field that says which other category killed them.

Or are dead categories just not returned via the results API?

They have to be returned forever, because we could have generated anomalies that reference them. (It would be unusual, as dead categories tend to occur early on in a job's lifetime, before we start generating anomalies, but nevertheless possible.) Still, I agree it would be useful to have the information.

Similarly, I wonder if including some simple distribution/ranking data in the categories is also good. Knowing common/rare categories might be useful

This has been discussed before, and the problem is how often to update the information. The running C++ process already knows both the match count and rank of every category. However, if the category documents were rewritten every time one of these changed then that would lead to every input document causing a result to be written, and the resulting indexing load would be unacceptable. So the match count and rank would need to be updated just occasionally. We would need to make sure that the way "occasionally" was done didn't invalidate the likely uses of these statistics.

@droberts195 droberts195 merged commit 72346b9 into elastic:master Feb 6, 2020
@droberts195 droberts195 deleted the categorizer_model_stats branch February 6, 2020 17:08
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Feb 7, 2020
The new stats will be available from 7.7.0, but the
BWC tests in master need disabling until the backport
PR is merged.

Relates elastic#51879
Relates elastic#52009
droberts195 added a commit that referenced this pull request Feb 10, 2020
The new stats will be available from 7.7.0, but the
BWC tests in master need disabling until the backport
PR is merged.

Relates #51879
Relates #52009
droberts195 added a commit that referenced this pull request Feb 10, 2020
This change adds support for the following new model_size_stats
fields:

- categorized_doc_count
- total_category_count
- frequent_category_count
- rare_category_count
- dead_category_count
- categorization_status

Backport of #51879
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Feb 11, 2020
In elastic#51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by elastic#51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes elastic#50749
droberts195 added a commit that referenced this pull request Feb 11, 2020
…2195)

In #51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by #51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes #50749
droberts195 added a commit that referenced this pull request Feb 11, 2020
…2195)

In #51146 a rudimentary check for poor categorization was added to
7.6.

This change replaces that warning based on a Java-side check with
a new one based on the categorization_status field that the ML C++
sets.  categorization_status was added in 7.7 and above by #51879,
so this new warning based on more advanced conditions will also be
in 7.7 and above.

Closes #50749
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants