-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Add new categorization stats to model_size_stats #989
[ML] Add new categorization stats to model_size_stats #989
Conversation
This change adds support for the following new model_size_stats fields: - categorized_doc_count - total_category_count - frequent_category_count - rare_category_count - dead_category_count - categorization_status Relates #50749
This PR will fail CI until elastic/elasticsearch#51879 is merged. The Java side change must be merged first, but I wanted to open this now so they can be reviewed together. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a pass through: looks good Dave. I made some minor tidy up suggestions.
testIter != uniqueTokenIds.end()) { | ||
if (commonIter->first < testIter->first) { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It strikes me that this becomes somewhat cleaner using std::find
, i.e.
while (commonIter != m_CommonUniqueTokenIds.end()) {
testIter = std::find(testIter, uniqueTokenIds.end(), *commonIter,
[](const auto& lhs, const auto& rhs) { return lhs->first >= rhs->first; });
if (testIter == uniqueTokenIds.end() || *testIter != *commonIter) {
return false;
}
++commonIter;
}
I guess this is just moving well tested code to a different place, so feel free to leave as if you'd rather not change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I more-or-less did this. The second part of the condition has to be testIter->first != commonIter->first || testIter->second != commonIter->second
instead of simply *testIter != *commonIter
because in some cases one is a std::pair<const std::size_t, std::size_t>
and the other is a std::pair<std::size_t, std::size_t>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, great. It's a shame comparison doesn't work with const. That feels like an error in the std::pair implementation.
include/model/CResourceMonitor.h
Outdated
std::size_t s_FrequentCategories; | ||
std::size_t s_RareCategories; | ||
std::size_t s_DeadCategories; | ||
model_t::ECategorizationStatus s_CategorizationStatus; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you're only extending, but should this provide defaults? It seems like this object is a recipe for uninitialised variables as it stands.
BOOST_TEST_REQUIRE(modelSizeStats.HasMember("model_bytes_exceeded")); | ||
BOOST_REQUIRE_EQUAL(int64_t(0), modelSizeStats["model_bytes_exceeded"].GetInt64()); | ||
BOOST_TEST_REQUIRE(modelSizeStats.HasMember("model_bytes_memory_limit")); | ||
BOOST_REQUIRE_EQUAL(int64_t(50000), | ||
modelSizeStats["model_bytes_memory_limit"].GetInt64()); | ||
BOOST_TEST_REQUIRE(modelSizeStats.HasMember("categorized_doc_count")); | ||
BOOST_REQUIRE_EQUAL(int64_t(1000), modelSizeStats["categorized_doc_count"].GetInt64()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe include cstdint
and std::int64_t
based on our guidelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file was also dodgy before my changes because many expected results should have been unsigned rather than signed. I copied that bug into my new lines, but I'll correct that throughout the file at the same time as adding std::
.
lib/model/CResourceMonitor.cc
Outdated
res.s_FrequentCategories = 0; | ||
res.s_RareCategories = 0; | ||
res.s_DeadCategories = 0; | ||
res.s_CategorizationStatus = model_t::E_CategorizationStatusOk; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd make all these defaults in the class definition.
lib/model/CResourceMonitor.cc
Outdated
// - At least 100 messages have been categorized | ||
// and one of the following holds: | ||
// - There is only 1 category | ||
// - More than 90% of categories have 1 message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// - More than 90% of categories have 1 message | |
// - More than 90% of categories are rare |
lib/model/CResourceMonitor.cc
Outdated
(res.s_TotalCategories == 1 || 10 * res.s_RareCategories > 9 * res.s_TotalCategories || | ||
2 * res.s_TotalCategories > res.s_CategorizedMessages || | ||
res.s_FrequentCategories == 0 || 2 * res.s_DeadCategories > res.s_TotalCategories)) { | ||
res.s_CategorizationStatus = model_t::E_CategorizationStatusPoor; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like an odd place for this logic: could it perhaps be better encapsulated. Maybe a static member on some class related to categorisation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Come to think of it I can just put this in CTokenListDataCategorizerBase::updateModelSizeStats()
. I was thinking we'd only want to do this calculation once if we start doing per-partition categorization, but since the stats it's based on will accumulate over the partitions the last partition we calculated the status for would use the overall numbers and that's the value that would get reported. The cost is that we evaluate some 5 simple comparisons per-partition instead of once but that's not going to take long and I agree the encapsulation is much better moving this formula into categorization code.
(Plus of course while we're not doing per-partition categorization the cost concerns are all theoretical anyway as we only have one categorizer object.)
compCategory.maxMatchingStringLen() >= rawStringLen && | ||
compCategory.isMissingCommonTokenWeightZero(m_WorkTokenUniqueIds) && | ||
compCategory.containsCommonTokensInOrder(m_WorkTokenIds)); | ||
bool matchesSearch( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: braces initialization
@@ -539,6 +528,47 @@ std::size_t CTokenListDataCategorizerBase::memoryUsage() const { | |||
return mem; | |||
} | |||
|
|||
void CTokenListDataCategorizerBase::updateMemoryResults(CResourceMonitor::SResults& results) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about calling this updateMemoryResults
. This includes other information now. Maybe updateTelemetry
? I guess there is a slight issue because we've subverted CResourceMonitor
to do more than monitor resource usage. Perhaps we should think about renaming this in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CLion makes these renamings pretty trivial so how about updateModelSizeStats
? Then naming will be consistent through C++, Java and JavaScript, which might make it easier to trace how these numbers flow through the code.
Many of the new stats are related to resource use - certainly total_category_count
, rare_category_count
and dead_category_count
are indicators of how much memory categorization is using, and whether that usage is suboptimal. To put the other stats somewhere else would just create complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok cool. I definitely think that's better.
To put the other stats somewhere else would just create complexity.
Btw, I completely agree we don't want these somewhere else. It's more I wonder if we should think about renaming CResourceMonitor
at some point. I'm happy for that not to happen in this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ to not renaming CResourceMonitor
in this PR. That would touch a very large number of files unrelated to categorization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you're waiting for Java changes to land, but LGTM when you're ready.
This change adds support for the following new model_size_stats fields: - categorized_doc_count - total_category_count - frequent_category_count - rare_category_count - dead_category_count - categorization_status Backport of elastic#989
This change adds support for the following new model_size_stats fields: - categorized_doc_count - total_category_count - frequent_category_count - rare_category_count - dead_category_count - categorization_status Backport of #989
This change adds support for the following new model_size_stats
fields:
Relates elastic/elasticsearch#50749