Require that the dependent variable column has at most 2 distinct values in classification analysis. #47858
Pinging @elastic/ml-core (:ml)
run elasticsearch-ci/2
} else {
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder().size(0);
    for (String fieldName : fieldCardinalityLimits.keySet()) {
        searchSourceBuilder.aggregation(AggregationBuilders.cardinality(fieldName).field(fieldName));
I think we should set precision_threshold for the cardinality agg to equal fieldCardinalityLimits.get(fieldName) + 1. This will greatly reduce the memory utilization for the aggregation.
Done.
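As a rough illustration of the suggestion above, the sketch below derives a per-field precision threshold of limit + 1 from a map of cardinality limits. The class name PrecisionThresholds, the method forLimits, and the field name dependent_variable are hypothetical and not taken from the PR; in the real code the resulting value would presumably be handed to the cardinality aggregation builder rather than collected in a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: computes the precision threshold
// proposed in the review (limit + 1) for every field that has a limit.
final class PrecisionThresholds {

    static Map<String, Long> forLimits(Map<String, Long> fieldCardinalityLimits) {
        Map<String, Long> thresholds = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : fieldCardinalityLimits.entrySet()) {
            // limit + 1 is enough to distinguish "within the limit" from "over the limit".
            thresholds.put(entry.getKey(), entry.getValue() + 1);
        }
        return thresholds;
    }

    public static void main(String[] args) {
        Map<String, Long> limits = new LinkedHashMap<>();
        limits.put("dependent_variable", 2L); // assumed field name, for illustration only
        System.out.println(forLimits(limits));
    }
}
```

The idea is that the check only needs to tell whether a field exceeds its limit, so the aggregation does not need to stay precise for counts far above that limit.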
ActionListener<SearchResponse> checkCardinalityHandler = ActionListener.wrap(
    searchResponse -> {
        Map<String, Long> fieldCardinalityLimits = config.getAnalysis().getFieldCardinalityLimits();
        if (fieldCardinalityLimits.isEmpty() == false) {
I think searchResponse != null is better. The response should NEVER be null unless the caller passed back null.
Done.
}
Aggregations aggs = searchResponse.getAggregations();
if (aggs == null) {
    listener.onFailure(ExceptionsHelper.badRequestException("aggs == null"));
Users are going to see this, and they will have no idea what it means. We should make it something better: "Unexpected null response when gathering field cardinalities" or something. Additionally, I don't think this is a badRequest, which would mean the user did something wrong. It seems to me that this is an internal server error.
I was hesitant about using serverError, but I agree it might be better here.
Done.
Long limit = entry.getValue();
Cardinality cardinality = aggs.get(fieldName);
if (cardinality == null) {
    listener.onFailure(ExceptionsHelper.badRequestException("cardinality == null"));
Similar to above: the user should get a better error than this, and I am not convinced it is a badRequest.
Done.
if (cardinality.getValue() > limit) {
    listener.onFailure(
        ExceptionsHelper.badRequestException(
            "Field [{}] must have at most [{}] distinct values but there were [{}]",
- "Field [{}] must have at most [{}] distinct values but there were [{}]",
+ "Field [{}] must have at most [{}] distinct values but there were at least [{}]",
Cardinality is approximate, so we should report the number it returned as a lower bound.
Done.
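The review points in the threads above (readable messages, server error for unexpected nulls, "at least" wording for the approximate count) can be put together in a minimal, self-contained sketch. Everything here is an assumption made for illustration: the class CardinalityCheck, the two exception types, and the plain maps standing in for the search response are not the PR's actual code, and the per-field null message is invented in the spirit of the reviewer's suggested wording.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; plain maps stand in for the aggregation results.
final class CardinalityCheck {

    // Hypothetical stand-ins for the "bad request" and "server error" kinds discussed above.
    static final class BadRequestException extends RuntimeException {
        BadRequestException(String message) { super(message); }
    }

    static final class ServerErrorException extends RuntimeException {
        ServerErrorException(String message) { super(message); }
    }

    static void validate(Map<String, Long> limits, Map<String, Long> approxCardinalities) {
        if (approxCardinalities == null) {
            // Not the user's fault, so this is reported as a server error.
            throw new ServerErrorException("Unexpected null response when gathering field cardinalities");
        }
        for (Map.Entry<String, Long> entry : limits.entrySet()) {
            String fieldName = entry.getKey();
            long limit = entry.getValue();
            Long cardinality = approxCardinalities.get(fieldName);
            if (cardinality == null) {
                throw new ServerErrorException(
                    "Unexpected null response when gathering cardinality for field [" + fieldName + "]");
            }
            if (cardinality > limit) {
                // The count is approximate, so the message presents it as a lower bound.
                throw new BadRequestException(String.format(
                    "Field [%s] must have at most [%d] distinct values but there were at least [%d]",
                    fieldName, limit, cardinality));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Long> limits = new LinkedHashMap<>();
        limits.put("label", 2L); // assumed field name
        Map<String, Long> observed = new LinkedHashMap<>();
        observed.put("label", 3L);
        try {
            validate(limits, observed);
        } catch (BadRequestException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the two exception kinds separate lets a caller map them to different response statuses, which is the distinction the reviewer asked for.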
@@ -173,7 +180,7 @@ private static void validateIndexAndExtractFields(Client client,
                                                  ActionListener<ExtractedFields> listener) {
     AtomicInteger docValueFieldsLimitHolder = new AtomicInteger();

-    // Step 3. Extract fields (if possible) and notify listener
+    // Step 4. Extract fields (if possible) and notify listener
Since the target field is a required field, we should probably verify that the index has all the fields required BEFORE we gather the cardinality limits.
Done.
…ues in classification analysis.
c153bfb to ae6201d
run elasticsearch-ci/packaging-sample-matrix
…ues in classification analysis. (elastic#47858)
Since classification analysis in 7.5 focuses only on binomial (i.e. with 2 classes) classification, we should validate that the cardinality of the dependent variable is at most two.
Relates #46735