Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require that the dependent variable column has at most 2 distinct values in classfication analysis. #47858

Merged

Conversation

przemekwitek
Copy link
Contributor

Since classification analysis in 7.5 focuses only on binomial (i.e. with 2 classes) classification, we should validate that the cardinality of the dependent variable is at most two.

Relates #46735

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@przemekwitek
Copy link
Contributor Author

run elasticsearch-ci/2

@benwtrent benwtrent self-requested a review October 10, 2019 14:58
} else {
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder().size(0);
for (String fieldName : fieldCardinalityLimits.keySet()) {
searchSourceBuilder.aggregation(AggregationBuilders.cardinality(fieldName).field(fieldName));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should set precision_threshold for the cardinality agg to equal fieldCardinalityLimits.get(fieldName) + 1. This will greatly reduce the memory utilization for the aggregation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ActionListener<SearchResponse> checkCardinalityHandler = ActionListener.wrap(
searchResponse -> {
Map<String, Long> fieldCardinalityLimits = config.getAnalysis().getFieldCardinalityLimits();
if (fieldCardinalityLimits.isEmpty() == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

searchResponse != null I think is better. The response should NEVER be null unless the caller passed back null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
Aggregations aggs = searchResponse.getAggregations();
if (aggs == null) {
listener.onFailure(ExceptionsHelper.badRequestException("aggs == null"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users are gonna see this, they will have no idea what this means...

We should make it something better. Unexpected null response when gathering field cardinalities or something. Additionally, I don't think this is a badRequest meaning the user did something wrong. It seems to be that this is an internal server error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hesitating with using serverError but I agree it might be better here.

Done.

Long limit = entry.getValue();
Cardinality cardinality = aggs.get(fieldName);
if (cardinality == null) {
listener.onFailure(ExceptionsHelper.badRequestException("cardinality == null"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, user should get a better error than this, and I am not convinced it is badRequest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (cardinality.getValue() > limit) {
listener.onFailure(
ExceptionsHelper.badRequestException(
"Field [{}] must have at most [{}] distinct values but there were [{}]",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Field [{}] must have at most [{}] distinct values but there were [{}]",
"Field [{}] must have at most [{}] distinct values but there were at least [{}]",

Cardinality is approximate. So, we should give the number it returned as the bottom limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -173,7 +180,7 @@ private static void validateIndexAndExtractFields(Client client,
ActionListener<ExtractedFields> listener) {
AtomicInteger docValueFieldsLimitHolder = new AtomicInteger();

// Step 3. Extract fields (if possible) and notify listener
// Step 4. Extract fields (if possible) and notify listener
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the target field is a required field, we should probably verify that the index has all the fields required BEFORE we gather the cardinality limits.

Copy link
Contributor Author

@przemekwitek przemekwitek Oct 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@przemekwitek przemekwitek force-pushed the protect_against_more_than_two_classes branch from c153bfb to ae6201d Compare October 11, 2019 07:25
@przemekwitek
Copy link
Contributor Author

run elasticsearch-ci/packaging-sample-matrix

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@przemekwitek przemekwitek merged commit 390a829 into elastic:master Oct 11, 2019
@przemekwitek przemekwitek deleted the protect_against_more_than_two_classes branch October 11, 2019 11:50
przemekwitek added a commit to przemekwitek/elasticsearch that referenced this pull request Oct 11, 2019
przemekwitek added a commit that referenced this pull request Oct 11, 2019
howardhuanghua pushed a commit to TencentCloudES/elasticsearch that referenced this pull request Oct 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants