
[ML] Make ml_standard tokenizer the default for new categorization jobs #72805

Merged

Conversation

droberts195
Contributor

@droberts195 droberts195 commented May 6, 2021

Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
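The splitting behaviour described above can be illustrated with plain Java string handling. This is not the real `ml_classic` tokenizer, just a rough sketch of the rule it follows (splitting on non-word characters such as slashes and colons), to show why a URL fragments into several tokens:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSplitDemo {
    public static void main(String[] args) {
        String message = "https://example.com/var/log/app.log";
        // Splitting on runs of non-word, non-dot characters (which
        // includes '/' and ':') fragments the URL into several tokens,
        // roughly mimicking the ml_classic behaviour described above.
        List<String> classicStyle = Arrays.asList(message.split("[^\\w.]+"));
        System.out.println(classicStyle);
        // -> [https, example.com, var, log, app.log]
        // An ml_standard-style tokenizer would instead aim to keep the
        // whole URL as a single token.
    }
}
```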

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label May 6, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)


Map<String, Object> firstLineOnlyCharFilter = new HashMap<>();
firstLineOnlyCharFilter.put("type", "pattern_replace");
firstLineOnlyCharFilter.put("pattern", "\\\\n.*");
Member

I wonder if we should be concerned if for some reason the first line is blank (e.g. the message in _source starts with \n)?

Member

Just tested this, and it doesn't work. I think it should be firstLineOnlyCharFilter.put("pattern", "\\n.*");

Member

I mean, it didn't work for logs like:

log line 1
log line 2

I think the double escape looks for the literal characters \n.
So, it would probably work on logs like

log line 1\nlog line 2
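The escaping point in this thread can be checked with plain java.util.regex: in Java source, `"\\n"` is the regex `\n` (matches a real newline), while `"\\\\n"` is the regex `\\n` (matches a literal backslash followed by `n`). A minimal demo:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        String multiLine = "log line 1\nlog line 2";
        // "\\n" in Java source is the regex \n, matching a real newline,
        // so everything from the newline onward is stripped.
        String singleEscape = multiLine.replaceAll("\\n.*", "");
        System.out.println(singleEscape); // -> log line 1
        // "\\\\n" is the regex \\n, matching a literal backslash + 'n',
        // which this string does not contain, so nothing is stripped.
        String doubleEscape = multiLine.replaceAll("\\\\n.*", "");
        System.out.println(doubleEscape.equals(multiLine)); // -> true
    }
}
```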

@@ -588,14 +588,13 @@ public void testUpdateJob() throws Exception {
.setDescription("My description") // <2>
.setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
.setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
.setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
Member

Looking at JobUpdate.java, this is still valid. Why is it removed here in the test?

Contributor Author

It doesn't work any more, because you cannot configure both categorization filters and a categorization analyzer, and now every newly created job has a categorization analyzer.

You cannot update that categorization analyzer because that could completely change the way categorization is done, and ruin the stability of the categories.

Arguably we should never have let people update the categorization filters for the same reason. It's like we don't allow people to change the detectors after a job was created - if they change something so fundamental to what the job is doing then it's not really the same job after the update.

Member

Cool, then it seems a later PR should deprecate setting them, and then we can remove it in 8.

"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"));
builder.addTokenFilter(tokenFilter);
return new CategorizationAnalyzerConfig.Builder()
Member

This builder format is so much nicer++

Comment on lines +241 to +242
} else if (analysisConfig.getCategorizationFieldName() != null
&& minNodeVersion.onOrAfter(MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER)) {
Member

This does seem like a good option.

I was wondering, though, whether it was considered to use the Job.version as the predicate for which default categorization to use?

Do we ever update Job.version?

Ah, looking at MlConfigMigrator

        // These jobs cannot be opened, we rely on the missing version
        // to indicate this.
        // See TransportOpenJobAction.validate()
        if (jobVersion != null) {
            builder.setJobVersion(Version.CURRENT);
        }

I suppose that job.version is indeed mutable :/

@droberts195 droberts195 merged commit 0059c59 into elastic:master Jun 1, 2021
@droberts195 droberts195 deleted the ml_standard_tokenizer_for_new_cat_jobs branch June 1, 2021 14:11
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 1, 2021
Backport of elastic#72805
droberts195 added a commit that referenced this pull request Jun 2, 2021
…bs (#73605)

Backport of #72805
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v7.14.0 v8.0.0-alpha1

Successfully merging this pull request may close these issues.

[ML] Higher weighting for tokens at the beginning of messages during categorization
4 participants