[ML] Make ml_standard tokenizer the default for new categorization jobs #72805
Conversation
Categorization jobs created once the entire cluster is upgraded to version 7.14 or higher will default to using the new ml_standard tokenizer rather than the previous default of the ml_classic tokenizer. The difference between the ml_classic and ml_standard tokenizers is that ml_classic splits on slashes and colons, so it creates multiple tokens from URLs and filesystem paths, whereas ml_standard attempts to keep URLs and filesystem paths as single tokens. It is still possible to configure the ml_classic tokenizer if you prefer: just provide a categorization_analyzer within your analysis_config, and whichever tokenizer you choose (ml_classic or any other Elasticsearch tokenizer) will be used.
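For illustration, here is a minimal sketch (not code from this PR) of the shape such a config takes, using the same Map-based style as the char filter code discussed below; the field names are the standard anomaly detection job JSON keys:

```java
import java.util.HashMap;
import java.util.Map;

public class ClassicTokenizerConfigSketch {
    public static void main(String[] args) {
        // Explicitly pin the old ml_classic tokenizer; providing any
        // categorization_analyzer overrides the new default.
        Map<String, Object> categorizationAnalyzer = new HashMap<>();
        categorizationAnalyzer.put("tokenizer", "ml_classic");

        Map<String, Object> analysisConfig = new HashMap<>();
        analysisConfig.put("categorization_field_name", "message");
        analysisConfig.put("categorization_analyzer", categorizationAnalyzer);

        System.out.println(analysisConfig);
    }
}
```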
Pinging @elastic/ml-core (Team:ML)
    Map<String, Object> firstLineOnlyCharFilter = new HashMap<>();
    firstLineOnlyCharFilter.put("type", "pattern_replace");
    firstLineOnlyCharFilter.put("pattern", "\\\\n.*");
I wonder if we should be concerned if for some reason the first line is blank (e.g. the message in _source starts with \n)?
Just tested this, and it doesn't work. I think it should be firstLineOnlyCharFilter.put("pattern", "\\n.*");
I mean, it didn't work for logs like:

    log line 1
    log line 2

I think the double escape looks for the literal characters \n (a backslash followed by an n). So, it probably would work on logs like:

    log line 1\nlog line 2
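To make the escaping concrete, here is a small standalone demo (not code from this PR) of the difference between the Java string literals "\\n" and "\\\\n" when used as regex patterns:

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    public static void main(String[] args) {
        String twoLines = "log line 1\nlog line 2";            // contains a real newline
        String literalBackslash = "log line 1\\nlog line 2";   // contains backslash + n

        // Java "\\n" is the regex \n, which matches a newline character.
        Pattern newline = Pattern.compile("\\n.*");
        // Java "\\\\n" is the regex \\n, which matches a literal backslash
        // followed by n, not a newline.
        Pattern escaped = Pattern.compile("\\\\n.*");

        System.out.println(newline.matcher(twoLines).replaceAll(""));         // log line 1
        System.out.println(escaped.matcher(twoLines).replaceAll(""));         // unchanged
        System.out.println(escaped.matcher(literalBackslash).replaceAll("")); // log line 1
    }
}
```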
    @@ -588,14 +588,13 @@ public void testUpdateJob() throws Exception {
            .setDescription("My description") // <2>
            .setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
            .setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
            .setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
Looking at JobUpdate.java, this is still valid. Why is it removed here in the test?
It doesn't work any more, because you cannot configure both categorization filters and a categorization analyzer, and now every newly created job has a categorization analyzer.
You cannot update that categorization analyzer because that could completely change the way categorization is done, and ruin the stability of the categories.
Arguably we should never have let people update the categorization filters, for the same reason. It's like how we don't allow people to change the detectors after a job is created: if you change something so fundamental to what the job is doing then it's not really the same job after the update.
Cool, then it seems a later PR should deprecate setting them, and then we can remove it in 8.
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", | ||
"GMT", "UTC")); | ||
builder.addTokenFilter(tokenFilter); | ||
return new CategorizationAnalyzerConfig.Builder() |
This builder format is so much nicer++
    } else if (analysisConfig.getCategorizationFieldName() != null
            && minNodeVersion.onOrAfter(MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER)) {
This does seem like a good option. I was wondering, though, whether it was considered to use Job.version as the predicate for which default categorization analyzer to use? Do we ever update Job.version?
Ah, looking at MlConfigMigrator:

    // These jobs cannot be opened, we rely on the missing version
    // to indicate this.
    // See TransportOpenJobAction.validate()
    if (jobVersion != null) {
        builder.setJobVersion(Version.CURRENT);
    }
I suppose that job.version is indeed mutable :/
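For reference, a self-contained sketch of the gating logic from the diff above; the stand-in Version type and the helper name are illustrative, and only the shape of the check (gate on the cluster's minimum node version, not on the mutable Job.version) mirrors the real code:

```java
public class VersionGateSketch {

    // Minimal stand-in for org.elasticsearch.Version.
    static class Version {
        final int major, minor;
        Version(int major, int minor) { this.major = major; this.minor = minor; }
        boolean onOrAfter(Version other) {
            return major != other.major ? major > other.major : minor >= other.minor;
        }
    }

    static final Version MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER = new Version(7, 14);

    // Hypothetical helper: picks the default tokenizer for a new job.
    static String defaultTokenizer(String categorizationFieldName, Version minNodeVersion) {
        // Every node must understand ml_standard before it can become the
        // default, hence the check against the minimum node version.
        if (categorizationFieldName != null
                && minNodeVersion.onOrAfter(MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER)) {
            return "ml_standard";
        }
        return "ml_classic";
    }

    public static void main(String[] args) {
        System.out.println(defaultTokenizer("message", new Version(7, 14))); // ml_standard
        System.out.println(defaultTokenizer("message", new Version(7, 13))); // ml_classic
    }
}
```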
Backport of #72805 merged as #73605.
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.
The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
If no categorization_analyzer is specified but categorization_filters
are specified then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
Closes elastic/ml-cpp#1724
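A hedged sketch (not the PR's implementation) of the conversion described above, reusing the Map-based style from the diff: each legacy categorization filter becomes a pattern_replace char filter placed after first_non_blank_line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FilterConversionSketch {

    static List<Object> charFiltersFor(List<String> categorizationFilters) {
        List<Object> charFilters = new ArrayList<>();
        charFilters.add("first_non_blank_line"); // always runs first
        for (String filter : categorizationFilters) {
            // Each filter pattern erases its matches, mimicking the old
            // categorization_filters behaviour.
            Map<String, Object> charFilter = new HashMap<>();
            charFilter.put("type", "pattern_replace");
            charFilter.put("pattern", filter);
            charFilters.add(charFilter);
        }
        return charFilters;
    }

    public static void main(String[] args) {
        // Example: strip ISO dates before tokenization.
        System.out.println(charFiltersFor(Arrays.asList("\\d{4}-\\d{2}-\\d{2}")));
    }
}
```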