-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Make ml_standard tokenizer the default for new categorization jobs #72805
Merged
droberts195
merged 20 commits into
elastic:master
from
droberts195:ml_standard_tokenizer_for_new_cat_jobs
Jun 1, 2021
Merged
Changes from all commits
Commits
Show all changes
20 commits
Select commit
Hold shift + click to select a range
29f759b
[ML] Make ml_standard tokenizer the default for new categorization jobs
droberts195 94414bf
Fixing tests
droberts195 c016c4e
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 98babb0
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 61529b5
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 7727f9b
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 512b7ae
Incorporating first line only into default ML analyzer
droberts195 2b4f6b4
Fix tests
droberts195 586d937
Don't run analyze rest test in security suite
droberts195 5567bf4
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 39bb837
Fixing first line filter and adding tests
droberts195 6eebe22
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 e54acb1
Fixes for an expected value beginning with a dollar sign
droberts195 5678c96
Fix more YAML test regex problems
droberts195 14df66a
Switching to a dedicated char filter
droberts195 9619c62
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 4d25e97
Bug fixes
droberts195 8f711b2
Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs
droberts195 724e25a
Improve comment and remove unused import
droberts195 bdbb6e9
Skip analyzer tests in ml-with-security
droberts195 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -145,38 +145,39 @@ static CategorizationAnalyzerConfig buildFromXContentFragment(XContentParser par | |
} | ||
|
||
/** | ||
* Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the ML C++ | ||
* code do. This is the default analyzer for categorization to ensure that people upgrading from previous versions | ||
* Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the original ML | ||
* C++ code do. This is the default analyzer for categorization to ensure that people upgrading from old versions | ||
* get the same behaviour from their categorization jobs before and after upgrade. | ||
* @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>. | ||
* @return The default categorization analyzer. | ||
*/ | ||
public static CategorizationAnalyzerConfig buildDefaultCategorizationAnalyzer(List<String> categorizationFilters) { | ||
|
||
CategorizationAnalyzerConfig.Builder builder = new CategorizationAnalyzerConfig.Builder(); | ||
|
||
if (categorizationFilters != null) { | ||
for (String categorizationFilter : categorizationFilters) { | ||
Map<String, Object> charFilter = new HashMap<>(); | ||
charFilter.put("type", "pattern_replace"); | ||
charFilter.put("pattern", categorizationFilter); | ||
builder.addCharFilter(charFilter); | ||
} | ||
} | ||
|
||
builder.setTokenizer("ml_classic"); | ||
|
||
Map<String, Object> tokenFilter = new HashMap<>(); | ||
tokenFilter.put("type", "stop"); | ||
tokenFilter.put("stopwords", Arrays.asList( | ||
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", | ||
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", | ||
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", | ||
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", | ||
"GMT", "UTC")); | ||
builder.addTokenFilter(tokenFilter); | ||
return new CategorizationAnalyzerConfig.Builder() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This builder format is so much nicer++ |
||
.addCategorizationFilters(categorizationFilters) | ||
.setTokenizer("ml_classic") | ||
.addDateWordsTokenFilter() | ||
.build(); | ||
} | ||
|
||
return builder.build(); | ||
/** | ||
* Create a <code>categorization_analyzer</code> that will be used for newly created jobs where no categorization | ||
* analyzer is explicitly provided. This analyzer differs from the default one in that it uses the <code>ml_standard</code> | ||
* tokenizer instead of the <code>ml_classic</code> tokenizer, and it only considers the first non-blank line of each message. | ||
* This analyzer is <em>not</em> used for jobs that specify no categorization analyzer, as that would break jobs that were | ||
* originally run in older versions. Instead, this analyzer is explicitly added to newly created jobs once the entire cluster | ||
* is upgraded to version 7.14 or above. | ||
* @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>. | ||
* @return The standard categorization analyzer. | ||
*/ | ||
public static CategorizationAnalyzerConfig buildStandardCategorizationAnalyzer(List<String> categorizationFilters) { | ||
|
||
return new CategorizationAnalyzerConfig.Builder() | ||
.addCharFilter("first_non_blank_line") | ||
.addCategorizationFilters(categorizationFilters) | ||
.setTokenizer("ml_standard") | ||
.addDateWordsTokenFilter() | ||
.build(); | ||
} | ||
|
||
private final String analyzer; | ||
|
@@ -311,6 +312,18 @@ public Builder addCharFilter(Map<String, Object> charFilter) { | |
return this; | ||
} | ||
|
||
public Builder addCategorizationFilters(List<String> categorizationFilters) { | ||
if (categorizationFilters != null) { | ||
for (String categorizationFilter : categorizationFilters) { | ||
Map<String, Object> charFilter = new HashMap<>(); | ||
charFilter.put("type", "pattern_replace"); | ||
charFilter.put("pattern", categorizationFilter); | ||
addCharFilter(charFilter); | ||
} | ||
} | ||
return this; | ||
} | ||
|
||
public Builder setTokenizer(String tokenizer) { | ||
this.tokenizer = new NameOrDefinition(tokenizer); | ||
return this; | ||
|
@@ -331,6 +344,19 @@ public Builder addTokenFilter(Map<String, Object> tokenFilter) { | |
return this; | ||
} | ||
|
||
Builder addDateWordsTokenFilter() { | ||
Map<String, Object> tokenFilter = new HashMap<>(); | ||
tokenFilter.put("type", "stop"); | ||
tokenFilter.put("stopwords", Arrays.asList( | ||
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", | ||
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", | ||
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", | ||
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", | ||
"GMT", "UTC")); | ||
addTokenFilter(tokenFilter); | ||
return this; | ||
} | ||
|
||
/** | ||
* Create a config validating only structure, not exact analyzer/tokenizer/filter names | ||
*/ | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at
JobUpdate.java
, this is still valid. Why is it removed here in the test?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work any more, because you cannot configure both categorization filters and a categorization analyzer, and now every newly created job has a categorization analyzer.
You cannot update that categorization analyzer because that could completely change the way categorization is done, and ruin the stability of the categories.
Arguably we should never have let people update the categorization filters for the same reason. It's like we don't allow people to change the detectors after a job was created - if they change something so fundamental to what the job is doing then it's not really the same job after the update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, then it seems later PR should deprecate setting them and then we can remove it in 8.