
[ML] Make ml_standard tokenizer the default for new categorization jobs #72805

Merged

Conversation

droberts195
Contributor

@droberts195 droberts195 commented May 6, 2021

Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
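The splitting behaviour described above can be illustrated with plain Java string handling. This is not the real `ml_classic` tokenizer, just a rough sketch of the rule it follows (splitting on non-word characters such as slashes and colons), to show why a URL fragments into several tokens:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerSplitDemo {
    public static void main(String[] args) {
        String message = "https://example.com/var/log/app.log";
        // Splitting on runs of non-word, non-dot characters (which
        // includes '/' and ':') fragments the URL into several tokens,
        // roughly mimicking the ml_classic behaviour described above.
        List<String> classicStyle = Arrays.asList(message.split("[^\\w.]+"));
        System.out.println(classicStyle);
        // -> [https, example.com, var, log, app.log]
        // An ml_standard-style tokenizer would instead aim to keep the
        // whole URL as a single token.
    }
}
```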

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label May 6, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)


Map<String, Object> firstLineOnlyCharFilter = new HashMap<>();
firstLineOnlyCharFilter.put("type", "pattern_replace");
firstLineOnlyCharFilter.put("pattern", "\\\\n.*");
Member

I wonder if we should be concerned if for some reason the first line is blank (e.g. the message in _source starts with \n)?

Member

Just tested this, and it doesn't work. I think it should be firstLineOnlyCharFilter.put("pattern", "\\n.*");

Member

I mean, it didn't work for logs like:

log line 1
log line 2

I think the double escape looks for the literal characters \n.
So, it would probably work on logs like

log line 1\nlog line 2
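The escaping point in this thread can be checked with plain java.util.regex: in Java source, `"\\n"` is the regex `\n` (matches a real newline), while `"\\\\n"` is the regex `\\n` (matches a literal backslash followed by `n`). A minimal demo:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        String multiLine = "log line 1\nlog line 2";
        // "\\n" in Java source is the regex \n, matching a real newline,
        // so everything from the newline onward is stripped.
        String singleEscape = multiLine.replaceAll("\\n.*", "");
        System.out.println(singleEscape); // -> log line 1
        // "\\\\n" is the regex \\n, matching a literal backslash + 'n',
        // which this string does not contain, so nothing is stripped.
        String doubleEscape = multiLine.replaceAll("\\\\n.*", "");
        System.out.println(doubleEscape.equals(multiLine)); // -> true
    }
}
```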

@@ -588,14 +588,13 @@ public void testUpdateJob() throws Exception {
.setDescription("My description") // <2>
.setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
.setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
.setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
Member

Looking at JobUpdate.java, this is still valid. Why is it removed here in the test?

Contributor Author

It doesn't work any more, because you cannot configure both categorization filters and a categorization analyzer, and now every newly created job has a categorization analyzer.

You cannot update that categorization analyzer because that could completely change the way categorization is done, and ruin the stability of the categories.

Arguably we should never have let people update the categorization filters for the same reason. It's like we don't allow people to change the detectors after a job was created - if they change something so fundamental to what the job is doing then it's not really the same job after the update.

Member

Cool, then it seems a later PR should deprecate setting them, and then we can remove it in 8.

"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"));
builder.addTokenFilter(tokenFilter);
return new CategorizationAnalyzerConfig.Builder()
Member

This builder format is so much nicer++

Comment on lines +241 to +242
} else if (analysisConfig.getCategorizationFieldName() != null
&& minNodeVersion.onOrAfter(MIN_NODE_VERSION_FOR_STANDARD_CATEGORIZATION_ANALYZER)) {
Member

This does seem like a good option.

I was wondering, though, whether it was considered to use the Job.version as the predicate for which default categorization to use?

Do we ever update Job.version?

Ah, looking at MlConfigMigrator

        // These jobs cannot be opened, we rely on the missing version
        // to indicate this.
        // See TransportOpenJobAction.validate()
        if (jobVersion != null) {
            builder.setJobVersion(Version.CURRENT);
        }

I suppose that job.version is indeed mutable :/

@droberts195 droberts195 merged commit 0059c59 into elastic:master Jun 1, 2021
@droberts195 droberts195 deleted the ml_standard_tokenizer_for_new_cat_jobs branch June 1, 2021 14:11
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jun 1, 2021
Backport of elastic#72805
droberts195 added a commit that referenced this pull request Jun 2, 2021
…bs (#73605)

Backport of #72805
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v7.14.0 v8.0.0-alpha1

Successfully merging this pull request may close these issues.

[ML] Higher weighting for tokens at the beginning of messages during categorization
4 participants