[ML] Make ml_standard tokenizer the default for new categorization jobs
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so it creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
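
For illustration, the two tokenizers can be compared with the _analyze API. This is a hedged sketch: the sample message is made up and the exact token boundaries are illustrative rather than verified output.

GET _analyze
{
  "tokenizer": "ml_standard",
  "text": "Failed to read /var/log/app/server.log from https://example.com/status"
}

Running the same request with "tokenizer": "ml_classic" would be expected to split the path and the URL on their slashes and colons into several tokens, whereas ml_standard should keep each of them as a single token.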

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config, and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
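
For example, a minimal job validation request that keeps the old tokenizer might look like the following sketch (everything apart from the categorization_analyzer is illustrative):

POST _ml/anomaly_detectors/_validate
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "detectors": [
      { "function": "count", "by_field_name": "mlcategory" }
    ],
    "categorization_analyzer": {
      "tokenizer": "ml_classic"
    }
  },
  "data_description": { "time_field": "@timestamp" }
}

Because this analyzer is given explicitly and does not list first_non_blank_line, it also opts out of the new char filter, as described next.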

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
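
As a sketch of the combined defaults, a job created after the upgrade with only "categorization_filters": ["\\[.*?\\]"] (an illustrative filter) and no categorization_analyzer should therefore end up with an effective analyzer along these lines:

"categorization_analyzer": {
  "char_filter": [
    "first_non_blank_line",
    { "type": "pattern_replace", "pattern": "\\[.*?\\]" }
  ],
  "tokenizer": "ml_standard"
}

plus the stop token filter for day and month words that addDateWordsTokenFilter adds in the code changes below.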

Backport of elastic#72805
droberts195 committed Jun 1, 2021
1 parent 8c0cd75 commit 7db6980
Showing 21 changed files with 686 additions and 97 deletions.
@@ -588,14 +588,13 @@ public void testUpdateJob() throws Exception {
.setDescription("My description") // <2>
.setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
.setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
.setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
.setDetectorUpdates(Arrays.asList(detectorUpdate)) // <6>
.setGroups(Arrays.asList("job-group-1")) // <7>
.setResultsRetentionDays(10L) // <8>
.setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <9>
.setModelSnapshotRetentionDays(7L) // <10>
.setCustomSettings(customSettings) // <11>
.setRenormalizationWindowDays(3L) // <12>
.setDetectorUpdates(Arrays.asList(detectorUpdate)) // <5>
.setGroups(Arrays.asList("job-group-1")) // <6>
.setResultsRetentionDays(10L) // <7>
.setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <8>
.setModelSnapshotRetentionDays(7L) // <9>
.setCustomSettings(customSettings) // <10>
.setRenormalizationWindowDays(3L) // <11>
.build();
// end::update-job-options

15 changes: 7 additions & 8 deletions docs/java-rest/high-level/ml/update-job.asciidoc
@@ -35,14 +35,13 @@ include-tagged::{doc-tests-file}[{api}-options]
<2> Updated description.
<3> Updated analysis limits.
<4> Updated background persistence interval.
<5> Updated analysis config's categorization filters.
<6> Updated detectors through the `JobUpdate.DetectorUpdate` object.
<7> Updated group membership.
<8> Updated result retention.
<9> Updated model plot configuration.
<10> Updated model snapshot retention setting.
<11> Updated custom settings.
<12> Updated renormalization window.
<5> Updated detectors through the `JobUpdate.DetectorUpdate` object.
<6> Updated group membership.
<7> Updated result retention.
<8> Updated model plot configuration.
<9> Updated model snapshot retention setting.
<10> Updated custom settings.
<11> Updated renormalization window.

Included with these options are specific optional `JobUpdate.DetectorUpdate` updates.
["source","java",subs="attributes,callouts,macros"]
@@ -49,7 +49,10 @@ This is a possible response:
"defaults" : {
"anomaly_detectors" : {
"categorization_analyzer" : {
"tokenizer" : "ml_classic",
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{
"type" : "stop",
@@ -21,8 +21,8 @@ of possible messages:
Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data. It would create so many
categories that they couldn't be handled effectively. Categorization is _not_
natural language processing (NLP).
Expand All @@ -32,7 +32,7 @@ volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.

In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:
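
Below is a hedged sketch of this configuration; the job ID, bucket span and timestamp field are illustrative:

PUT _ml/anomaly_detectors/it_ops_app_logs
{
  "analysis_config": {
    "bucket_span": "30m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}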
@@ -69,7 +69,7 @@ do not specify this keyword in one of those properties, the API request fails.
====


You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]
@@ -105,7 +105,7 @@ SQL statement from the categorization algorithm.
If you enable per-partition categorization, categories are determined
independently for each partition. For example, if your data includes messages
from multiple types of logs from different applications, you can use a field
like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
`partition_field_name` and categorize the messages for each type of log
separately.
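
A hedged sketch of the relevant parts of the analysis config for this scenario (field names follow the examples above):

"analysis_config": {
  "categorization_field_name": "message",
  "per_partition_categorization": {
    "enabled": true
  },
  "detectors": [
    {
      "function": "count",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset"
    }
  ]
}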

@@ -116,7 +116,7 @@ create or update a job and enable per-partition categorization, it fails.

When per-partition categorization is enabled, you can also take advantage of a
`stop_on_warn` configuration option. If the categorization status for a
partition changes to `warn`, it doesn't categorize well and can cause a lot of
unnecessary resource usage. When you set `stop_on_warn` to `true`, the job stops
analyzing these problematic partitions. You can thus avoid an ongoing
performance cost for partitions that are unsuitable for categorization.
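
In the sketch above, this corresponds to setting the flag alongside `enabled`:

"per_partition_categorization": {
  "enabled": true,
  "stop_on_warn": true
}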
@@ -128,7 +128,7 @@ performance cost for partitions that are unsuitable for categorization.
Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
@@ -140,7 +140,7 @@ image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in K

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
@@ -157,7 +157,10 @@ POST _ml/anomaly_detectors/_validate
{
"analysis_config" : {
"categorization_analyzer" : {
"tokenizer" : "ml_classic",
"char_filter" : [
"first_non_blank_line"
],
"tokenizer" : "ml_standard",
"filter" : [
{ "type" : "stop", "stopwords": [
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -182,8 +185,8 @@ POST _ml/anomaly_detectors/_validate
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.

The `ml_classic` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
The `ml_standard` tokenizer and the day and month stopword filter are more or
less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

@@ -201,15 +204,18 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
"detector_description": "Unusual message counts"
}],
"categorization_analyzer":{
"char_filter" : [
"first_non_blank_line" <1>
],
"tokenizer": {
"type" : "simple_pattern_split",
"pattern" : "[^-0-9A-Za-z_.]+" <1>
"pattern" : "[^-0-9A-Za-z_./]+" <2>
},
"filter": [
{ "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
{ "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
{ "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
{ "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
{ "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
{ "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
{ "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
{ "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
{ "type" : "stop", "stopwords": [
"",
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -232,17 +238,20 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
----------------------------------
// TEST[skip:needs-licence]

<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
<2> By default, categorization ignores tokens that begin with a digit.
<3> By default, categorization also ignores tokens that are hexadecimal numbers.
<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
<5> Underscores, hyphens, and dots are also removed from the end of tokens.
<1> Only consider the first non-blank line of the message for categorization purposes.
<2> Tokens basically consist of hyphens, digits, letters, underscores, dots and slashes.
<3> By default, categorization ignores tokens that begin with a digit.
<4> By default, categorization also ignores tokens that are hexadecimal numbers.
<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
<6> Underscores, hyphens, and dots are also removed from the end of tokens.

The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_classic` tokenizer is several times
faster. The difference in behavior is that this custom analyzer does not include
accented letters in tokens whereas the `ml_classic` tokenizer does, although
that could be fixed by using more complex regular expressions.
The key difference between the default `categorization_analyzer` and this
example analyzer is that using the `ml_standard` tokenizer is several times
faster. The `ml_standard` tokenizer also tries to preserve URLs, Windows paths
and email addresses as single tokens. Another difference in behavior is that
this custom analyzer does not include accented letters in tokens whereas the
`ml_standard` tokenizer does, although that could be fixed by using more complex
regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
16 changes: 11 additions & 5 deletions docs/reference/ml/ml-shared.asciidoc
@@ -1587,11 +1587,17 @@ end::timestamp-results[]
tag::tokenizer[]
The name or definition of the <<analysis-tokenizers,tokenizer>> to use after
character filters are applied. This property is compulsory if
`categorization_analyzer` is specified as an object. Machine learning provides a
tokenizer called `ml_classic` that tokenizes in the same way as the
non-customizable tokenizer in older versions of the product. If you want to use
that tokenizer but change the character or token filters, specify
`"tokenizer": "ml_classic"` in your `categorization_analyzer`.
`categorization_analyzer` is specified as an object. Machine learning provides
a tokenizer called `ml_standard` that tokenizes in a way that has been
determined to produce good categorization results on a variety of log
file formats for logs in English. If you want to use that tokenizer but
change the character or token filters, specify `"tokenizer": "ml_standard"`
in your `categorization_analyzer`. Additionally, the `ml_classic` tokenizer
is available, which tokenizes in the same way as the non-customizable
tokenizer in old versions of the product (before 6.2). `ml_classic` was
the default categorization tokenizer in versions 6.2 to 7.13, so if you
need categorization identical to the default for jobs created in these
versions, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
end::tokenizer[]

tag::total-by-field-count[]
@@ -145,38 +145,39 @@ static CategorizationAnalyzerConfig buildFromXContentFragment(XContentParser par
}

/**
* Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the ML C++
* code do. This is the default analyzer for categorization to ensure that people upgrading from previous versions
* Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the original ML
* C++ code do. This is the default analyzer for categorization to ensure that people upgrading from old versions
* get the same behaviour from their categorization jobs before and after upgrade.
* @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
* @return The default categorization analyzer.
*/
public static CategorizationAnalyzerConfig buildDefaultCategorizationAnalyzer(List<String> categorizationFilters) {

CategorizationAnalyzerConfig.Builder builder = new CategorizationAnalyzerConfig.Builder();

if (categorizationFilters != null) {
for (String categorizationFilter : categorizationFilters) {
Map<String, Object> charFilter = new HashMap<>();
charFilter.put("type", "pattern_replace");
charFilter.put("pattern", categorizationFilter);
builder.addCharFilter(charFilter);
}
}

builder.setTokenizer("ml_classic");

Map<String, Object> tokenFilter = new HashMap<>();
tokenFilter.put("type", "stop");
tokenFilter.put("stopwords", Arrays.asList(
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"));
builder.addTokenFilter(tokenFilter);
return new CategorizationAnalyzerConfig.Builder()
.addCategorizationFilters(categorizationFilters)
.setTokenizer("ml_classic")
.addDateWordsTokenFilter()
.build();
}

return builder.build();
/**
* Create a <code>categorization_analyzer</code> that will be used for newly created jobs where no categorization
* analyzer is explicitly provided. This analyzer differs from the default one in that it uses the <code>ml_standard</code>
* tokenizer instead of the <code>ml_classic</code> tokenizer, and it only considers the first non-blank line of each message.
* This analyzer is <em>not</em> used for jobs that specify no categorization analyzer, as that would break jobs that were
* originally run in older versions. Instead, this analyzer is explicitly added to newly created jobs once the entire cluster
* is upgraded to version 7.14 or above.
* @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
* @return The standard categorization analyzer.
*/
public static CategorizationAnalyzerConfig buildStandardCategorizationAnalyzer(List<String> categorizationFilters) {

return new CategorizationAnalyzerConfig.Builder()
.addCharFilter("first_non_blank_line")
.addCategorizationFilters(categorizationFilters)
.setTokenizer("ml_standard")
.addDateWordsTokenFilter()
.build();
}

private final String analyzer;
@@ -311,6 +312,18 @@ public Builder addCharFilter(Map<String, Object> charFilter) {
return this;
}

public Builder addCategorizationFilters(List<String> categorizationFilters) {
if (categorizationFilters != null) {
for (String categorizationFilter : categorizationFilters) {
Map<String, Object> charFilter = new HashMap<>();
charFilter.put("type", "pattern_replace");
charFilter.put("pattern", categorizationFilter);
addCharFilter(charFilter);
}
}
return this;
}

public Builder setTokenizer(String tokenizer) {
this.tokenizer = new NameOrDefinition(tokenizer);
return this;
@@ -331,6 +344,19 @@ public Builder addTokenFilter(Map<String, Object> tokenFilter) {
return this;
}

Builder addDateWordsTokenFilter() {
Map<String, Object> tokenFilter = new HashMap<>();
tokenFilter.put("type", "stop");
tokenFilter.put("stopwords", Arrays.asList(
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
"January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"GMT", "UTC"));
addTokenFilter(tokenFilter);
return this;
}

/**
* Create a config validating only structure, not exact analyzer/tokenizer/filter names
*/
4 changes: 3 additions & 1 deletion x-pack/plugin/ml/qa/ml-with-security/build.gradle
@@ -17,9 +17,11 @@ restResources {

tasks.named("yamlRestTest").configure {
systemProperty 'tests.rest.blacklist', [
// Remove this test because it doesn't call an ML endpoint and we don't want
// Remove these tests because they don't call an ML endpoint and we don't want
// to grant extra permissions to the users used in this test suite
'ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code',
'ml/ml_standard_analyze/Test analyze API with the standard 7.14 ML analyzer',
'ml/ml_standard_analyze/Test 7.14 analyzer with blank lines',
// Remove tests that are expected to throw an exception, because we cannot then
// know whether to expect an authorization exception or a validation exception
'ml/calendar_crud/Test get calendar given missing',
@@ -43,6 +43,7 @@
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.NodeEnvironment;
import org.elasticsearch.index.analysis.CharFilterFactory;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.SystemIndexDescriptor;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
@@ -249,6 +250,8 @@
import org.elasticsearch.xpack.ml.job.JobManager;
import org.elasticsearch.xpack.ml.job.JobManagerHolder;
import org.elasticsearch.xpack.ml.job.UpdateJobProcessNotifier;
import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilter;
import org.elasticsearch.xpack.ml.job.categorization.FirstNonBlankLineCharFilterFactory;
import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizer;
import org.elasticsearch.xpack.ml.job.categorization.MlClassicTokenizerFactory;
import org.elasticsearch.xpack.ml.job.categorization.MlStandardTokenizer;
@@ -1085,6 +1088,10 @@ public List<ExecutorBuilder<?>> getExecutorBuilders(Settings settings) {
return Arrays.asList(jobComms, utility, datafeed);
}

public Map<String, AnalysisProvider<CharFilterFactory>> getCharFilters() {
return Collections.singletonMap(FirstNonBlankLineCharFilter.NAME, FirstNonBlankLineCharFilterFactory::new);
}

@Override
public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
Map<String, AnalysisProvider<TokenizerFactory>> tokenizers = new HashMap<>();