[ML] Make ml_standard tokenizer the default for new categorization jobs #72805

Merged

Changes from all commits (20 commits):
29f759b  [ML] Make ml_standard tokenizer the default for new categorization jobs (droberts195, May 6, 2021)
94414bf  Fixing tests (droberts195, May 6, 2021)
c016c4e  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, May 11, 2021)
98babb0  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, May 14, 2021)
61529b5  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, May 24, 2021)
7727f9b  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, May 27, 2021)
512b7ae  Incorporating first line only into default ML analyzer (droberts195, May 27, 2021)
2b4f6b4  Fix tests (droberts195, May 27, 2021)
586d937  Don't run analyze rest test in security suite (droberts195, May 27, 2021)
5567bf4  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, May 27, 2021)
39bb837  Fixing first line filter and adding tests (droberts195, May 28, 2021)
6eebe22  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, May 28, 2021)
e54acb1  Fixes for an expected value beginning with a dollar sign (droberts195, May 28, 2021)
5678c96  Fix more YAML test regex problems (droberts195, May 28, 2021)
14df66a  Switching to a dedicated char filter (droberts195, Jun 1, 2021)
9619c62  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, Jun 1, 2021)
4d25e97  Bug fixes (droberts195, Jun 1, 2021)
8f711b2  Merge branch 'master' into ml_standard_tokenizer_for_new_cat_jobs (droberts195, Jun 1, 2021)
724e25a  Improve comment and remove unused import (droberts195, Jun 1, 2021)
bdbb6e9  Skip analyzer tests in ml-with-security (droberts195, Jun 1, 2021)
@@ -588,14 +588,13 @@ public void testUpdateJob() throws Exception {
        .setDescription("My description") // <2>
        .setAnalysisLimits(new AnalysisLimits(1000L, null)) // <3>
        .setBackgroundPersistInterval(TimeValue.timeValueHours(3)) // <4>
-       .setCategorizationFilters(Arrays.asList("categorization-filter")) // <5>
Member:
Looking at JobUpdate.java, this is still valid. Why is it removed here in the test?

Contributor Author:
It doesn't work any more, because you cannot configure both categorization filters and a categorization analyzer, and now every newly created job has a categorization analyzer.

You cannot update that categorization analyzer because that could completely change the way categorization is done, and ruin the stability of the categories.

Arguably we should never have let people update the categorization filters for the same reason. It's like how we don't allow people to change the detectors after a job was created: if they change something so fundamental to what the job is doing, then it's not really the same job after the update.

Member:
Cool, then it seems a later PR should deprecate setting them, and then we can remove it in 8.
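
For context, a minimal sketch of the combination that is now rejected; the job name, filter pattern, and field names here are illustrative assumptions:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/example_job
{
  "analysis_config" : {
    "bucket_span" : "15m",
    "categorization_field_name" : "message",
    "categorization_filters" : [ "SQL: .*" ],
    "categorization_analyzer" : { "tokenizer" : "ml_standard" },
    "detectors" : [{ "function" : "count", "by_field_name" : "mlcategory" }]
  },
  "data_description" : { "time_field" : "@timestamp" }
}
----------------------------------

The request fails validation because `categorization_filters` and `categorization_analyzer` cannot be combined, and newly created jobs now always have a `categorization_analyzer`.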

-       .setDetectorUpdates(Arrays.asList(detectorUpdate)) // <6>
-       .setGroups(Arrays.asList("job-group-1")) // <7>
-       .setResultsRetentionDays(10L) // <8>
-       .setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <9>
-       .setModelSnapshotRetentionDays(7L) // <10>
-       .setCustomSettings(customSettings) // <11>
-       .setRenormalizationWindowDays(3L) // <12>
+       .setDetectorUpdates(Arrays.asList(detectorUpdate)) // <5>
+       .setGroups(Arrays.asList("job-group-1")) // <6>
+       .setResultsRetentionDays(10L) // <7>
+       .setModelPlotConfig(new ModelPlotConfig(true, null, true)) // <8>
+       .setModelSnapshotRetentionDays(7L) // <9>
+       .setCustomSettings(customSettings) // <10>
+       .setRenormalizationWindowDays(3L) // <11>
        .build();
        // end::update-job-options

15 changes: 7 additions & 8 deletions docs/java-rest/high-level/ml/update-job.asciidoc
@@ -35,14 +35,13 @@ include-tagged::{doc-tests-file}[{api}-options]
<2> Updated description.
<3> Updated analysis limits.
<4> Updated background persistence interval.
-<5> Updated analysis config's categorization filters.
-<6> Updated detectors through the `JobUpdate.DetectorUpdate` object.
-<7> Updated group membership.
-<8> Updated result retention.
-<9> Updated model plot configuration.
-<10> Updated model snapshot retention setting.
-<11> Updated custom settings.
-<12> Updated renormalization window.
+<5> Updated detectors through the `JobUpdate.DetectorUpdate` object.
+<6> Updated group membership.
+<7> Updated result retention.
+<8> Updated model plot configuration.
+<9> Updated model snapshot retention setting.
+<10> Updated custom settings.
+<11> Updated renormalization window.

Included with these options are specific optional `JobUpdate.DetectorUpdate` updates.
["source","java",subs="attributes,callouts,macros"]
@@ -49,7 +49,10 @@ This is a possible response:
  "defaults" : {
    "anomaly_detectors" : {
      "categorization_analyzer" : {
-       "tokenizer" : "ml_classic",
+       "char_filter" : [
+         "first_non_blank_line"
+       ],
+       "tokenizer" : "ml_standard",
        "filter" : [
          {
            "type" : "stop",
@@ -21,8 +21,8 @@ of possible messages:
Categorization is tuned to work best on data like log messages by taking token
order into account, including stop words, and not considering synonyms in its
analysis. Complete sentences in human communication or literary text (for
example email, wiki pages, prose, or other human-generated content) can be
extremely diverse in structure. Since categorization is tuned for machine data,
it gives poor results for human-generated data. It would create so many
categories that they couldn't be handled effectively. Categorization is _not_
natural language processing (NLP).
@@ -32,7 +32,7 @@ volume and pattern is normal for each category over time. You can then detect
anomalies and surface rare events or unusual types of messages by using
<<ml-count-functions,count>> or <<ml-rare-functions,rare>> functions.

In {kib}, there is a categorization wizard to help you create this type of
{anomaly-job}. For example, the following job generates categories from the
contents of the `message` field and uses the count function to determine when
certain categories are occurring at anomalous rates:
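
A minimal sketch of such a job; the job ID, bucket span, and time field here are illustrative assumptions:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/log_message_categories
{
  "analysis_config" : {
    "bucket_span" : "15m",
    "categorization_field_name" : "message",
    "detectors" : [{
      "function" : "count",
      "by_field_name" : "mlcategory",
      "detector_description" : "Unusual message counts"
    }]
  },
  "data_description" : {
    "time_field" : "@timestamp"
  }
}
----------------------------------

Setting `by_field_name` to the `mlcategory` keyword is what ties the count detector to the categories that categorization produces.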
@@ -69,7 +69,7 @@ do not specify this keyword in one of those properties, the API request fails.
====


You can use the **Anomaly Explorer** in {kib} to view the analysis results:

[role="screenshot"]
image::images/ml-category-anomalies.jpg["Categorization results in the Anomaly Explorer"]
@@ -105,7 +105,7 @@ SQL statement from the categorization algorithm.
If you enable per-partition categorization, categories are determined
independently for each partition. For example, if your data includes messages
from multiple types of logs from different applications, you can use a field
like the ECS {ecs-ref}/ecs-event.html[`event.dataset` field] as the
`partition_field_name` and categorize the messages for each type of log
separately.

@@ -116,7 +116,7 @@ create or update a job and enable per-partition categorization, it fails.

When per-partition categorization is enabled, you can also take advantage of a
`stop_on_warn` configuration option. If the categorization status for a
partition changes to `warn`, it doesn't categorize well and can cause a lot of
unnecessary resource usage. When you set `stop_on_warn` to `true`, the job stops
analyzing these problematic partitions. You can thus avoid an ongoing
performance cost for partitions that are unsuitable for categorization.
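
A sketch of an `analysis_config` that combines both options; the ECS `event.dataset` field follows the example above, and the other values are illustrative assumptions:

[source,js]
----------------------------------
"analysis_config" : {
  "bucket_span" : "15m",
  "categorization_field_name" : "message",
  "per_partition_categorization" : {
    "enabled" : true,
    "stop_on_warn" : true
  },
  "detectors" : [{
    "function" : "count",
    "by_field_name" : "mlcategory",
    "partition_field_name" : "event.dataset"
  }]
}
----------------------------------

All detectors that use `mlcategory` must use the same `partition_field_name` when per-partition categorization is enabled.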
@@ -128,7 +128,7 @@ performance cost for partitions that are unsuitable for categorization.
Categorization uses English dictionary words to identify log message categories.
By default, it also uses English tokenization rules. For this reason, if you use
the default categorization analyzer, only English language log messages are
supported, as described in the <<ml-limitations>>.

If you use the categorization wizard in {kib}, you can see which categorization
analyzer it uses and highlighted examples of the tokens that it identifies. You
@@ -140,7 +140,7 @@ image::images/ml-category-analyzer.jpg["Editing the categorization analyzer in K

The categorization analyzer can refer to a built-in {es} analyzer or a
combination of zero or more character filters, a tokenizer, and zero or more
token filters. In this example, adding a
{ref}/analysis-pattern-replace-charfilter.html[`pattern_replace` character filter]
achieves exactly the same behavior as the `categorization_filters` job
configuration option described earlier. For more details about these properties,
@@ -157,7 +157,10 @@ POST _ml/anomaly_detectors/_validate
{
  "analysis_config" : {
    "categorization_analyzer" : {
-     "tokenizer" : "ml_classic",
+     "char_filter" : [
+       "first_non_blank_line"
+     ],
+     "tokenizer" : "ml_standard",
      "filter" : [
        { "type" : "stop", "stopwords": [
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -182,8 +185,8 @@ POST _ml/anomaly_detectors/_validate
If you specify any part of the `categorization_analyzer`, however, any omitted
sub-properties are _not_ set to default values.
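
For example, in this sketch only the tokenizer is specified, so the resulting analyzer applies no character filters and no stop filter at all:

[source,js]
----------------------------------
"categorization_analyzer" : {
  "tokenizer" : "ml_standard"
}
----------------------------------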

-The `ml_classic` tokenizer and the day and month stopword filter are more or
-less equivalent to the following analyzer, which is defined using only built-in
+The `ml_standard` tokenizer and the day and month stopword filter are more or
+less equivalent to the following analyzer, which is defined using only built-in
{es} {ref}/analysis-tokenizers.html[tokenizers] and
{ref}/analysis-tokenfilters.html[token filters]:

@@ -201,15 +204,18 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
      "detector_description": "Unusual message counts"
    }],
    "categorization_analyzer":{
+     "char_filter" : [
+       "first_non_blank_line" <1>
+     ],
      "tokenizer": {
        "type" : "simple_pattern_split",
-       "pattern" : "[^-0-9A-Za-z_.]+" <1>
+       "pattern" : "[^-0-9A-Za-z_./]+" <2>
      },
      "filter": [
-       { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <2>
-       { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <3>
-       { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <4>
-       { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <5>
+       { "type" : "pattern_replace", "pattern": "^[0-9].*" }, <3>
+       { "type" : "pattern_replace", "pattern": "^[-0-9A-Fa-f.]+$" }, <4>
+       { "type" : "pattern_replace", "pattern": "^[^0-9A-Za-z]+" }, <5>
+       { "type" : "pattern_replace", "pattern": "[^0-9A-Za-z]+$" }, <6>
        { "type" : "stop", "stopwords": [
          "",
          "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
@@ -232,17 +238,20 @@ PUT _ml/anomaly_detectors/it_ops_new_logs3
----------------------------------
// TEST[skip:needs-licence]

-<1> Tokens basically consist of hyphens, digits, letters, underscores and dots.
-<2> By default, categorization ignores tokens that begin with a digit.
-<3> By default, categorization also ignores tokens that are hexadecimal numbers.
-<4> Underscores, hyphens, and dots are removed from the beginning of tokens.
-<5> Underscores, hyphens, and dots are also removed from the end of tokens.
+<1> Only consider the first non-blank line of the message for categorization purposes.
+<2> Tokens basically consist of hyphens, digits, letters, underscores, dots and slashes.
+<3> By default, categorization ignores tokens that begin with a digit.
+<4> By default, categorization also ignores tokens that are hexadecimal numbers.
+<5> Underscores, hyphens, and dots are removed from the beginning of tokens.
+<6> Underscores, hyphens, and dots are also removed from the end of tokens.
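
The effect of the `first_non_blank_line` character filter can be checked with the analyze API; this is a sketch with illustrative sample text:

[source,console]
----------------------------------
GET _analyze
{
  "char_filter" : [ "first_non_blank_line" ],
  "tokenizer" : "ml_standard",
  "text" : "\nERROR: connection refused\n\tat example.Client.connect(Client.java:42)\n"
}
----------------------------------

Only tokens from `ERROR: connection refused` should come back, because everything after the first non-blank line is stripped before tokenization.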

-The key difference between the default `categorization_analyzer` and this
-example analyzer is that using the `ml_classic` tokenizer is several times
-faster. The difference in behavior is that this custom analyzer does not include
-accented letters in tokens whereas the `ml_classic` tokenizer does, although
-that could be fixed by using more complex regular expressions.
+The key difference between the default `categorization_analyzer` and this
+example analyzer is that using the `ml_standard` tokenizer is several times
+faster. The `ml_standard` tokenizer also tries to preserve URLs, Windows paths
+and email addresses as single tokens. Another difference in behavior is that
+this custom analyzer does not include accented letters in tokens whereas the
+`ml_standard` tokenizer does, although that could be fixed by using more complex
+regular expressions.

If you are categorizing non-English messages in a language where words are
separated by spaces, you might get better results if you change the day or month
16 changes: 11 additions & 5 deletions docs/reference/ml/ml-shared.asciidoc
@@ -1592,11 +1592,17 @@ end::timestamp-results[]
tag::tokenizer[]
The name or definition of the <<analysis-tokenizers,tokenizer>> to use after
character filters are applied. This property is compulsory if
-`categorization_analyzer` is specified as an object. Machine learning provides a
-tokenizer called `ml_classic` that tokenizes in the same way as the
-non-customizable tokenizer in older versions of the product. If you want to use
-that tokenizer but change the character or token filters, specify
-`"tokenizer": "ml_classic"` in your `categorization_analyzer`.
+`categorization_analyzer` is specified as an object. Machine learning provides
+a tokenizer called `ml_standard` that tokenizes in a way that has been
+determined to produce good categorization results on a variety of log
+file formats for logs in English. If you want to use that tokenizer but
+change the character or token filters, specify `"tokenizer": "ml_standard"`
+in your `categorization_analyzer`. Additionally, the `ml_classic` tokenizer
+is available, which tokenizes in the same way as the non-customizable
+tokenizer in old versions of the product (before 6.2). `ml_classic` was
+the default categorization tokenizer in versions 6.2 to 7.13, so if you
+need categorization identical to the default for jobs created in these
+versions, specify `"tokenizer": "ml_classic"` in your `categorization_analyzer`.
end::tokenizer[]

tag::total-by-field-count[]
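
A sketch using the analyze API to compare the two tokenizers; the sample message is an illustrative assumption:

[source,console]
----------------------------------
GET _analyze
{
  "tokenizer" : "ml_standard",
  "text" : "Failed to fetch https://example.com/api/v1 for user@example.com"
}
----------------------------------

With `ml_standard` the URL and the email address should each be returned as a single token; repeating the request with `"tokenizer" : "ml_classic"` splits them into fragments.
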
3 changes: 3 additions & 0 deletions x-pack/plugin/build.gradle
@@ -117,9 +117,12 @@ tasks.named("yamlRestCompatTest").configure {
        'ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed',
        'ml/datafeeds_crud/Test update datafeed to point to missing job',
        'ml/job_cat_apis/Test cat anomaly detector jobs',
+       'ml/jobs_crud/Test update job',
        'ml/jobs_get_stats/Test get job stats after uploading data prompting the creation of some stats',
        'ml/jobs_get_stats/Test get job stats for closed job',
        'ml/jobs_get_stats/Test no exception on get job stats with missing index',
+       // TODO: the ml_info mute can be removed from master once the ml_standard tokenizer is in 7.x
+       'ml/ml_info/Test ml info',
        'ml/post_data/Test POST data job api, flush, close and verify DataCounts doc',
        'ml/post_data/Test flush with skip_time',
        'ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled',
@@ -145,38 +145,39 @@ static CategorizationAnalyzerConfig buildFromXContentFragment(XContentParser par
    }

    /**
-     * Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the ML C++
-     * code do. This is the default analyzer for categorization to ensure that people upgrading from previous versions
+     * Create a <code>categorization_analyzer</code> that mimics what the tokenizer and filters built into the original ML
+     * C++ code do. This is the default analyzer for categorization to ensure that people upgrading from old versions
     * get the same behaviour from their categorization jobs before and after upgrade.
     * @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
     * @return The default categorization analyzer.
     */
    public static CategorizationAnalyzerConfig buildDefaultCategorizationAnalyzer(List<String> categorizationFilters) {

-        CategorizationAnalyzerConfig.Builder builder = new CategorizationAnalyzerConfig.Builder();
-
-        if (categorizationFilters != null) {
-            for (String categorizationFilter : categorizationFilters) {
-                Map<String, Object> charFilter = new HashMap<>();
-                charFilter.put("type", "pattern_replace");
-                charFilter.put("pattern", categorizationFilter);
-                builder.addCharFilter(charFilter);
-            }
-        }
-
-        builder.setTokenizer("ml_classic");
-
-        Map<String, Object> tokenFilter = new HashMap<>();
-        tokenFilter.put("type", "stop");
-        tokenFilter.put("stopwords", Arrays.asList(
-            "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
-            "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
-            "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
-            "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
-            "GMT", "UTC"));
-        builder.addTokenFilter(tokenFilter);
+        return new CategorizationAnalyzerConfig.Builder()
Member:
This builder format is so much nicer++

+            .addCategorizationFilters(categorizationFilters)
+            .setTokenizer("ml_classic")
+            .addDateWordsTokenFilter()
+            .build();
+    }

-        return builder.build();
+    /**
+     * Create a <code>categorization_analyzer</code> that will be used for newly created jobs where no categorization
+     * analyzer is explicitly provided. This analyzer differs from the default one in that it uses the <code>ml_standard</code>
+     * tokenizer instead of the <code>ml_classic</code> tokenizer, and it only considers the first non-blank line of each message.
+     * This analyzer is <em>not</em> used for jobs that specify no categorization analyzer, as that would break jobs that were
+     * originally run in older versions. Instead, this analyzer is explicitly added to newly created jobs once the entire cluster
+     * is upgraded to version 7.14 or above.
+     * @param categorizationFilters Categorization filters (if any) from the <code>analysis_config</code>.
+     * @return The standard categorization analyzer.
+     */
+    public static CategorizationAnalyzerConfig buildStandardCategorizationAnalyzer(List<String> categorizationFilters) {
+
+        return new CategorizationAnalyzerConfig.Builder()
+            .addCharFilter("first_non_blank_line")
+            .addCategorizationFilters(categorizationFilters)
+            .setTokenizer("ml_standard")
+            .addDateWordsTokenFilter()
+            .build();
     }

    private final String analyzer;
@@ -311,6 +312,18 @@ public Builder addCharFilter(Map<String, Object> charFilter) {
        return this;
    }

+    public Builder addCategorizationFilters(List<String> categorizationFilters) {
+        if (categorizationFilters != null) {
+            for (String categorizationFilter : categorizationFilters) {
+                Map<String, Object> charFilter = new HashMap<>();
+                charFilter.put("type", "pattern_replace");
+                charFilter.put("pattern", categorizationFilter);
+                addCharFilter(charFilter);
+            }
+        }
+        return this;
+    }

    public Builder setTokenizer(String tokenizer) {
        this.tokenizer = new NameOrDefinition(tokenizer);
        return this;
@@ -331,6 +344,19 @@ public Builder addTokenFilter(Map<String, Object> tokenFilter) {
        return this;
    }

+    Builder addDateWordsTokenFilter() {
+        Map<String, Object> tokenFilter = new HashMap<>();
+        tokenFilter.put("type", "stop");
+        tokenFilter.put("stopwords", Arrays.asList(
+            "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
+            "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
+            "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December",
+            "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
+            "GMT", "UTC"));
+        addTokenFilter(tokenFilter);
+        return this;
+    }

    /**
     * Create a config validating only structure, not exact analyzer/tokenizer/filter names
     */
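
A usage sketch of the refactored builder chain; the filter pattern is illustrative, and the usual `java.util` imports (`Collections`, `List`) are assumed:

// Jobs created once the whole cluster is on 7.14+ get the ml_standard analyzer.
CategorizationAnalyzerConfig standard =
    CategorizationAnalyzerConfig.buildStandardCategorizationAnalyzer(
        Collections.singletonList("SQL: .*"));

// Jobs from older versions keep the ml_classic behaviour on upgrade.
CategorizationAnalyzerConfig classic =
    CategorizationAnalyzerConfig.buildDefaultCategorizationAnalyzer(
        Collections.singletonList("SQL: .*"));
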
4 changes: 3 additions & 1 deletion x-pack/plugin/ml/qa/ml-with-security/build.gradle
@@ -17,9 +17,11 @@ restResources {

tasks.named("yamlRestTest").configure {
    systemProperty 'tests.rest.blacklist', [
-        // Remove this test because it doesn't call an ML endpoint and we don't want
+        // Remove these tests because they don't call an ML endpoint and we don't want
         // to grant extra permissions to the users used in this test suite
         'ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code',
+        'ml/ml_standard_analyze/Test analyze API with the standard 7.14 ML analyzer',
+        'ml/ml_standard_analyze/Test 7.14 analyzer with blank lines',
         // Remove tests that are expected to throw an exception, because we cannot then
         // know whether to expect an authorization exception or a validation exception
         'ml/calendar_crud/Test get calendar given missing',