[ML] Reverse engineer Grok patterns from categorization results #30125
Conversation
It has been noted that the regexes we produce in our categorization results are not that far away from Grok patterns that could be used in Logstash to categorize messages at ingest time and do better field extraction for log formats that do not have out-of-the-box patterns.

This change adds a `grok_pattern` field to our GET categories API output. It's calculated by taking the regex and examples in the categorization result and applying a list of candidate Grok patterns to the bits in between the tokens that are considered to define the category.

This can currently be considered a prototype, as there is an outstanding question on how the functionality should work:

* Is calculating the Grok patterns on the fly the best thing to do? It might be better to calculate them when categorization results are created/updated, and store the patterns in a new type of ML results document. Then we could let users manually improve the patterns and remember their edits. But the decision here needs to tie in with the end-to-end story for this functionality. If the intended flow is `ML -> User edits in the Grok debugger -> Logstash config` then maybe there's no need for ML to remember the user edits.
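Since the description is fairly dense, here is a minimal self-contained sketch of the approach. The class name, the two-entry candidate list, and the bestCandidate heuristic are simplified stand-ins, not the PR's actual implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Toy version of the idea: wrap each wildcard run of the category regex in a
 * capturing group, collect what every example captured in each group, then
 * pick a candidate Grok pattern that matches all of a group's captures.
 */
public final class GrokSketch {

    // Hypothetical two-entry candidate list: Grok name -> regex to test with.
    // The real list is longer and ordered from most to least specific.
    private static final String[][] CANDIDATES = {
        { "%{IP:ipaddress}", "\\d{1,3}(\\.\\d{1,3}){3}" },
        { "%{NUMBER:field}", "\\d+" },
    };

    public static String grokFromCategory(String regex, List<String> examples) {
        // ".*?cat.+?sat.+?mat.*" -> "(.*?)cat(.+?)sat(.+?)mat(.*)"
        Pattern processor = Pattern.compile(regex.replaceAll("(\\.[*+]\\??)", "($1)"), Pattern.DOTALL);
        // The fixed tokens between the wildcard runs, e.g. ["", "cat", "sat", "mat"]
        // (category regexes always start with a wildcard, so element 0 is empty)
        String[] fixedBits = regex.split("\\.[*+]\\??");

        int groupCount = processor.matcher("").groupCount();
        List<List<String>> inBetweenBits = new ArrayList<>(groupCount);
        for (int i = 0; i < groupCount; ++i) {
            inBetweenBits.add(new ArrayList<>());
        }
        for (String example : examples) {
            Matcher matcher = processor.matcher(example);
            if (matcher.matches()) {
                for (int group = 1; group <= groupCount; ++group) {
                    inBetweenBits.get(group - 1).add(matcher.group(group));
                }
            }
        }

        StringBuilder grok = new StringBuilder();
        for (int i = 0; i < groupCount; ++i) {
            grok.append(bestCandidate(inBetweenBits.get(i)));
            if (i + 1 < fixedBits.length) {
                grok.append(fixedBits[i + 1]); // the literal token after this gap
            }
        }
        return grok.toString();
    }

    // Crude stand-in: use a candidate only if it matches every capture exactly;
    // otherwise fall back to a non-greedy wildcard.
    private static String bestCandidate(List<String> bits) {
        for (String[] candidate : CANDIDATES) {
            if (bits.isEmpty() == false
                    && bits.stream().allMatch(bit -> bit.matches(candidate[1]))) {
                return candidate[0];
            }
        }
        return ".*?";
    }

    public static void main(String[] args) {
        List<String> examples = List.of("the cat sat on the mat", "the dog sat on the mat");
        // Prints ".*?sat.*?mat.*?" - neither toy candidate matches the gaps
        System.out.println(grokFromCategory(".*?sat.+?mat.*", examples));
    }
}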
fdf4cf5 to 9e2a2df
Looks good. I left a few comments, possibly based on my own misunderstanding.
} else {
    // We should never get here. If we do it implies a bug in the original categorization,
    // as it's produced a regex that doesn't match the examples.
    assert matcher.matches() : exampleProcessor.pattern() + " did not match " + example;
This should be logged for the case where assertions are not enabled
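A minimal sketch of what that could look like (logger wiring and message wording are illustrative, not the PR's actual code):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import java.util.regex.Pattern;

class ExampleMatchCheck {

    private static final Logger logger = LogManager.getLogger(ExampleMatchCheck.class);

    static void onUnmatchedExample(Pattern exampleProcessor, String example) {
        // The assert still trips where assertions are enabled (tests/CI)...
        assert false : exampleProcessor.pattern() + " did not match " + example;
        // ...and the log line covers production, where they are not.
        logger.warn("Pattern [{}] did not match example [{}]", exampleProcessor.pattern(), example);
    }
}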
// "the cat sat on the mat" will result in "the ", " ", " on the ", and "" | ||
// being added to the 4 "in between" collections in that order | ||
for (int groupNum = 1; groupNum <= matcher.groupCount(); ++groupNum) { | ||
if (inBetweenBits.size() < groupNum) { |
After the first pass of the for (String example : examples) loop the array will be populated. Maybe pre-populate inBetweenBits with the new ArrayLists once it is created.
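Something along these lines, perhaps (a fragment, reusing the surrounding method's exampleProcessor):

// The group count is a property of the compiled pattern, not of any particular
// example, so the lists can all be created up front and the size check in the
// example loop goes away.
int groupCount = exampleProcessor.matcher("").groupCount();
List<Collection<String>> inBetweenBits = new ArrayList<>(groupCount);
for (int groupNum = 0; groupNum < groupCount; ++groupNum) {
    inBetweenBits.add(new ArrayList<>());
}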
// Finally, for each collection of "in between" bits we look for the best Grok pattern and incorporate
// it into the overall Grok pattern that will match each example in its entirety
for (int inBetweenBitNum = 0; inBetweenBitNum < inBetweenBits.size(); ++inBetweenBitNum) {
    // Remember (from the first comment in this method) that the first element in this array is
My understanding is that this causes every generated grok pattern to start with '.*?' unless one of ORDERED_CANDIDATE_GROK_PATTERNS matches the empty string. Is that the intention?
Because the Grok patterns are built from the regex that categorization churns out, there's no way to be sure what comes before the first common token for the category. Therefore all the regexes will begin with .*?. (It doesn't make any difference if one of the ORDERED_CANDIDATE_GROK_PATTERNS matches the empty string - this prefix will always be on the regex.)
I agree it could be inefficient and look silly to a human who knows the data, but this is a limitation from trying to reverse engineer Grok patterns from the categorization output. An alternative would be to try to find fields as part of the main categorization algorithm. But that would imply either moving the whole categorization algorithm into Java, or adding Grok functionality to the C++, or using something other than Grok to define how to extract the fields.
Possibly the best thing to do is just release this functionality in its current form as experimental, because several people have said they could get value from this prototype, and switch to something completely different and more complex in the future.
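Concretely, reusing the sample from the code comments: for the category regex `.*?cat.+?sat.+?mat.*` the generated pattern has the shape `.*?cat … sat … mat.*`, and whatever the candidates resolve the inner gaps to, it still opens with `.*?`.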
// E.g., ".*?cat.+?sat.+?mat.*" -> Pattern (.*?)cat(.+?)sat(.+?)mat(.*) | ||
Pattern exampleProcessor = Pattern.compile(regex.replaceAll("(\\.[*+]\\??)", "($1)"), Pattern.DOTALL); | ||
|
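A quick standalone check of what that replaceAll does (hypothetical class name):

import java.util.regex.Pattern;

public class RegexToGroupsDemo {
    public static void main(String[] args) {
        String regex = ".*?cat.+?sat.+?mat.*";
        // Wrap every wildcard run (".*?", ".+?", ".*", ".+") in a capturing group.
        String withGroups = regex.replaceAll("(\\.[*+]\\??)", "($1)");
        System.out.println(withGroups);   // (.*?)cat(.+?)sat(.+?)mat(.*)
        Pattern p = Pattern.compile(withGroups, Pattern.DOTALL);
        System.out.println(p.matcher("the cat sat on the mat").matches());   // true
    }
}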
List<Collection<String>> inBetweenBits = new ArrayList<>(fixedRegexBits.length);
Maybe groupsMatches or matchesForGroups instead of inBetweenBits. This is an array of example matches for each group.
`grok_pattern`::
(string) A Grok pattern that could be used in Logstash or an Ingest Pipeline
to extract fields from messages that match the category. This field is
experimental and may be changed or removed in a future version. The Grok
@lcawl do you know if there are any precedents for a single field that is experimental within a feature that is generally fully supported? Is the way I've documented it here right?
@droberts195 I'd suggest putting "experimental[]" after the data type, per https://github.com/elastic/docs#experimental-and-beta
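Applied to the snippet above, that would look something like:

`grok_pattern`::
(string) experimental[] A Grok pattern that could be used in Logstash or an
Ingest Pipeline to extract fields from messages that match the category.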
LGTM. I left a couple of comments.
@@ -629,7 +630,7 @@ public void bucketRecords(String jobId, Bucket bucket, int from, int size, boole
      * @param from Skip the first N categories. This parameter is for paging
      * @param size Take only this number of categories
      */
-    public void categoryDefinitions(String jobId, Long categoryId, Integer from, Integer size,
+    public void categoryDefinitions(String jobId, Long categoryId, boolean augment, Integer from, Integer size,
The javadoc will need updating for the new parameter
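e.g. (wording is only a suggestion):

/**
 * ...
 * @param augment If true, augment each category definition with a Grok pattern
 *                reverse engineered from its regex and examples
 * @param from    Skip the first N categories. This parameter is for paging
 * @param size    Take only this number of categories
 */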
@@ -41,7 +41,7 @@ protected void doExecute(GetCategoriesAction.Request request, ActionListener<Get

     Integer from = request.getPageParams() != null ? request.getPageParams().getFrom() : null;
     Integer size = request.getPageParams() != null ? request.getPageParams().getSize() : null;
-    jobProvider.categoryDefinitions(request.getJobId(), request.getCategoryId(), from, size,
+    jobProvider.categoryDefinitions(request.getJobId(), request.getCategoryId(), true, from, size,
Should this be exposed as a request parameter?
I'm not sure it's worth it. People can easily ignore the Grok patterns if they're not interested.
This change adds a grok_pattern field to the GET categories API output in ML. It's calculated using the regex and examples in the categorization result, and applying a list of candidate Grok patterns to the bits in between the tokens that are considered to define the category. This can currently be considered a prototype, as the Grok patterns it produces are not optimal. However, enough people have said it would be useful for it to be worthwhile exposing it as experimental functionality for interested parties to try out.