[ML] Reverse engineer Grok patterns from categorization results #30125
Conversation
It has been noted that the regexes we produce in our categorization results are not that far away from Grok patterns that could be used in Logstash to categorize messages at ingest time and do better field extraction for log formats that do not have out-of-the-box patterns.

This change adds a `grok_pattern` field to our GET categories API output. It's calculated by taking the regex and examples in the categorization result and applying a list of candidate Grok patterns to the bits in between the tokens that are considered to define the category.

This can currently be considered a prototype, as there is an outstanding question on how the functionality should work:

* Is calculating the Grok patterns on the fly the best thing to do? It might be better to calculate them when categorization results are created/updated, and store the patterns in a new type of ML results document. Then we could let users manually improve the patterns and remember their edits. But the decision here needs to tie in with the end-to-end story for this functionality. If the intended flow is `ML -> User edits in the Grok debugger -> Logstash config` then maybe there's no need for ML to remember the user edits.
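Since the description is fairly dense, here is a minimal self-contained sketch of the approach. The class name, the two-entry candidate list, and the bestCandidate heuristic are simplified stand-ins, not the PR's actual implementation:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Toy version of the idea: wrap each wildcard run of the category regex in a
 * capturing group, collect what every example captured in each group, then
 * pick a candidate Grok pattern that matches all of a group's captures.
 */
public final class GrokSketch {

    // Hypothetical two-entry candidate list: Grok name -> regex to test with.
    // The real list is longer and ordered from most to least specific.
    private static final String[][] CANDIDATES = {
        { "%{IP:ipaddress}", "\\d{1,3}(\\.\\d{1,3}){3}" },
        { "%{NUMBER:field}", "\\d+" },
    };

    public static String grokFromCategory(String regex, List<String> examples) {
        // ".*?cat.+?sat.+?mat.*" -> "(.*?)cat(.+?)sat(.+?)mat(.*)"
        Pattern processor = Pattern.compile(regex.replaceAll("(\\.[*+]\\??)", "($1)"), Pattern.DOTALL);
        // The fixed tokens between the wildcard runs, e.g. ["", "cat", "sat", "mat"]
        // (category regexes always start with a wildcard, so element 0 is empty)
        String[] fixedBits = regex.split("\\.[*+]\\??");

        int groupCount = processor.matcher("").groupCount();
        List<List<String>> inBetweenBits = new ArrayList<>(groupCount);
        for (int i = 0; i < groupCount; ++i) {
            inBetweenBits.add(new ArrayList<>());
        }
        for (String example : examples) {
            Matcher matcher = processor.matcher(example);
            if (matcher.matches()) {
                for (int group = 1; group <= groupCount; ++group) {
                    inBetweenBits.get(group - 1).add(matcher.group(group));
                }
            }
        }

        StringBuilder grok = new StringBuilder();
        for (int i = 0; i < groupCount; ++i) {
            grok.append(bestCandidate(inBetweenBits.get(i)));
            if (i + 1 < fixedBits.length) {
                grok.append(fixedBits[i + 1]); // the literal token after this gap
            }
        }
        return grok.toString();
    }

    // Crude stand-in: use a candidate only if it matches every capture exactly;
    // otherwise fall back to a non-greedy wildcard.
    private static String bestCandidate(List<String> bits) {
        for (String[] candidate : CANDIDATES) {
            if (bits.isEmpty() == false
                    && bits.stream().allMatch(bit -> bit.matches(candidate[1]))) {
                return candidate[0];
            }
        }
        return ".*?";
    }

    public static void main(String[] args) {
        List<String> examples = List.of("the cat sat on the mat", "the dog sat on the mat");
        // Prints ".*?sat.*?mat.*?" - neither toy candidate matches the gaps
        System.out.println(grokFromCategory(".*?sat.+?mat.*", examples));
    }
}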
fdf4cf5 to 9e2a2df
Looks good. I left a few comments, possibly based on my own misunderstanding.
} else {
    // We should never get here. If we do it implies a bug in the original categorization,
    // as it's produced a regex that doesn't match the examples.
    assert matcher.matches() : exampleProcessor.pattern() + " did not match " + example;
This should be logged for the case where assertions are not enabled
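A minimal sketch of what that could look like (logger wiring and message wording are illustrative, not the PR's actual code):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import java.util.regex.Pattern;

class ExampleMatchCheck {

    private static final Logger logger = LogManager.getLogger(ExampleMatchCheck.class);

    static void onUnmatchedExample(Pattern exampleProcessor, String example) {
        // The assert still trips where assertions are enabled (tests/CI)...
        assert false : exampleProcessor.pattern() + " did not match " + example;
        // ...and the log line covers production, where they are not.
        logger.warn("Pattern [{}] did not match example [{}]", exampleProcessor.pattern(), example);
    }
}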
// "the cat sat on the mat" will result in "the ", " ", " on the ", and "" | ||
// being added to the 4 "in between" collections in that order | ||
for (int groupNum = 1; groupNum <= matcher.groupCount(); ++groupNum) { | ||
if (inBetweenBits.size() < groupNum) { |
After the first pass of the for (String example : examples) loop the array will be populated. Maybe pre-populate inBetweenBits with the new ArrayLists once it is created.
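Something along these lines, perhaps (a fragment, reusing the surrounding method's exampleProcessor):

// The group count is a property of the compiled pattern, not of any particular
// example, so the lists can all be created up front and the size check in the
// example loop goes away.
int groupCount = exampleProcessor.matcher("").groupCount();
List<Collection<String>> inBetweenBits = new ArrayList<>(groupCount);
for (int groupNum = 0; groupNum < groupCount; ++groupNum) {
    inBetweenBits.add(new ArrayList<>());
}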
// Finally, for each collection of "in between" bits we look for the best Grok pattern and incorporate
// it into the overall Grok pattern that will match each example in its entirety
for (int inBetweenBitNum = 0; inBetweenBitNum < inBetweenBits.size(); ++inBetweenBitNum) {
    // Remember (from the first comment in this method) that the first element in this array is
My understanding is that this causes every generated grok pattern to start with '.*?' unless one of ORDERED_CANDIDATE_GROK_PATTERNS matches the empty string. Is that the intention?
Because the Grok patterns are built from the regex that categorization churns out, there's no way to be sure what comes before the first common token for the category. Therefore all the regexes will begin with .*?. (It doesn't make any difference if one of the ORDERED_CANDIDATE_GROK_PATTERNS matches the empty string - this prefix will always be on the regex.)
I agree it could be inefficient and look silly to a human who knows the data, but this is a limitation from trying to reverse engineer Grok patterns from the categorization output. An alternative would be to try to find fields as part of the main categorization algorithm. But that would imply either moving the whole categorization algorithm into Java, or adding Grok functionality to the C++, or using something other than Grok to define how to extract the fields.
Possibly the best thing to do is just release this functionality in its current form as experimental, because several people have said they could get value from this prototype, and switch to something completely different and more complex in the future.
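Concretely, reusing the sample from the code comments: for the category regex `.*?cat.+?sat.+?mat.*` the generated pattern has the shape `.*?cat … sat … mat.*`, and whatever the candidates resolve the inner gaps to, it still opens with `.*?`.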
// E.g., ".*?cat.+?sat.+?mat.*" -> Pattern (.*?)cat(.+?)sat(.+?)mat(.*) | ||
Pattern exampleProcessor = Pattern.compile(regex.replaceAll("(\\.[*+]\\??)", "($1)"), Pattern.DOTALL); | ||
|
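A quick standalone check of what that replaceAll does (hypothetical class name):

import java.util.regex.Pattern;

public class RegexToGroupsDemo {
    public static void main(String[] args) {
        String regex = ".*?cat.+?sat.+?mat.*";
        // Wrap every wildcard run (".*?", ".+?", ".*", ".+") in a capturing group.
        String withGroups = regex.replaceAll("(\\.[*+]\\??)", "($1)");
        System.out.println(withGroups);   // (.*?)cat(.+?)sat(.+?)mat(.*)
        Pattern p = Pattern.compile(withGroups, Pattern.DOTALL);
        System.out.println(p.matcher("the cat sat on the mat").matches());   // true
    }
}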
List<Collection<String>> inBetweenBits = new ArrayList<>(fixedRegexBits.length);
Maybe groupsMatches or matchesForGroups instead of inBetweenBits. This is an array of example matches for each group.
`grok_pattern`::
(string) A Grok pattern that could be used in Logstash or an Ingest Pipeline
to extract fields from messages that match the category. This field is
experimental and may be changed or removed in a future version. The Grok
@lcawl do you know if there are any precedents for a single field that is experimental within a feature that is generally fully supported? Is the way I've documented it here right?
@droberts195 I'd suggest putting "experimental[]" after the data type, per https://github.com/elastic/docs#experimental-and-beta
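Applied to the snippet above, that would look something like:

`grok_pattern`::
(string) experimental[] A Grok pattern that could be used in Logstash or an
Ingest Pipeline to extract fields from messages that match the category.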
LGTM. I left a couple of comments.
@@ -629,7 +630,7 @@ public void bucketRecords(String jobId, Bucket bucket, int from, int size, boole
      * @param from Skip the first N categories. This parameter is for paging
      * @param size Take only this number of categories
      */
-    public void categoryDefinitions(String jobId, Long categoryId, Integer from, Integer size,
+    public void categoryDefinitions(String jobId, Long categoryId, boolean augment, Integer from, Integer size,
The javadoc will need updating for the new parameter
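e.g. (wording is only a suggestion):

/**
 * ...
 * @param augment If true, augment each category definition with a Grok pattern
 *                reverse engineered from its regex and examples
 * @param from    Skip the first N categories. This parameter is for paging
 * @param size    Take only this number of categories
 */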
@@ -41,7 +41,7 @@ protected void doExecute(GetCategoriesAction.Request request, ActionListener<Get

     Integer from = request.getPageParams() != null ? request.getPageParams().getFrom() : null;
     Integer size = request.getPageParams() != null ? request.getPageParams().getSize() : null;
-    jobProvider.categoryDefinitions(request.getJobId(), request.getCategoryId(), from, size,
+    jobProvider.categoryDefinitions(request.getJobId(), request.getCategoryId(), true, from, size,
Should this be exposed as a request parameter?
I'm not sure it's worth it. People can easily ignore the Grok patterns if they're not interested.
This change adds a grok_pattern field to the GET categories API output in ML. It's calculated using the regex and examples in the categorization result, and applying a list of candidate Grok patterns to the bits in between the tokens that are considered to define the category. This can currently be considered a prototype, as the Grok patterns it produces are not optimal. However, enough people have said it would be useful for it to be worthwhile exposing it as experimental functionality for interested parties to try out.