[Logs UI] Research partitioning of log entries for categorization jobs #46610
Comments
Pinging @elastic/infra-logs-ui |
After considering the technical implications and limitations I can see both up- and downsides to the two obvious implementation options: Option 1: One categorization job partitioned via
|
A key point we shouldn't overlook is that categorisation only really makes sense in the context of unstructured log messages (e.g. syslog). If it is applied to structured log messages (e.g. nginx) then the results may not be very insightful. For example, running To be generally useful we need to account for this. |
@stevedodson @sophiec20 @grabowskit I am concerned about the prospect of creating a separate log categorization ML job for each logging dataset. It seems like a messy proposition to manage all these jobs on the Logs UI side. Has any consideration been given to the possibility of enhancing Option 1 described above to include the bucket and the category with its examples in partitioned results? Happy to schedule a short chat to discuss live if that is easier. |
We could fetch partition-specific examples from the underlying logs even with option 1 if we use the terms of the category in combination with a terms filter for the partition. |
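As a rough illustration of that idea, the search for partition-specific examples could be built along these lines. This is only a sketch: the helper name `build_category_examples_query`, the `message` field, and the default `size` are assumptions for illustration, and the category terms are assumed to arrive as the space-separated string that ML reports for a category.

```python
from typing import Any, Dict


def build_category_examples_query(
    category_terms: str,
    dataset: str,
    message_field: str = "message",          # assumed name of the categorized field
    partition_field: str = "event.dataset",  # partition field discussed in this issue
    size: int = 5,                           # assumed number of examples to fetch
) -> Dict[str, Any]:
    """Build an Elasticsearch search body that looks up example log entries
    for one category within one partition (hypothetical helper)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Every term of the category must occur in the message.
                "must": [
                    {
                        "match": {
                            message_field: {
                                "query": category_terms,
                                "operator": "and",
                            }
                        }
                    }
                ],
                # Restrict the matches to the partition of interest.
                "filter": [{"term": {partition_field: dataset}}],
            }
        },
    }


body = build_category_examples_query("connection refused upstream", "nginx.error")
```

A `match` query only checks that the terms are present, not their order, so the returned entries are candidates for the category rather than guaranteed members of it.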
A few comments.

unstructured data
ML categorization requires unstructured log data for meaningful results. It would be good to see what options we have here: is the plan to check for the existence of types of unstructured log messages? If the message contains large chunks of XML, SQL, or numbers, for example, then the analysis becomes polluted. We get category explosion if the log message does not lend itself to being categorized. Picking the right type of log message data makes a big difference in getting useful results. Job memory limits will prevent category explosion from harming the cluster; however, it is not an optimal use of resources, and the results become noisy and less useful. With current ML categorization capabilities, it is best to use some up-front knowledge to assess whether the data is suitable for categorizing.

both rare and count by mlcategory
Ideally we should plan to use both rare and count by mlcategory. Both are useful, as issues tend to manifest themselves as event rate increases/decreases and as unusual log messages. Also, I believe that the count job would be required in order to obtain model plot info, which shows the rate of categories. Rare and count could exist as two detectors in one job.

separate or single ml categorization jobs
Based on current functionality in ML categorization, it is optimal to create a job per log source. However, upon reflection, a single job could be used provided the examples are ignored. ML will show the category definition, and from this definition message examples could be derived from the source data for a known partition.

summary
To round off these comments:
Categorisation is a powerful technique to analyse log messages, especially when they are the right type of message.
Using a single job is possible, but liable to pollution from unsuitable data.
Changing categorization to allow for partitioned examples is not strictly needed, provided examples can be "searched for" in the source data.
On-boarding workflow can be key for best analysis results. We may need to consider a more thoughtful workflow and, if that is not possible, then perhaps changes to ML to enable this. |
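The single-job variant with both detectors, partitioned by dataset, might look roughly like the following job configuration. Treat it as a sketch under assumptions: the helper name, the `message` field, the bucket span, and the memory limit are illustrative values, not settings agreed on in this thread.

```python
from typing import Any, Dict


def build_categorization_job_config(
    partition_field: str = "event.dataset",  # partition field discussed in this issue
    message_field: str = "message",          # assumed field to categorize
    bucket_span: str = "15m",                # assumed bucket span
    model_memory_limit: str = "256mb",       # assumed limit; caps category explosion
) -> Dict[str, Any]:
    """Build an anomaly detection job body with a count and a rare detector
    on mlcategory, both partitioned by dataset (hypothetical helper)."""
    detectors = [
        {
            # Flags unusual increases or decreases in a category's event rate.
            "function": "count",
            "by_field_name": "mlcategory",
            "partition_field_name": partition_field,
        },
        {
            # Flags categories that rarely occur within a partition.
            "function": "rare",
            "by_field_name": "mlcategory",
            "partition_field_name": partition_field,
        },
    ]
    return {
        "analysis_config": {
            "bucket_span": bucket_span,
            "categorization_field_name": message_field,
            "detectors": detectors,
            "influencers": [partition_field, "mlcategory"],
        },
        "analysis_limits": {"model_memory_limit": model_memory_limit},
        # Model plot is enabled so that category rates can be charted.
        "model_plot_config": {"enabled": True},
        "data_description": {"time_field": "@timestamp"},
    }


job_config = build_categorization_job_config()
```

Keeping both detectors in one job means the messages are categorized once and the same categories feed both analyses.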
Thanks for the helpful comments, @sophiec20!
That filter should definitely be there, I guess. Beyond that we could include a blacklist for known-bad
Good point. I just wanted to leave that open until the requirements by the UI are clearer.
👍 The plan of separating the categorization into its own tab could allow for a more specialized and helpful on-boarding process. |
Closing this since the goals of the research task have been achieved. That doesn't mean the discussion can't go on 😉 |
Summary
The goal of this research effort is to determine how the quality of the categories derived by the ML algorithms can be improved. In particular, it should investigate how knowledge about which log type each log entry belongs to (via event.dataset) can be utilized in the job configurations.
Challenges
Acceptance criteria
We have learned...
ecs.dataset field in log entries.