
[Logs UI] Research partitioning of log entries for categorization jobs #46610

Closed
weltenwort opened this issue Sep 25, 2019 · 9 comments

Comments

@weltenwort
Member

weltenwort commented Sep 25, 2019

Summary

The goal of this research effort is to determine how the quality of the categories derived by the ML algorithms can be improved. In particular, we should investigate how knowledge about log entries belonging to distinct log types (via event.dataset) can be utilized in the job configurations.

Challenges

  • The set of log entry types can vary dynamically. The user could, for example, add a new type of logs to their centralized logging setup. Similarly, the use-case might have changed such that a certain type of log entries is no longer ingested. We would ideally be able to accommodate these kinds of changes without requiring the user to take action and without losing the trained model.
  • The wide variety of log ingestion setups can lead to a large number of log entries falling into an "other" category because they lack a proper dataset field. Would that "other" partition still make sense? Would that field become a requirement?
  • If there are separate categorization jobs for separate types of logs, can their anomalies be mixed in a visualization without misrepresenting the data?

Acceptance criteria

We have learned...

  • how to configure ML jobs to take advantage of the event.dataset field in log entries.
  • whether a "catch-all" partition makes sense.
  • what the implications are in terms of storage and compute resource usage.
  • how to dynamically include new datasets in the analysis as the use-cases evolve.
  • what the semantics of the category anomalies are and whether they can be compared across jobs.
@weltenwort weltenwort added Feature:Logs UI Logs UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services R&D Research and development ticket (not meant to produce code, but to make a decision) v7.5.0 labels Sep 25, 2019
@elasticmachine
Contributor

Pinging @elastic/infra-logs-ui

@weltenwort weltenwort changed the title [Logs UI] Research ways to partition the log entry categorization jobs by log type [Logs UI] Research partitioning of log entries for categorization jobs Sep 25, 2019
@weltenwort weltenwort self-assigned this Sep 26, 2019
@weltenwort
Member Author

After considering the technical implications and limitations I can see both up- and downsides to the two obvious implementation options:

Option 1: One categorization job partitioned via partition_field_name

The job would have...

  • a datafeed that contains the same log messages as used for the log entry rate anomaly detection.
  • a categorization_field_name set to message.
  • a rare detector with the by_field_name set to mlcategory and partition_field_name set to event.dataset.
  • the model plot enabled.
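
A minimal sketch of what such a job and datafeed could look like, assuming an illustrative job id, index pattern, and bucket span (none of these are decided here):

```
PUT _ml/anomaly_detectors/log-entry-categories
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "function": "rare",
        "by_field_name": "mlcategory",
        "partition_field_name": "event.dataset"
      }
    ],
    "influencers": ["event.dataset", "mlcategory"]
  },
  "data_description": { "time_field": "@timestamp" },
  "model_plot_config": { "enabled": true }
}

PUT _ml/datafeeds/datafeed-log-entry-categories
{
  "job_id": "log-entry-categories",
  "indices": ["filebeat-*"],
  "query": { "match_all": {} }
}
```

The anomaly records would then carry a partition_field_value with the respective event.dataset, which is what would allow per-partition filtering of the results.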

The following result data structures could be attributed to a partition:

  • the anomaly record
  • the model_plot with the actual count per bucket and partition

A few result data structures would be shared between the partitions:

  • the bucket
  • the category with its examples

Advantages

  • Simplified job management
  • Automatic adaptation to appearing and disappearing partitions
  • Reduced memory (heap) and cpu usage compared to option 2

Disadvantages

  • Examples are not partition-specific

Option 2: Separate categorization jobs per partition field name

There would be a job for each to-be-categorized log type with...

  • a datafeed that contains only the log messages with a specific event.dataset value.
  • a categorization_field_name set to message.
  • a rare detector with the by_field_name set to mlcategory.
  • the model plot enabled.
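
As a sketch, each per-dataset job would look like the Option 1 job without the partition_field_name, fed by a datafeed that narrows the input down to one dataset (the job id and dataset value below are purely illustrative):

```
PUT _ml/datafeeds/datafeed-log-entry-categories-system-syslog
{
  "job_id": "log-entry-categories-system-syslog",
  "indices": ["filebeat-*"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.dataset": "system.syslog" } },
        { "exists": { "field": "message" } }
      ]
    }
  }
}
```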

All result data structures can be attributed to a partition since the jobs are completely distinct.

Advantages

  • Partition-specific examples
  • Fine-grained job deployment

Disadvantages

  • More complex job management
  • Need for manual onboarding of new partitions, by either a user action or a permanent background task
  • Increased memory (heap) and cpu usage compared to option 1


@stevedodson
Contributor

A key point we shouldn't overlook is that categorisation only really makes sense in the context of unstructured log messages (e.g. syslog). If it is applied to structured log messages (e.g. nginx) then the results may not be very insightful. For example, running rare by mlcategory on nginx logs may give confusing results.

To be generally useful we need to account for this.

@tbragin
Contributor

tbragin commented Oct 6, 2019

@stevedodson @sophiec20 @grabowskit I am concerned about the prospect of creating a separate log categorization ML job for each logging dataset. It seems like a messy proposition to manage all these jobs on the Logs UI side. Has any consideration been given to the possibility of enhancing Option 1 described above to include the bucket and the category with its examples in partitioned results? Happy to schedule a short chat to discuss live if that is easier.

cc @jasonrhodes @weltenwort

@weltenwort
Member Author

We could fetch partition-specific examples from the underlying logs even with option 1 if we use the terms of the category in combination with a terms filter for the partition.
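
For illustration (index pattern, dataset, and terms string are placeholders): the terms field of a category definition returned by the ML results API could be combined with a term filter on the partition to pull examples straight from the source indices:

```
GET filebeat-*/_search
{
  "size": 5,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.dataset": "system.syslog" } },
        {
          "match": {
            "message": {
              "query": "Started Session user",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
```

Here the match query string would be the category's terms, so the hits would serve as partition-specific examples.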

@sophiec20
Contributor

sophiec20 commented Oct 7, 2019

A few comments..

unstructured data

ML categorization requires unstructured log data for meaningful results. It would be good to see what options we have here ... is the plan to check for the existence of message? We may need to do more than just turn the same log analysis event rate data into a categorization analysis.

types of unstructured log messages

If the message contains, for example, large chunks of XML, SQL, or numbers, then the analysis becomes polluted. We get category explosion if the log message does not lend itself to being categorized. Picking the right type of log message data makes a big difference in getting useful results.

Job memory limits will prevent the impact of category explosion from being harmful to the cluster, however it is not an optimal use of resources and the results become noisy and less useful.

With current ML categorization capabilities, it is optimal for some sort of up-front knowledge to be used to assess whether the data is suitable for categorization.

both rare and count by mlcategory

Ideally we should plan to use both rare and count by mlcategory. Both are useful, as issues tend to manifest themselves as event rate increases/decreases and as unusual log messages. Also, I believe that the count job would be required in order to obtain the model plot info, which shows the rate of categories.

Rare and count could exist as two detectors in one job.
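
As a sketch of that combination (bucket span and detector descriptions are illustrative), the analysis_config of a single job could carry both detectors:

```
"analysis_config": {
  "bucket_span": "15m",
  "categorization_field_name": "message",
  "detectors": [
    {
      "detector_description": "rare message categories",
      "function": "rare",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset"
    },
    {
      "detector_description": "count per message category",
      "function": "count",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset"
    }
  ]
}
```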

separate or single ml categorization jobs

It is optimal, based on current functionality in ML categorization, to create a job per log source. However, upon reflection, a single job could be used provided the examples are ignored. ML will show the category definition, and from this definition message examples could be derived from the source data for a known partition.

summary

To round off these comments ... Categorisation is a powerful technique to analyse log messages, especially when they are the right type of message. Using a single job is possible, but liable to pollution from unsuitable data. Changing categorization to allow for partitioned examples is not strictly needed, provided examples can be "searched for" in the source data. The on-boarding workflow can be key to getting the best analysis results. We may need to consider a more thoughtful workflow and, if that is not possible, then perhaps changes to ML to enable this.

@weltenwort
Member Author

Thanks for the helpful comments, @sophiec20!

check for the existence of message

That filter should definitely be there, I guess. Beyond that we could include a blacklist of known-bad event.dataset values produced by our own filebeat modules.
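
A sketch of what the datafeed query could look like with both ideas combined; the excluded datasets are purely illustrative stand-ins for structured sources:

```
"query": {
  "bool": {
    "filter": [
      { "exists": { "field": "message" } }
    ],
    "must_not": [
      { "terms": { "event.dataset": ["nginx.access", "apache.access"] } }
    ]
  }
}
```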

both rare and count by mlcategory

Good point. I just wanted to leave that open until the requirements from the UI side are clearer.

On-boarding workflow can be key for best analysis results. We may need to consider a more thoughtful workflow and if that is not possible, then perhaps changes to ML to enable this.

👍 The plan of separating the categorization into its own tab could allow for a more specialized and helpful on-boarding process.

@weltenwort
Member Author

Closing this since the goals of the research task have been achieved. That doesn't mean the discussion can't go on 😉
