
[Logs UI] Research partitioning of log entries for categorization jobs #46610

Closed
weltenwort opened this issue Sep 25, 2019 · 9 comments

Comments

@weltenwort
Member

weltenwort commented Sep 25, 2019

Summary

The goal of this research effort is to determine how the quality of the categories derived by the ML algorithms can be improved. In particular, we should investigate how knowledge about log entries belonging to distinct log types (via event.dataset) can be utilized in the job configurations.

Challenges

  • The set of log entry types can vary dynamically. The user could, for example, add a new type of logs to their centralized logging setup. Similarly, the use-case might have changed such that a certain type of log entries is no longer ingested. We would ideally be able to accommodate these kinds of changes without requiring the user to take action and without losing the trained model.
  • The wide variety of log ingestion setups can lead to a large number of log entries falling into an "other" category because they lack a proper dataset field. Would that "other" partition still make sense? Would that field become a requirement?
  • If there are separate categorization jobs for separate types of logs, can their anomalies be mixed in a visualization without misrepresenting the data?

Acceptance criteria

We have learned...

  • how to configure ML jobs to take advantage of the event.dataset field in log entries.
  • whether a "catch-all" partition makes sense.
  • what the implications are in terms of storage and compute resource usage.
  • how to dynamically include new datasets in the analysis as the use-cases evolve.
  • what the semantics of the category anomalies are and whether they can be compared across jobs.
@weltenwort weltenwort added Feature:Logs UI Logs UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services R&D Research and development ticket (not meant to produce code, but to make a decision) v7.5.0 labels Sep 25, 2019
@elasticmachine
Contributor

Pinging @elastic/infra-logs-ui

@weltenwort weltenwort changed the title [Logs UI] Research ways to partition the log entry categorization jobs by log type [Logs UI] Research partitioning of log entries for categorization jobs Sep 25, 2019
@weltenwort weltenwort self-assigned this Sep 26, 2019
@weltenwort
Member Author

After considering the technical implications and limitations I can see both up- and downsides to the two obvious implementation options:

Option 1: One categorization job partitioned via partition_field_name

The job would have...

  • a datafeed that contains the same log messages as used for the log entry rate anomaly detection.
  • a categorization_field_name set to message.
  • a rare detector with the by_field_name set to mlcategory and partition_field_name set to event.dataset.
  • the model plot enabled.
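
A minimal sketch of what such a job and datafeed could look like, assuming an illustrative job id, index pattern, and bucket span (none of these are decided here):

```
PUT _ml/anomaly_detectors/log-entry-categories
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "function": "rare",
        "by_field_name": "mlcategory",
        "partition_field_name": "event.dataset"
      }
    ],
    "influencers": ["event.dataset", "mlcategory"]
  },
  "data_description": { "time_field": "@timestamp" },
  "model_plot_config": { "enabled": true }
}

PUT _ml/datafeeds/datafeed-log-entry-categories
{
  "job_id": "log-entry-categories",
  "indices": ["filebeat-*"],
  "query": { "match_all": {} }
}
```

The anomaly records would then carry a partition_field_value with the respective event.dataset, which is what would allow per-partition filtering of the results.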

The following result data structures could be attributed to a partition:

  • the anomaly record
  • the model_plot with the actual count per bucket and partition

A few result data structures would be shared between the partitions:

  • the bucket
  • the category with its examples

Advantages

  • Simplified job management
  • Automatic adaptation to appearing and disappearing partitions
  • Reduced memory (heap) and cpu usage compared to option 2

Disadvantages

  • Examples are not partition-specific

Option 2: Separate categorization jobs per partition field name

There would be a job for each to-be-categorized log type with...

  • a datafeed that contains only the log messages with a specific event.dataset value.
  • a categorization_field_name set to message.
  • a rare detector with the by_field_name set to mlcategory.
  • the model plot enabled.
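
As a sketch, each per-dataset job would look like the Option 1 job without the partition_field_name, fed by a datafeed that narrows the input down to one dataset (the job id and dataset value below are purely illustrative):

```
PUT _ml/datafeeds/datafeed-log-entry-categories-system-syslog
{
  "job_id": "log-entry-categories-system-syslog",
  "indices": ["filebeat-*"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.dataset": "system.syslog" } },
        { "exists": { "field": "message" } }
      ]
    }
  }
}
```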

All result data structures can be attributed to a partition since the jobs are completely distinct.

Advantages

  • Partition-specific examples
  • Fine-grained job deployment

Disadvantages

  • More complex job management
  • Need for manual onboarding of new partitions, by either a user action or a permanent background task
  • Increased memory (heap) and cpu usage compared to option 1


@stevedodson
Contributor

A key point we shouldn't overlook is that categorisation only really makes sense in the context of unstructured log messages (e.g. syslog). If it is applied to structured log messages (e.g. nginx) then the results may not be very insightful. For example, running rare by mlcategory on nginx logs may give confusing results.

To be generally useful we need to account for this.

@tbragin
Contributor

tbragin commented Oct 6, 2019

@stevedodson @sophiec20 @grabowskit I am concerned about the prospect of creating a separate log categorization ML job for each logging dataset. It seems like a messy proposition to manage all these jobs on the Logs UI side. Has any consideration been given to the possibility of enhancing Option 1 described above to include the bucket and the category with its examples in partitioned results? Happy to schedule a short chat to discuss live if that is easier.

cc @jasonrhodes @weltenwort

@weltenwort
Member Author

We could fetch partition-specific examples from the underlying logs even with option 1 if we use the terms of the category in combination with a terms filter for the partition.
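
For illustration (index pattern, dataset, and terms string are placeholders): the terms field of a category definition returned by the ML results API could be combined with a term filter on the partition to pull examples straight from the source indices:

```
GET filebeat-*/_search
{
  "size": 5,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.dataset": "system.syslog" } },
        {
          "match": {
            "message": {
              "query": "Started Session user",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
```

Here the match query string would be the category's terms, so the hits would serve as partition-specific examples.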

@sophiec20
Contributor

sophiec20 commented Oct 7, 2019

A few comments..

unstructured data

ML categorization requires unstructured log data for meaningful results. It would be good to see what options we have here ... is the plan to check for the existence of message? We may need to do more than just turn the same log analysis event rate data into a categorization analysis.

types of unstructured log messages

If the message contains, for example, large chunks of XML, SQL, or numbers, then the analysis becomes polluted. We get category explosion if the log message does not lend itself to being categorized. Picking the right type of log message data makes a big difference in getting useful results.

Job memory limits will prevent the impact of category explosion from being harmful to the cluster, however it is not an optimal use of resources and the results become noisy and less useful.

With current ML categorization capabilities, it is optimal for some sort of up-front knowledge to be used to assess whether the data is suitable for categorization.

both rare and count by mlcategory

Ideally we should plan to use both rare and count by mlcategory. Both are useful, as issues tend to manifest themselves as event rate increases/decreases and as unusual log messages. Also, I believe that the count job would be required in order to obtain the model plot info, which shows the rate of categories.

Rare and count could exist as two detectors in one job.
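
As a sketch of that combination (bucket span and detector descriptions are illustrative), the analysis_config of a single job could carry both detectors:

```
"analysis_config": {
  "bucket_span": "15m",
  "categorization_field_name": "message",
  "detectors": [
    {
      "detector_description": "rare message categories",
      "function": "rare",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset"
    },
    {
      "detector_description": "count per message category",
      "function": "count",
      "by_field_name": "mlcategory",
      "partition_field_name": "event.dataset"
    }
  ]
}
```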

separate or single ml categorization jobs

It is optimal, based on current functionality in ML categorization, to create a job per log source. However, upon reflection, a single job could be used provided the examples are ignored. ML will show the category definition, and from this definition message examples could be derived from the source data for a known partition.

summary

To round off these comments ... Categorisation is a powerful technique to analyse log messages, especially when they are the right type of message. Using a single job is possible, but liable to pollution from unsuitable data. Changing categorization to allow for partitioned examples is not strictly needed, provided examples can be "searched for" in the source data. The on-boarding workflow can be key to getting the best analysis results. We may need to consider a more thoughtful workflow and, if that is not possible, then perhaps changes to ML to enable this.

@weltenwort
Member Author

Thanks for the helpful comments, @sophiec20!

check for the existence of message

That filter should definitely be there, I guess. Beyond that we could include a blacklist of known-bad event.dataset values produced by our own filebeat modules.
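
A sketch of what the datafeed query could look like with both ideas combined; the excluded datasets are purely illustrative stand-ins for structured sources:

```
"query": {
  "bool": {
    "filter": [
      { "exists": { "field": "message" } }
    ],
    "must_not": [
      { "terms": { "event.dataset": ["nginx.access", "apache.access"] } }
    ]
  }
}
```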

both rare and count by mlcategory

Good point. I just wanted to leave that open until the requirements from the UI side are clearer.

On-boarding workflow can be key for best analysis results. We may need to consider a more thoughtful workflow and if that is not possible, then perhaps changes to ML to enable this.

👍 The plan of separating the categorization into its own tab could allow for a more specialized and helpful on-boarding process.

@weltenwort
Member Author

Closing this since the goals of the research task have been achieved. That doesn't mean the discussion can't go on 😉
