
[processor/transform] - Log processing capabilities #9410

Open
djaglowski opened this issue Apr 22, 2022 · 7 comments
Labels: data:logs, never stale, pkg/ottl, priority:p2, processor/transform, roadmapping

Comments

@djaglowski (Member)

This is a very rough draft analysis of how the transformprocessor could be enhanced to support the log processing capabilities of the log-collection library. Certainly more careful design would be warranted, but the suggestions and examples are a starting point for conversation.

Path expressions vs "field syntax"

log-collection defines a field syntax that is very similar to transformprocessor's "path expression", but it also allows referring to nested fields within attributes and body. This would be a very important capability for parity.
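
For illustration, the same nested field in both notations (the bracketed form is a hypothetical extension of the path expression grammar, not current syntax):

  • log-collection field syntax: body.wrapper.inner_key
  • hypothetical path expression: set(attributes["inner"], body["wrapper"]["inner_key"])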

Expressions

log-collection exposes an expression engine. This could be represented as a new function called expr(), which would typically be composed into other functions:

  • set(attributes["some.attr"], expr(foo ? 'foo' : 'no foo'))

Alternatively, it may be possible to provide equivalent functions for the same capabilities.
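
As a rough sketch only (expr() is the proposal above, not an existing function, and the config field names are approximate), such a query might appear in a collector config as:

    processors:
      transform:
        logs:
          queries:
            # hypothetical: expr() evaluates a log-collection-style expression
            - set(attributes["some.attr"], expr(foo ? 'foo' : 'no foo'))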

Parsers

log-collection's generic parsers like json, regex, csv, and keyvalue could be represented as functions. These all produce a map[string]interface{}, which could then be set as desired:

  • parse_json(body, attributes)
  • parse_regex(body, '/^.....$/', attributes["tmp"])

A common pattern is to "embed" subparsers into these generic parsers, with the primary benefit being that they only execute if the primary parsing operation succeeded. Possibly this could be represented with some kind of conditional sub-query concept:

  • parse_json(body, attributes)
    • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
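
One hypothetical rendering of that nesting in config (this syntax is invented purely for illustration):

    processors:
      transform:
        logs:
          queries:
            # hypothetical: the indented sub-query runs only if parse_json succeeds
            - parse_json(body, attributes):
                - strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")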

Moving values around

  • set is equivalent to add
  • retain is roughly equivalent to keep_keys, but there appear to be some nuanced differences. Need to look into this more.
  • copy(from, to)
  • move(from, to)
  • remove(attributes["one"], attributes["two"])
  • flatten(attributes["multilayer.map"])
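
For example (using the proposed function names above), a move followed by a flatten might reshape attributes as follows, assuming log-collection's flatten semantics of promoting children one level and dropping the original key:

    # before:
    #   attributes: {"one": 1, "multilayer.map": {"a": 1, "b": 2}}
    #
    # move(attributes["one"], attributes["two"])
    # flatten(attributes["multilayer.map"])
    #
    # after:
    #   attributes: {"two": 1, "a": 1, "b": 2}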

Timestamps

log-collection supports multiple timestamp parsing syntaxes, namely strptime, gotime, and epoch (unix). These would translate fairly easily to functions:

  • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local")
  • gotime(attributes["time"], "Jan 2 15:04:05 MST 2006")
  • unixtime(concat(attributes["seconds"], ".", attributes["nanoseconds"]), "s.ns")
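
For reference, a mapping of the strptime directives above onto a Go reference-time layout (assuming standard strptime directives):

    # strptime directives          equivalent Go reference-time layout
    # %a %b %e %H:%M:%S %Z %Y  ->  Mon Jan _2 15:04:05 MST 2006
    #
    # a timestamp both forms would parse:
    #   Fri Apr 22 10:32:05 UTC 2022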

Severity

log-collection provides a very robust mechanism for interpreting severities, which may be difficult to represent in the syntax of this processor. The main idea of the system is that severity is interpreted according to a mapping. Several out-of-the-box mappings are available, and the user can layer on additional mappings as needed. This allows for a concise configuration, and the implementation can be highly optimized (a single map lookup instead of iteration over many queries).

One way to represent these same capabilities would require a class of functions that produce and/or mutate severity mappings:

  • sevmap_default()
  • sevmap_with(sevmap_empty(), as_warn("w", "warn", "warning", "hey!"), as_error("e", "error", "err", "noo!"))
  • sevmap_with(sevmap_http(), as_fatal(404))
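
Put into a config sketch (every sevmap_* and as_* function here is hypothetical, per the list above):

    processors:
      transform:
        logs:
          queries:
            # hypothetical: interpret attributes["sev"] via a layered mapping
            - severity(attributes["sev"], sevmap_with(sevmap_default(), as_error("noo!")))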

Conditionality

  • This is likely planned, but additional matching operators would be necessary to reach parity, specifically regex matching.
  • It may be useful to provide a way to apply a where clause over multiple related queries (see the sketch after this list):
    • if(condition, run(q1, q2, q3))
    • if(parse_json(body, attributes), run(strptime(...), severity(...)))
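
A sketch of the grouped form in config (if(), run(), and the inner functions are all hypothetical):

    processors:
      transform:
        logs:
          queries:
            # hypothetical: the run() block executes only when parse_json succeeds
            - if(parse_json(body, attributes), run(
                strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y", "Local"),
                severity(attributes["sev"], sevmap_default())))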

Routing

log-collection supports a router capability, which allows users to apply alternate processing paths based on any criteria that can be evaluated against individual logs. A brute force equivalent would be to apply the same where clause repeatedly:

  • strptime(attributes["time"], "%a %b %e %H:%M:%S %Z %Y") where body ~= some_regex
  • severity(attributes["sev"], stdsevmap()) where body ~= some_regex
  • gotime(attributes["timestamp"], "Jan 2 15:04:05 MST 2006") where body ~= other_regex
  • severity(attributes["status.code"], httpsevmap()) where body ~= other_regex

Resource and scope challenges

Logs often contain information about the resource and/or scope that must be parsed from text. Isolating and setting these values is fairly straightforward in a flat data model such as the one used in log-collection, but it's not clear to me whether the pdata model will struggle with this.

For example, suppose a log format is shaped like resource_name,scope_name,message. Should/does transformprocessor create a new pdata.ResourceLogs each time a resource attribute is isolated? Should it cross-reference with existing resources in the pdata.ResourceLogsSlice and combine them? Could it do this performantly? How many log processing functions could trigger this kind of complication? (e.g., move(attributes["resource_name"], resource["name"]))

Need to give more thought to this area especially.
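
To make the concern concrete, a minimal sketch (the parsing steps and move() semantics are hypothetical):

    # incoming record (one pdata.LogRecord):
    #   body: "checkout-svc,payment,order failed"
    #
    # after csv parsing and move(attributes["resource_name"], resource["name"]):
    #   resource: {"name": "checkout-svc"}
    #
    # Two records with different resource_name values would then require either a
    # new pdata.ResourceLogs per value, or a merge with an existing matching
    # resource in the pdata.ResourceLogsSlice.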

@djaglowski djaglowski added the data:logs Logs related issues label Apr 22, 2022
@djaglowski (Member, Author)

cc: @anuraaga @bogdandrutu @dehaansa

@TylerHelmuth (Member)

This is an awesome write-up. My main takeaway is that there is quite a bit of overlap between log-collection and the transformprocessor, with log-collection being the more mature, feature-rich component.

If the goal is to consolidate processors into the transform processor, then log-collection should be merged in, with the gaps in the transform processor's functionality filled so that log-collection's features work as expected. This will also make the transform processor more capable for other signals.

But that is a very significant undertaking. Of all the processors that have been mentioned as candidates for merging into the transform processor, log-collection feels like the most complex. I'm not sure it makes sense to make it the first merger.

@TylerHelmuth (Member)

I am starting to take a look at this, beginning with the json_parser functionality.

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Jan 16, 2023
@TylerHelmuth TylerHelmuth added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Jan 16, 2023
@bencehornak-gls

@djaglowski The parse_regex functionality would be really nice for legacy applications, to extract attributes from log messages that can hardly be changed. The only way to do that right now, AFAIK, is via the experimental logtransform processor; however, that is not meant to be shipped (#29150). So I think that leaves no options for regex-based log parsing. 😢

Is there any plan to add parse_regex for log messages to the transform processor?

@TylerHelmuth (Member)

@bencehornak-gls in the transform processor you want to use the ExtractPatterns converter.
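
For anyone landing here later, a minimal sketch of that approach (the regex and attribute names are examples, not from this thread):

    processors:
      transform:
        log_statements:
          - context: log
            statements:
              # ExtractPatterns builds a map from the named capture groups;
              # merge_maps copies that map into attributes
              - merge_maps(attributes, ExtractPatterns(body, "^(?P<ip>\\S+) (?P<method>\\w+)"), "upsert")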

@bencehornak-gls

Thanks for the tip, @TylerHelmuth, that helped a lot! I overlooked this converter.

@TylerHelmuth TylerHelmuth added the roadmapping Issue describes several feature requests for a topic label Nov 14, 2024