Logs/Events guidance on what to put to attributes vs body #1651

Open
lmolkova opened this issue Dec 5, 2024 · 11 comments

Comments

@lmolkova
Contributor

lmolkova commented Dec 5, 2024

It's not clear what should go into event attributes vs the body.

Some considerations:

  • body, in practice, has much larger size limits
  • body supports complex nested structure
  • attributes are easier/more efficient to query and are easier to index

As a result, each event author needs to decide where each property goes and take into account backend limitations.

We need some clear guidance in semantic conventions, but also this affects event API design.

@jack-berg
Member

I know there's been a lot of back and forth about this, but I think there's a correct answer here:

Use the body to record what happened. Use attributes as a means of capturing supplementary contextual information.

Semantic conventions should define fields for the body. Instrumentation authors should primarily record data to the body. The emit event API should prioritize the body, and provide great UX for adding to it. The emit event API should retain the ability to record attributes, but de-prioritize them and indicate that they are meant for contextualizing the event, rather than recording what happened.

Reasoning:

  • We want to be consistent. We've got to arrive at some guidance so that instrumentation and semconv authors are not each making decisions and reaching different conclusions. An ecosystem where some events put their primary data in the body and others put their primary data in attributes is a bad result.
  • We want to be able to differentiate between what happened and additional contextual information about it. I.e. users should be able to write a processor which enriches the log record with data from Context, without polluting the event itself. This is important because backends might want to do strict validation against the event schema and throw away information which doesn't conform. By putting the event in the body and contextual information in attributes, we have a clear distinction between the two. Backends can treat attributes as a sort of "user space" and apply different validation / retention rules than to the body.
  • On paper, log record body and attributes both support complex types, and there's seemingly limited advantage to choosing one vs the other. But attributes only support complex types for the purpose of bridging. If we're designing a user facing emit event API, we could have this API accept the stricter standard attribute definition (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/README.md#standard-attribute). This is symmetric to attributes in our other user facing APIs (spans, metrics, resources, etc). By putting the event's primary data in the body, we retain the ability to use complex types, and also retain a degree of symmetry with other APIs. We contain the user facing API surface area of complex types to a single place. Nice!
  • If you accept the "we want to be consistent" argument of the first bullet, then we're either putting the primary event data in the body or the attributes. We can prove that body is the right place by examining the result if we put the data in attributes:
    • It seems like we're going out of our way to ignore the body field's semantics: "A value containing the body of the log record." Body is not set for all events. Seems strange.
    • Attributes is the superset of data representing the event and contextual information surrounding it. Hard for consumers to differentiate whether a producer didn't comply with the semantic conventions or just provided supplementary context info.
    • We have a log API operation which is user facing and which supports non-standard attribute definition. This is a first, and asymmetric with other user facing APIs.
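
To make the split concrete, here is a minimal Python sketch of the shape this argues for. It is not the OpenTelemetry API: `EventRecord`, `enrich`, the event name, and the field names are all hypothetical, chosen only to show the body holding what happened while a processor-style step adds context to attributes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class EventRecord:
    """Hypothetical stand-in for a log record that carries an event."""
    event_name: str
    # What happened: may contain nested (complex) values.
    body: Dict[str, Any] = field(default_factory=dict)
    # Supplementary context: the "user space" bag.
    attributes: Dict[str, Any] = field(default_factory=dict)


def enrich(record: EventRecord, context: Dict[str, Any]) -> EventRecord:
    """Processor-style enrichment that only ever touches attributes."""
    record.attributes.update(context)
    return record


# Instrumentation records the event payload in the body (hypothetical fields).
record = EventRecord(
    event_name="connection.dropped",
    body={"reason": "socket error", "retry": {"attempt": 2, "max": 5}},
)

# A later pipeline stage adds context without polluting the event itself.
enrich(record, {"com.acme.tenant.id": "t-17"})
```

After enrichment the body is unchanged, so a backend could validate it against the event schema and apply looser rules to the attributes.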

A few additional reactions:

body, in practice, has much larger size limits

This is only the case because we haven't tackled the problem of how (attribute) limits apply to AnyValue body. This is a miss, and something we need to solve regardless of whether we put the data in body or attributes. I.e. putting the data in body doesn't magically erase backend limitations on size.

attributes are easier/more efficient to query and are easier to index

Since log record attributes support complex types today for bridging purposes, I think backends already need to figure out how to query / index AnyValue. The difference between body and attributes is that body is type AnyValue, while attributes is type Map<String, AnyValue>. But I suspect (and hope) that most events will choose to have a body with type Map<String, AnyValue>, and that the use of other AnyValue types as the body (bool, string, int, double, array, bytes) is the edge case. I don't think that attributes will be any different than body in terms of querying / indexing in practice.
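
As a rough illustration of the gap between the stricter standard attribute definition (primitives and homogeneous arrays of primitives) and AnyValue, here is a small Python sketch. The function name `is_standard_attribute_value` is made up for this example, and Python types only approximate the spec types (`int` for int64, `float` for double).

```python
from typing import Any

# Python approximations of the primitive types allowed in standard attributes.
PRIMITIVES = (str, bool, int, float)


def is_standard_attribute_value(value: Any) -> bool:
    """Return True for a primitive or a homogeneous array of primitives."""
    if isinstance(value, PRIMITIVES):
        return True
    if isinstance(value, (list, tuple)):
        # All elements must be primitives of one and the same type.
        kinds = {type(item) for item in value}
        return len(kinds) <= 1 and all(isinstance(item, PRIMITIVES) for item in value)
    # Maps, bytes, None and nested structures are AnyValue-only territory.
    return False


flat_ok = is_standard_attribute_value(["GET", "POST"])
nested_rejected = not is_standard_attribute_value({"user": {"id": "42"}})
```

Anything this check rejects would need either a body or an AnyValue-typed attribute to be recorded as-is.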

I think we need to tease apart the questions of "how should we instrument this" and "what do backends do". I don't think we should let current backend limitations encourage us to model instrumentation in a less optimal way. We should count on backends evolving over time to reflect the emerging standards being created. In the interim, we should create tools like Event to SpanEvent Bridge to help meet users / backends where they're at today - tools that translate from our idealized representations we use in instrumentation to representations currently supported.

Sorry for the wall of text. Trying to be complete for this elusive question that continues to plague us. 🙂

@lmolkova
Contributor Author

lmolkova commented Dec 7, 2024

I take your argument on consistency and that simple guidance is needed. I'm trying to find a solution that would support other requirements too:

  • correlation across signals with attributes. Let's assume instrumentation reports spans and metrics with attribute user.id (session.id, user_agent.original, etc). Would it be a good experience if event looked like {"key": "value", "another_key": {"user": {"id": "42"}}}? How would someone know that another_key.user.id is the same as user.id on spans? I.e. it breaks correlation and consistency across signals for related telemetry. I'd say this consistency is more important than consistency between different events.
  • query ergonomics. I do not believe backends support indexing individual properties of AnyValue well - we don't use it in otel enough - e.g. Java doesn't even produce AnyValue log attributes. So I challenge your assumption that the indexing problem is already solved. But even if it is, querying using nested properties is going to be more complicated than using attributes.
  • existing practices. We do capture key-value pairs as attributes in log bridges. The same is true for span events. We do it because it's simple, intuitive, efficient (more so than body), and works well with any backend.
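
The correlation concern can be shown with a short Python sketch. The `flatten` helper is hypothetical (it stands in for whatever path scheme a backend might use for nested body fields); the payload is the one from the first bullet.

```python
from typing import Any, Dict


def flatten(value: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested maps into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, item in value.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(item, dict):
            flat.update(flatten(item, path))
        else:
            flat[path] = item
    return flat


span_attributes = {"user.id": "42"}
event_body = {"key": "value", "another_key": {"user": {"id": "42"}}}

body_paths = flatten(event_body)
# The same value is present, but under a different path than on the span.
shared_keys = set(span_attributes) & set(body_paths)
```

Here `body_paths` contains `another_key.user.id` rather than `user.id`, so a key-based join between the span and the event finds nothing in common.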

A few more thoughts:

  • it is hard to differentiate 'what happened' from contextual information about it. When a user loads a page, is user.id contextual information or a part of the event? I don't know, probably both. Also regardless of what it is, the way I use this information is the same - I query/aggregate - e.g. I can check how many times a user loaded a page or how many pages users load and so on.
  • Span events don't have a body. I argue that simple events don't need body at all. Logs do because they usually put important information in the string message. But you can easily write logs without body (e.g. {"event_name": "...connection_dropped", "reason": "socket error", "connection.id": "42"} - why would you duplicate it all in a string body?)
  • backends de-facto have bigger limits for body vs attributes. It's nice to know which parts of telemetry are not expected to be large and which are.

I prefer the following direction (we briefly chatted with @trask and @jsuereth about it in the past):

  • users should use standard attributes for things they'd aggregate on/group by/correlate with other signals
  • users should use event body fields to record:
    • externally defined things (browser page view events, original web server structured log bodies, etc)
    • things that are rarely used for queries, at least not with exact match - e.g. dynamic log string messages or request/response payloads when they are recorded.

[Update]
The TL;DR: use body for things that apply to this event only, so their semantics are less important. Use attributes for properties recorded on different signals (or even different events).
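
A minimal Python sketch of this direction, under the assumption that the set of cross-signal keys is known up front. `SHARED_KEYS` reuses the attribute names mentioned earlier in this comment; `split_fields` and the remaining field names are hypothetical.

```python
from typing import Any, Dict, Tuple

# Keys that are also recorded on spans/metrics (examples from this thread).
SHARED_KEYS = {"user.id", "session.id", "user_agent.original"}


def split_fields(fields: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Route cross-signal keys to attributes and event-only fields to the body."""
    attributes = {k: v for k, v in fields.items() if k in SHARED_KEYS}
    body = {k: v for k, v in fields.items() if k not in SHARED_KEYS}
    return attributes, body


attributes, body = split_fields({
    "user.id": "42",
    "session.id": "abc",
    "referrer": "https://example.com",   # hypothetical event-only field
    "message": "user opened the page",   # dynamic text, rarely queried exactly
})
```

The hard part in practice is deciding what belongs in `SHARED_KEYS`; the sketch only shows the mechanics once that decision has been made.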

@MSNev
Contributor

MSNev commented Dec 9, 2024

correlation across signals with attributes. Let's assume instrumentation reports spans and metrics with attribute user.id (session.id, user_agent.original, etc). Would it be a good experience if event looked like {"key": "value", "another_key": {"user": {"id": "42"}}}? How would someone know that another_key.user.id is the same as user.id on spans? I.e. it breaks correlation and consistency across signals for related telemetry. I'd say this consistency is more important than consistency between different events.

Generally, session, user and other correlation items are not (and should not be) part of the event as they are contextual in nature.

The exception is specific events related to user log on / off and session start / stop; as part of these event definitions, they would define the event where these attributes are required.

How would someone know that another_key.user.id is the same as user.id on spans?

On this, if this situation occurred (because of the specific event), then they would know because the definition of the event would define this. But by default, backends should NOT assume that the "name" (or path) of any content of the body is directly comparable to any other attribute or body field UNLESS the event name is the same.

As defined [in the events data-model

[Update]

Span events don't have a body. I argue that simple events don't need body at all.

Agree, this is also why the event definition "supports" having no body, because there are "simple" events which will not require any specific fields as they would only represent contextual state changes.

And for the existing "span" events there have been many "hacks" to squash the detail into a span event, due to the limitations of span events, so I don't believe that we should be "designing" the log based events around these historical "hacks", but we should acknowledge that there ARE simple events.
An example of such "hacks" in JS: rather than having a single event which encompasses all of the "timings", the (legacy) code actually emits multiple events (each named by the "timing" property, with the value of that "time" and zero attributes).

@jack-berg
Member

Just a quick point of process. It seems like there is actual guidance today on this topic, added in #1263. Is it safe to say we're revisiting and potentially clarifying that guidance?

correlation across signals with attributes.

You're right - to the extent that a field is used within an event with the same semantics as an attribute on a metric or span, I think the best case is to use the same attribute name as from the global registry. Let me lay out a few more details of my vision that (somehow?) I omitted before:

  • The log record body is Any, but for the purpose of the event API we restrict it to say that it's always Map<String, Any>. We also restrict event API attributes to the standard definition. So body is similar to attributes except that values are allowed to be complex types instead of just the primitives and arrays of primitives allowed by standard attribute.
  • The event's primary data (i.e. the payload / whatever is defined by semantic conventions / whatever is recorded by instrumentation) goes in the body as entries in the Map<String, Any>.
  • All event fields are part of the global attribute registry, supporting easy correlation across signals. Since event fields can be complex types, the global attribute registry evolves to be able to support these types, but only allows attributes with complex types to be referenced in event fields.

query ergonomics.

Events end up having two bags of attributes:

  • log_record.attributes values are (typically) primitives and arrays of primitives. All keys are part of global attribute registry. Entries are contextualization, typically added by user in processors / collectors. Semantic conventions don't attempt to define these.
  • log_record.body values may be primitives, arrays of primitives, or other complex types. All keys are part of global attribute registry. Entries constitute what happened, and are recorded by instrumentation. Semantic conventions define these.

Query ergonomics shouldn't be any different because we guarantee that event payloads are Map<String, Any>. A backend that doesn't like two bags of attributes can (but doesn't have to) merge the two together, prioritizing fields from log_record.body. The benefit to backends is that they can validate the log_record.body against the semantic conventions.
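
For a backend that prefers a single bag, the merge could look like this Python sketch. `merge_bags` is a hypothetical helper, and the sample keys are made up; the only rule it encodes is the one stated above, that fields from log_record.body win on conflict.

```python
from typing import Any, Dict


def merge_bags(attributes: Dict[str, Any], body: Dict[str, Any]) -> Dict[str, Any]:
    """Merge both bags into one map, prioritizing fields from the body."""
    merged = dict(attributes)   # start with contextual attributes
    merged.update(body)         # body entries overwrite on key collision
    return merged


merged = merge_bags(
    attributes={"user.id": "from-context", "com.acme.tenant.id": "t-17"},
    body={"user.id": "42", "reason": "socket error"},
)
```

In this example `merged["user.id"]` is `"42"`, the value recorded by instrumentation, while the tenant id from the processor is kept alongside it.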

existing practices.

Agree attributes are simple and intuitive. So let's restrict the event API's body to be Map<String, Any>. Get the same intuitive behavior. Only difference is that values can also be complex types. We tell users to always record fields to the body and have similar API ergonomics to recording attributes. From the user's POV, they're just adding attributes, but with fewer restrictions on the values. Now users don't need to worry about whether something is more appropriate for body or attributes based on whether it may be correlated with other signals, which won't always be intuitive to answer.

it is hard to differentiate 'what happened' from contextual information about it.

I don't think so. Say that instrumentation always records what happened, and doesn't concern itself with any contextual information. Users can optionally enrich the records with additional contextual attributes in processors (or maybe with instrumentation customization callbacks). If it's not recorded by the instrumentation, it's contextual. Everything that we want recorded by the instrumentation is what happened.

Span events don't have a body. I argue that simple events don't need body at all.

It's certainly a direction we could go, but I think it's valuable to separate what is recorded by instrumentation / conforms to semconv from the extra bits of contextual information added by processors. As for span events not having a body, I think we can nail down the mapping between events and span events later, and there are reasonable options to do this.

I prefer the following direction (we briefly chatted with @trask and @jsuereth about it in the past):
users should use standard attributes for things they'd aggregate on/group by/correlate with other signals
users should use event body fields to record:

This guidance meets the consistency criterion, which I think is the most important thing. But I think it's rather difficult guidance to follow for a casual user. Events defined in semconv will likely be able to work out rules of thumb for which fields might be aggregated / correlated with other signals, but casual users recording custom events may consistently struggle to do the right thing.

Also, I just want to mention that while I'm stating an opinion here on how things might work, I don't plan on blocking or getting in the way of whatever the log SIG decides. This is just food for thought. For the purposes of this convo, think of me as a casual user and not wearing my TC hat.

@lmolkova
Contributor Author

lmolkova commented Dec 9, 2024

@jack-berg thanks for the context and sharing your thoughts - it's extremely valuable.

Just a quick point of process. It seems like there is actual guidance today on this topic, added in #1263. Is it safe to say we're revisiting and potentially clarifying that guidance?

yes, I think we need to clarify this guidance.

Let me clarify my understanding on your proposal (with some assumptions on how it solves cross-signal correlation).

  • The event body fields (in Map<String, Any> ) are equivalent to attributes. E.g. since we have attribute user.id on span, events which need to record user.id should use top-level user.id field.
  • Instrumentations SHOULD NOT record attributes on events (or CANNOT?). Contextual information is usually supplied in processors (I want to challenge this BTW).
  • Backends may improve query ergonomics for the top-level primitive body fields and might even treat them in the same way as attributes

Is my understanding correct?

If so, I have a few additional thoughts:

  1. This is simple on the instrumentation side (put everything into event body). It results in weird event payloads where some of the top-level fields are in the registry and you can cross-query using them. Some are scoped to this event and you can't assume the same meaning across events/signals. There is de-facto no way of knowing if a specific field.name is one or another. Inside data-processing pipelines or at a query time you need to care about two different sources of attributes. So I argue this is more complicated on the consumption/processing side.
  2. I'm not convinced that it's clear what goes to the event payload vs context. When connection terminates, where does the connection.id go? It's both 'what happened' and also the context. The network.peer.address for this connection is more of a contextual information, but it's available inside the instrumentation and very hard to add inside the processor. It might be clear to some, but it's absolutely unclear to me and probably many others.

I think we are going in the same direction though - attributes and body fields are the same and intend to transmit the same information. So I have a radical question on why do we need both (given that logs support AnyValue types).

Important

Why don't we put everything into the attributes? You can only correlate across different signals using standard attributes and only logs/events attributes can have complex attribute value types. Backends may index based on attributes and may recognize known types by the attribute name. Queries/post-processing is the same. Everything is the same as today, but we open the door to AnyValue attributes on logs in semconv and implementations.

Body is only used for unstructured things (log messages, request/response payloads/etc). Body is still AnyValue, but everyone can and should treat it as an opaque blob.

This would be the most consistent solution, but a radical one.

@lmolkova
Contributor Author

lmolkova commented Dec 9, 2024

The proposal above applies nicely to existing semconvs

  • feature-flag events - they only need body to support complex (union) type value.
  • gen_ai events - they'd put opt-in (potentially sensitive, large) content into the body with type undefined, but the event itself usually remains valuable without the body
  • mobile events where all defined body fields seem to be applicable to spans and metrics too.
  • azure resource logs where either the whole original azure log event should go unchanged to the body with undefined type (since Azure controls the actual schema) or should be broken down into individual attributes some of which (identity) would be complex.

@jack-berg
Member

Is my understanding correct?

Yes. Whether instrumentations SHOULD NOT or CANNOT record contextual attributes is TBD in my opinion. I would say minimally it's discouraged. But perhaps there is a case for instrumentation customization as discussed here.

It results in weird event payloads where some of the top-level fields are in the registry and you can cross-query using them.

That's already the case tho in other signals. E.g. spans contain attributes which are not present on corresponding metrics. All the fields are part of the global attribute registry, so it's easy to go see what they mean. Albeit you may have to go look at a specific convention to see what the field means in the context of that convention.

Inside data-processing pipelines or at a query time you need to care about two different sources of attributes. So I argue this is more complicated on the consumption/processing side.

One way or another the data processing pipeline is receiving a list of key value pairs, some of which constitute the event payload, some of which are extra (either added by user or instrumentation not conforming to conventions). By splitting payload into body, and additional contextual attributes into attributes, we just make this difference explicit for consuming pipelines. If they don't need to make use of this difference, they can just merge the two sources of attributes together and it's no different than if they were merged to begin with.

I'm not convinced that it's clear what goes to the event payload vs context. When connection terminates, where does the connection.id go? It's both 'what happened' and also the context. The network.peer.address for this connection is more of a contextual information, but it's available inside the instrumentation and very hard to add inside the processor.

The rule of thumb I would define is: if it's a field we want to define in semantic conventions for an event, then put it in the event payload (body). Attributes become purely extra stuff - I'm imagining app specific context things like baggage and the like. In the case of a connection termination event, presumably the connection.id is something we care about defining in semconv, so it would go in the payload (body) according to this rule.

I think we are going in the same direction though - attributes and body fields are the same and intend to transmit the same information. So I have a radical question on why do we need both (given that logs support AnyValue types).

👍 I don't think it's strictly necessary, but it is nice because:

  • Make it easy for backends to differentiate between the data recorded by instrumentation and presumably attempting to conform to semantic conventions, and data added by some other part of the processing pipeline.
  • Seems to best embody the log data model definition of the body field.
  • Keep user facing APIs consistent in that attributes are only standard attributes. (third bullet point from this comment)

Why don't we put everything into the attributes? You can only correlate across different signals using standard attributes and only logs/events attributes can have complex attribute value types. Backends may index based on attributes and may recognize known types by the attribute name. Queries/post-processing is the same. Everything is the same as today, but we open the door to AnyValue attributes on logs in semconv and implementations.

Body is only used for unstructured things (log messages, request/response payloads/etc). Body is still AnyValue, but everyone can and should treat it as an opaque blob.

This question is pretty aligned with my POV, just opting to put everything in attributes instead of body. The thing I like about this (and my vision for the same reason) is that it makes it super simple for users.

If we went in this direction, I would say that events should always leave body blank. And the event API shouldn't provide a mechanism to record to it. When events do want to record unstructured things like request / response payloads, isn't it still better if we record those in an attribute key / value pair where the key describes what the unstructured thing is? I.e. http.client.request_body: <request_body>, http.client.response_body: <response_body>.
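
As a sketch of what that would look like, assuming a plain dict as the record shape: `make_event` and the event name are hypothetical, and the two payload keys are the illustrative names from the sentence above, not existing semantic conventions.

```python
from typing import Any, Dict


def make_event(name: str, attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Build an event whose data lives entirely in attributes; body stays unset."""
    return {"event_name": name, "attributes": dict(attributes), "body": None}


event = make_event(
    "http.client.exchange",  # hypothetical event name
    {
        # Descriptive keys say what each opaque payload is.
        "http.client.request_body": '{"query": "status"}',
        "http.client.response_body": '{"status": "ok"}',
    },
)
```

Each unstructured payload is then addressable by name, which a single anonymous body cannot offer when an event carries more than one of them.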

@jack-berg
Member

After the convo in the 12/10/24 spec SIG, I'm coming around to @lmolkova's proposal to dump everything in attributes.

The highest value argument I saw for my suggestion of putting everything recorded by instrumentation in the body and all contextual information added later in attributes was to make it easy for backends to differentiate the payload for the purpose of validation.

But consider this from the perspective of a backend looking to perform such validation:

  • The backend wants to check that events named foo.bar conform to the semantic conventions for those events as of version 1.31.0, which include expected fields foo.field1 and foo.field2.
  • Semconv has published a new version 1.32.0 which adds foo.field3 to the list of fields for foo.bar. The backend isn't aware of this field.
  • It's processing a log record with event name foo.bar and attributes foo.field1, foo.field2, foo.field3, com.acme.tenant.id. The fields foo.field1 and foo.field2 conform to the expectations of semconv, but what should it make of the extra fields it doesn't know about, foo.field3 and com.acme.tenant.id?
  • The backend can't know whether the extra attributes are part of a new version of the schema it's not aware of, or were extra bits of context provided by the user. For the purposes of its validation, it only knows about foo.field1 and foo.field2 and whether those meet the requirements of semconv. It needs to make a decision (e.g. keep or drop) on what to do about any extra field, regardless of whether it's provided by a new version of semconv or by the user in a processor.
  • The backend's task of validation isn't actually made much easier (or any easier?) by the separation of fields into body and attributes. In both cases, backends will encounter fields they don't recognize and need to make a decision on what to do about them.
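
The scenario above can be walked through in a few lines of Python. `KNOWN_FIELDS` and `classify` are hypothetical, standing in for a backend that only knows the 1.31.0 schema of foo.bar; the field values are placeholders.

```python
from typing import Any, Dict, List, Tuple

# What the backend knows: the foo.bar schema as of semconv 1.31.0.
KNOWN_FIELDS = {"foo.bar": {"foo.field1", "foo.field2"}}


def classify(event_name: str, fields: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Split field names into those the backend recognizes and those it does not."""
    known = KNOWN_FIELDS.get(event_name, set())
    recognized = sorted(k for k in fields if k in known)
    unknown = sorted(k for k in fields if k not in known)
    return recognized, unknown


# Layout A: everything in attributes.
all_attributes = {"foo.field1": 1, "foo.field2": 2, "foo.field3": 3,
                  "com.acme.tenant.id": "t-17"}
_, unknown_a = classify("foo.bar", all_attributes)

# Layout B: payload in body, context in attributes.
body = {"foo.field1": 1, "foo.field2": 2, "foo.field3": 3}
_, unknown_b = classify("foo.bar", body)
```

In layout A the backend sees two unknown fields; in layout B the body still contains the unknown foo.field3, so a keep-or-drop decision is needed either way.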

So if that argument goes away, the rest of my arguments seem weak compared to the arguments in favor of putting everything in attributes.

Consider me in support of @lmolkova's proposal to put everything in attributes.

@lmolkova
Contributor Author

lmolkova commented Dec 10, 2024

Thanks @jack-berg, we had a similar discussion on validation in the Logs SIG call today.

Validation needs a schema (hardcoded or supplied), and given that the schema exists, the validation can be applied to the body and attributes together.

And, while we would not encourage people to modify event payload, it still can be enriched, PII redacted, truncated, post-processed in other ways (e.g. here's the proposal for storing certain properties in an external storage and replacing them with ref URL), so we cannot provide any guarantees that the body received is exactly the same as the body produced.

Other relevant discussions from the Logs SIG:

  • We might still need body (opaque/external/verbose data):
    • we do use Body today for log messages and we probably should keep using body in the log bridges.
    • there are cases in mobile and GenAI domains where we'd want to have binary payloads (crash reports, images, etc) - we might need to use body for those (no final conclusion here)
  • we have a consensus on using attributes for everything structured we'd define in semantic conventions under the assumption that:
  • Some open questions:
    • how would attribute limits apply (size and number of attributes)
    • how to separate 'original' vs 'enriched' attributes and is it necessary
    • how much do we care about schema validation? We should be able to do it in the same way against body and attributes using semantic conventions

@jack-berg
Member

we do use Body today for log messages and we probably should keep using body in the log bridges.

👍

We might still need body (opaque/external/verbose data):
there are cases in mobile and GenAI domains where we'd want to have binary payloads (crash reports, images, etc) - we might need to use body for those (no final conclusion here)

In these cases, wouldn't it still be better to put it in attributes with a descriptive key defining "what" the opaque / external / verbose / binary data is?

@lmolkova
Contributor Author

lmolkova commented Dec 10, 2024

In these cases, wouldn't it still be better to put it in attributes with a descriptive key defining "what" the opaque / external / verbose / binary data is?

probably so, unless we want to have a dedicated place that's known to be opaque / external / verbose / binary (and backends can optimize for it). It can still be solved with a dedicated attribute, but given that we already have body and use it in log bridges for exactly the same purpose, it seems valid to keep using it.
