Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect data_stream.{dataset,namespace} for OTel logs, metrics, and traces #201

Merged
merged 17 commits into from
Feb 6, 2024

Conversation

carsonip
Copy link
Member

@carsonip carsonip commented Jan 15, 2024

  • Parse data_stream.{dataset,namespace} from OTel logs, metrics and traces
  • Route using user-defined data_stream.{dataset,namespace}

Implications / Breaking changes

  • data_stream.dataset and data_stream.namespace are parsed in the order of resource attributes, scope attributes, span attributes, and span events attributes. The latest match wins. e.g. a span event data_stream.dataset will override the span's attributes.
  • Implied by the previous point, data_stream.dataset and data_stream.namespace are inherited from earlier levels if they are not specified at the current level. e.g. if a span does not specify data_stream.dataset, it will use the scope's data_stream.dataset.
  • data_stream.dataset and data_stream.namespace are parsed and no longer set as labels (data_stream_dataset and data_stream_namespace)

Edge cases

  • Although data_stream.dataset and data_stream.namespace are parsed for error span events, only data_stream.namespace will be respected, because data_stream.dataset will be overridden by SetDataStream.
  • For RUM events, although data_stream.dataset and data_stream.namespace are parsed, none of them will be respected as they will be overridden by SetDataStream.

Closes elastic/apm-server#10191

@@ -182,6 +182,16 @@ func (c *Consumer) convertLogRecord(
event.Network.Connection = modelpb.NetworkConnectionFromVTPool()
}
event.Network.Connection.Type = v.Str()
case attributeDataStreamDataset:
if event.DataStream == nil {
event.DataStream = modelpb.DataStreamFromVTPool()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this respect OTLP Resource.attributes and LogRecord.attributes or only one? What about ScopeLogs.Scope.attributes?

Copy link
Member Author

@carsonip carsonip Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. It was only reading LogRecord.attributes, but I've updated the PR to only Resource.attributes (only one). Looking at the code, we never read from ScopeLogs.Scope.attributes. Let me know if it makes more sense to read from both LogRecord.attributes and Resource.attributes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently unspecified in https://github.com/elastic/opentelemetry-dev/blob/main/docs/design-decisions/ingest/routing.md and I think we need to clarify this.

From an OTel Collector perspective, I think if we're only to support one, Scope.attributes would actually match the best. Reason being is that it's typical for a related dataset to emit a distinct Scope, which may come from the same Resource as other instrumentation. For example you may have two scopes, otelcol/hostmetricsreceiver/cpu and otelcol/hostmetricsreceiver/memory, which should be indexed into separate data streams coming from the same resource (host). Similarly you could have logs from two different application SDKs (say nginx and custom java app) from the same host, but should end up in different data streams for easier anaylsis and more compact storage.

That said, we could support all 3 levels with some precedence rules (I would expect LogRecord.attributes > Scope.attributes > Resource.attributes)

@axw have you given this any thought?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain than any of them will cover all scenarios, e.g.:

  • resource attributes aren't appropriate for routing two different data sets produced by one data source, since resource describes the data source
  • instrumentation scope attributes aren't appropriate for routing data for two different service instances, since they would have identical instrumentation, just different resource attributes (e.g. service.instance.id)
  • I'm not sure that signal-level attributes are ever appropriate?

That said, we could support all 3 levels with some precedence rules (I would expect LogRecord.attributes > Scope.attributes > Resource.attributes)

Sounds good, for consistency with the attribute precedence we've already documented (sans top-level fields).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

support all 3 levels with some precedence rules (I would expect LogRecord.attributes > Scope.attributes > Resource.attributes)

done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, we could support all 3 levels with some precedence rules (I would expect LogRecord.attributes > Scope.attributes > Resource.attributes)

I'm revisiting the docs and I'm a little confused here. The attribute precedence described by @joshdover (which I implemented in the PR atm) appears to be in the opposite order to our documented vision.

@joshdover
Copy link

joshdover commented Jan 16, 2024

How well is the existing document schema for these logs indexed into ES when they target data streams that use the default logs-*-* template instead of the logs-apm.app.*-* one?

Note that depending on the discussion in https://github.com/elastic/opentelemetry-dev/pull/59, we may need to append/prepend a otel suffix/prefix to the dataset if the one on the event isn't specified. I think though it would make sense to delay that until we implement https://github.com/elastic/observability-dev/issues/2986 and have new index templates that target these otel data streams.

@carsonip
Copy link
Member Author

How well is the existing document schema for these logs indexed into ES when they target data streams that use the default logs-- template instead of the logs-apm.app.- one?

With a document as simple as

{
  "_index": ".ds-logs-apm.app.unknown_service-default-2024.01.16-000001",
  "_id": "6igPE40B0Q8GnOZbvKyd",
  "_version": 1,
  "_score": 0,
  "_source": {
    "observer": {
      "hostname": "carson-elastic",
      "type": "apm-server",
      "version": "8.13.0"
    },
    "agent": {
      "name": "otlp",
      "version": "unknown"
    },
    "@timestamp": "2024-01-16T16:16:53.948Z",
    "log": {
      "level": "fatal"
    },
    "data_stream": {
      "namespace": "default",
      "type": "logs",
      "dataset": "apm.app.unknown_service"
    },
    "service": {
      "name": "unknown_service",
      "language": {
        "name": "unknown"
      }
    },
    "event": {
      "severity": 21
    },
    "message": "test record",
    "labels": {
      "foo": "bar"
    }
  },
  "fields": {
    "event.severity": [
      21
    ],
    "service.name": [
      "unknown_service"
    ],
    "data_stream.namespace": [
      "default"
    ],
    "message": [
      "test record"
    ],
    "data_stream.type": [
      "logs"
    ],
    "observer.hostname": [
      "carson-elastic"
    ],
    "service.language.name": [
      "unknown"
    ],
    "@timestamp": [
      "2024-01-16T16:16:53.948Z"
    ],
    "observer.type": [
      "apm-server"
    ],
    "observer.version": [
      "8.13.0"
    ],
    "data_stream.dataset": [
      "apm.app.unknown_service"
    ],
    "log.level": [
      "fatal"
    ],
    "agent.name": [
      "otlp"
    ],
    "agent.version": [
      "unknown"
    ],
    "labels.foo": [
      "bar"
    ]
  }
}

They look virtually the same in logs-generic-default vs logs-apm*, at least in Discover. There are subtle differences in the underlying mapping.

logs-apm-mapping.json
logs-generic-mapping.json

Dotted fields are de-dotted, e.g. dot.ted becomes labels.dot_ted in either case and are working well.

@carsonip carsonip marked this pull request as ready for review January 17, 2024 17:05
@carsonip carsonip requested a review from a team as a code owner January 17, 2024 17:05
@carsonip carsonip requested review from axw and joshdover January 17, 2024 17:05
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just one request. I wonder if we should do it for traces and metrics already, given that this is new functionality and won't affect anything unless these attributes are set.

@carsonip carsonip requested a review from lahsivjar January 18, 2024 10:18
@carsonip carsonip changed the title Respect data_stream.namespace and data_stream.dataset for OTel logs Respect data_stream.{namespace,dataset} for OTel logs, metrics, and traces Jan 22, 2024
@carsonip carsonip changed the title Respect data_stream.{namespace,dataset} for OTel logs, metrics, and traces Respect data_stream.{dataset,namespace} for OTel logs, metrics, and traces Jan 22, 2024
@carsonip
Copy link
Member Author

carsonip commented Jan 22, 2024

@axw I've updated the PR to do the same for logs, metrics and traces. What isn't handled is Consumer.convertSpanEvent which are errors and logs from spans.

@carsonip carsonip requested a review from axw January 23, 2024 10:30
// a different datastream.
if isRUMAgentName(event.GetAgent().GetName()) {
event.DataStream.Dataset = rumTracesDataset
}
}
case modelpb.ErrorEventType:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[For reviewer] Errors are not handled in a sense that the dataset is not respected. apm.error will always be used but the namespace is set by input data_stream.namespace.

event.DataStream.Dataset = tracesDataset
// In order to maintain different ILM policies, RUM traces are sent to
// a different datastream.
if isRUMAgentName(event.GetAgent().GetName()) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[For reviewer] RUM check is bypassed if DataStream is specified. Is this desirable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminded me: RUM agents shouldn't be control routing until we have some way for operators to restrict them. Otherwise (unauthenticated, untrusted) RUM agents can create arbitrarily many data streams, and DoS Elasticsearch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[For reviewer] QQ: line 222, 223 otelSpan.Events() and event = event.CloneVT() means that errors and logs from traces automatically inherit data_stream.* from resource, scope and span.

  1. For errors, in the PR we currently always override dataset to apm.error, but namespace will still be set using the input's data_stream.namespace. Since errors are special, so it makes sense to override dataset to apm.error in SetDataStream, but should it be consistent and override namespace as well?
  2. For logs from traces, both dataset and namespace will be set by input's data_stream.*. It makes it indistinguishable from plain otel logs. Is this desirable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we are not reading data_stream.* inside convertSpanEvent. Should we do so as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also take it from the span event attributes, in case there's a need to override the span-level attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also take it from the span event attributes

Q: span event including errors? Does it mean that we are allowing users to override apm.error dataset?

Q: what about events from jaeger? Should we parse data stream from attributes as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: span event including errors? Does it mean that we are allowing users to override apm.error dataset?

🥲 probably shouldn't allow overriding the dataset for now. Maybe later, if we can stop relying on the dataset for identifying errors. I expect we'll end up searching all logs for errors, but too early to say yet.

Q: what about events from jaeger? Should we parse data stream from attributes as well?

I don't think it matters a great deal - IMO whatever option avoids having a special case for Jaeger.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably shouldn't allow overriding the dataset for now

Agreed. Dataset will be parsed at OTel conversion layer but later reset to apm.error by SetDataStream. However, data_stream.namespace will be parsed and respected. Not sure if you consider this a bug.

IMO whatever option avoids having a special case for Jaeger.

Although it means a few more lines, I think it is good to be consistent and do the same for Jaeger.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Dataset will be parsed at OTel conversion layer but later reset to apm.error by SetDataStream. However, data_stream.namespace will be parsed and respected. Not sure if you consider this a bug.

I do, we should not allow untrusted agents to control data stream naming at all.

Although it means a few more lines, I think it is good to be consistent and do the same for Jaeger.

SGTM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do, we should not allow untrusted agents to control data stream naming at all.

Which I see you have implemented, so LGTM

@carsonip
Copy link
Member Author

we may need to append/prepend a otel suffix/prefix to the dataset if the one on the event isn't specified. I think though it would make sense to delay that until we implement https://github.com/elastic/observability-dev/issues/2986 and have new index templates that target these otel data streams.

Yes, it is a separate task to switch default dataset and possibly handle otel and non-otel cases separately. We can delay that. It should not block this PR.

@carsonip
Copy link
Member Author

@axw sorry for dragging you into all these low level talk. I've updated the PR description to list the implications / breaking changes and edge case handling. See if it looks good to you.

Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carsonip I've reviewed the description and skimmed the code, changes LGTM. I think the order that Josh described sounds right, so you're that right it's not the same order as for queries.

I haven't reviewed all the code, let me know if I need to.

@carsonip carsonip merged commit 3b28495 into elastic:main Feb 6, 2024
4 checks passed
carsonip added a commit to carsonip/apm-server that referenced this pull request Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[OpenTelemetry] data_stream.namepace and data_stream.dataset aren't being respected
3 participants