Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-open: Clarify that service.* conventions apply to all telemetry sources #671

Conversation

jack-berg
Copy link
Member

This is the second attempt at #630, which was reverted over concerns that it was merged too quickly.

Including the original PR description below because it still stands:

The question of whether the service.* conventions are applicable only to web services is an important one that has come up several times. If the answer is yes, then everything that isn't a web service needs its own version of service.name, service.instance.id, service.namespace, service.version to uniquely define the thing producing telemetry. We've discussed this at length several times and while we don't have anything written down yet in the spec / semantic-conventions, the actions we've taken (i.e. rejecting alternatives like telemetry.source, app.name, etc) confirm that the service.* attributes are applicable to all telemetry services, not some narrower subset that some people consider a web service.

This PR aims to clarify this to avoid repeating the same discussion.

Some PRs, issues that are related:

The conversation on #630 that took place after merging is relevant to all reviewers.

@jack-berg jack-berg requested review from a team January 25, 2024 16:33
@@ -77,7 +77,7 @@ as specified in the [Resource SDK specification](https://github.com/open-telemet

**type:** `service`

**Description:** A service instance.
**Description:** A telemetry source. OpenTelemetry has adopted a broad interpretation such that every telemetry source is a service. Examples include, but are not limited to: web services, hosts, mobile applications, browser application, edge computing devices, functions as a service, databases, message brokers, etc. Specific types of telemetry sources may have additional conventions defining domain specific information, but the `service` conventions are applicable to all telemetry sources.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would qualify as telemetry source when scraping metrics from a Kubernetes cluster? Is it the cluster component responsible for creating and providing the metrics (the "source" in a technical sense)? Or is it a deployment, node, or container (the entity a piece of telemetry data primarily refers to, in a logical sense)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a good question and probably a key aspect of the design of the scraping component. The prometheus model (commonly used to scrape metrics from k8s clusters) every endpoint scraped is an "instance", and multiple replicated endpoints of an instance are a "job". These ideas map well to the idea of service.name (i.e. prometheus job) and service.instance.id (i.e. prometheus instance). The next question is can the scraper ascribe any other useful bits of information describing / identifying the telemetry source? I.e. does it know its scraping information about a node, deployment, or pod, or does everything role up into just a cluster?

The idea of a common set of identifying resource attributes applicable across all telemetry sources works nicely with the #575 idea of adding a service.type attribute. With these in place, a OTLP receiver can reliably know the identity and type of telemetry producer.

The next question to answer is how correlation works: If all telemetry sources have the same identifying attributes, how do we indicate that a service is running on a host in a cluster? We have attributes for a host and cluster, but when everything uses service.* for identity, we need conventions about how a host and a cluster populate the service.* attributes.

For example, consider a setup with a service running on a host. Both the service and host are telemetry producers:

// Host resource
{ service.name: "jberg", service.instance.id: "abc123", service.type: "host" }

// Service resource
{ service.name: "my-service", "service.instance.id: "def456", service.type: "web_service", host.name: "jberg", host.id: "abc123" }

If we have a convention that states that hosts set service.name=${host.name}, service.instance.id=${host.id}, service.type=host, then we can reliably known that "my-service" is running on the host because it has matching host.name, host.id resource attributes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so service.type would identify the kind of entity that service attributes refer to.

However, it seems to me that with its current proposed definition it doesn't fully cover this use case:

The service.type identifies the product that is deployed as the service.

A value like "host" doesn't match that description, the proposed definition of service.type would need to be generalized to account for that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. This PR doesn't solve all the problems associated with identifying and correlating telemetry sources. It aims to codify just one concept (which I assert we already imply) - that all telemetry sources use the same attributes for identity - which we can build off of in future steps.

A service instance.
A telemetry source. OpenTelemetry has adopted a broad interpretation such that every
telemetry source is a service. Examples include, but are not limited to: web services,
hosts, mobile applications, browser application, edge computing devices, functions as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hosts, mobile applications, browser application, edge computing devices, functions as
hosts, mobile applications, browser applications, edge computing devices, functions as

@joaopgrassi
Copy link
Member

CC @tigrannajaryan @yurishkuro a new PR is here to continue the discussions.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this PR different from #630? Is seems to propose the exact same changes and does not address any of the concerns raised in #630 discussion.

Blocking so it doesn't get merged by accident again.

@jack-berg
Copy link
Member Author

Its the same as #630 because while there was disagreement, there were also 7 approvals.

I like the solution you proposed:

Here's a possible solution:

  • resource.type: service | host | ...
  • resource.name: my-service | my.host.com | ...

Basically just replacing s/service/resource/ seems to address the concerns. And more resource-specific attributes like app.app-store (making it up) are both possible and recognizable because of resource.type. service.name can be treated as a fallback to resource.name (OTEL schema transforms should support such upgrade).

But that doesn't seem possible given this langauge:

Exception: Some resource attributes are embedded in various locations of the
Specification, e.g. the service.* attributes which are required by SDKs to be
produced and have corresponding environment variables defined in general SDK configuration. These resource
attributes MUST NOT be ever changed. They are considered a hard-coded part of
this specification.

@yurishkuro
Copy link
Member

yurishkuro commented Jan 29, 2024

But that doesn't seem possible given this language:

Heh, the wording is certainly sure of itself. But it doesn't bother me that much by itself, the question is what those "various locations" are and what the impact would be to evolve them in backwards-compatible manner.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this should be merged because I don't think all telemetry sources are services. My position is that by default the telemetry emitted by Otel SDK is coming from a Service. However, I think there can be other telemetry sources that are not services.

As an example a Host has system metrics that are telemetry and can be reported and analysed. There is no Service in this picture unless we inject it just for the purpose of having one (e.g. the Collector that collects the metrics is the Service).

@jack-berg
Copy link
Member Author

If all SDKs MUST include service.name, and there are telemetry sources which service.name does not apply to, then there are telemetry sources which can not use an SDK.

This is wrong - SDKs are not meant to be limited to use to the colloquial definition of a service (i.e. web service). We must find a way to not artificially limit the types of telemetry producers SDKs can represent.

Heh, the wording is certainly sure of itself. But it doesn't bother me that much by itself, the question is what those "various locations" are and what the impact would be to evolve them in backwards-compatible manner.

We could:

  • Introduce a new more generic prefix (say telemetry.* for the sake of example), which includes identifying attributes applicable to all telemetry sources
  • Define a mapping between service.* and telemetry.*
  • Include both service.* and telemetry.* by default, and let users configure their pipeline to drop the redundant attributes

@tigrannajaryan
Copy link
Member

This is wrong - SDKs are not meant to be limited to use to the colloquial definition of a service (i.e. web service). We must find a way to not artificially limit the types of telemetry producers SDKs can represent.

I agree.

But that doesn't seem possible given this langauge:

I am not sure I read it that way. Do you see it as a prohibition to have non-Service Resources? It seems to say "don't change attributes which are defined as part of env variables".

@pyohannes
Copy link
Contributor

I like the solution you proposed:

Here's a possible solution:

resource.type: service | host | ...
resource.name: my-service | my.host.com | ...

With a slight modification, such a solution could be fully backward compatible:

resource.type: service | host | ...
<resource.type>.name: my-service | my.host.com | ...

Which would give:

  • resource.type=host, host.name=my.host.com
  • resource.type=service, service.name=my-service

This is a pattern that's already used in semantic conventions, for example if messaging.system=kafka, then you can expect messaging.kafka.* attributes.

As a downside, you don't have a single attribute that you can look at. To achieve that, you'd need to process data.

@jack-berg
Copy link
Member Author

jack-berg commented Jan 30, 2024

Earlier I said:

If all SDKs MUST include service.name, and there are telemetry sources which service.name does not apply to, then there are telemetry sources which can not use an SDK.

It appears I misunderstood the spec with respect to all SDKs needing to produce a resource with a service.name. To clarify, the spec says the following:

The resource spec says:

The SDK MUST provide access to a Resource with at least the attributes listed at Semantic Attributes with SDK-provided Default Value.
Note: This means that it is possible to create and associate a resource that does not have all or any of the SDK-provided attributes present. However, that does not happen by default.

The note directly contradicts my conclusion. Its not only possible to produce a SDK without a service.name, but explicitly allowed.

However, as I also noted, the versioning and stability doc mentions:

Exception: Some resource attributes are embedded in various locations of the
Specification, e.g. the service.* attributes which are required by SDKs to be
produced and have corresponding environment variables defined in general SDK configuration. These resource
attributes MUST NOT be ever changed. They are considered a hard-coded part of
this specification.

This appears to reflect a misunderstanding of the resource spec. It should be considered a bug as it came well after the stabilization of the resource spec.

I still think that this PR represents a valid way to proceed, and is consistent with our rejection proposals for alternative identifying attributes (e.g. rejection of app.name). For us to say that alternative sets of identifying attributes are possible, we need a litmus test for when a class of telemetry producers warrants its own identity.

@tigrannajaryan
Copy link
Member

Its not only possible to produce a SDK without a service.name, but explicitly allowed.

That's my understanding as well.

@joaopgrassi
Copy link
Member

joaopgrassi commented Feb 1, 2024

The resource spec says:

The SDK MUST provide access to a Resource with at least the attributes listed at Semantic Attributes with SDK-provided Default Value.
Note: This means that it is possible to create and associate a resource that does not have all or any of the SDK-provided attributes present. However, that does not happen by default.

The note directly contradicts my conclusion. Its not only possible to produce a SDK without a service.name, but explicitly allowed.

The way I interpret that is: The SDK will only not add the attributes listed at Semantic Attributes with SDK-provided Default Value if the user creates a resource themselves via whatever ways, and associate that instead of the one created by default by the SDK. That makes total sense as it can happen and it's not easy to prevent it.

The paragraph even says: (emphasis mine)

Note: This means that it is possible to create and associate a resource that does not have all or any of the SDK-provided attributes present. However, that does not happen by default

So my interpretation is like yours that SDKs must always fill the service.* required and stable attributes by default.

BTW the PR that introduced that for context: open-telemetry/opentelemetry-specification#1294

@jsuereth
Copy link
Contributor

@jack-berg Do you think this deserves an OTEP or "stronger" location in OpenTelemetry? Semconv CANNOT dictate what the specification does/says. Given complexities around modelling services and SDKs, I think this may require an OTEP.

cc @tigrannajaryan on whether this is something we should tackle as part of Entity / Resource-modelling efforts.

@jack-berg
Copy link
Member Author

Do you think this deserves an OTEP or "stronger" location in OpenTelemetry? Semconv CANNOT dictate what the specification does/says.

Not sure I understand. This PR doesn't appear to make any mandates on the spec.

I think this may require an OTEP.

I opened this because I was under the impression that the strong language in the spec around default resource attributes and our actions to reject attempts to define alternative sets of identifying resource attributes (i.e. such as app.name) implied that the content of this PR was already the community position. It appears that I was mistaken. This is a frustrating conclusion because it means that key questions around resources continue to not have an answer:

  • How do you determine the identity of a telemetry producer with a resource which does not have service.* attributes?
  • Which types of telemetry producers should use service.* attributes?
  • When is it appropriate for a type of telemetry producer to have its own set of identifying attributes?
  • How to use the SDK with an alternative set of identifying attributes?
  • When both service.* and another "defacto" set of identifying attributes are present (i.e. host.*), which represents the identity telemetry producer?

These are really important questions that will continue to frustrate / confuse users, maintainers, and vendors. If we reject this PR, then we as a community (and probably the TC in particular) really ought to make a concerted effort answer these questions.

@tigrannajaryan
Copy link
Member

@jack-berg those are very good questions. @jsuereth and I are working on a potential OTEP that will likely answer some of these. Can we pause this PR for a bit and see if the OTEP helps?

@jack-berg
Copy link
Member Author

@jack-berg those are very good questions. @jsuereth and I are working on a potential OTEP that will likely answer some of these. Can we pause this PR for a bit and see if the OTEP helps?

Yes that's fine with me. I can help with reviews and any prototyping that might be required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

8 participants