Re-open: Clarify that service.* conventions apply to all telemetry sources #671
…urces" (open-telemetry#638) This reverts commit e37eac7.
@@ -77,7 +77,7 @@ as specified in the [Resource SDK specification](https://github.com/open-telemet

 **type:** `service`

-**Description:** A service instance.
+**Description:** A telemetry source. OpenTelemetry has adopted a broad interpretation such that every telemetry source is a service. Examples include, but are not limited to: web services, hosts, mobile applications, browser application, edge computing devices, functions as a service, databases, message brokers, etc. Specific types of telemetry sources may have additional conventions defining domain specific information, but the `service` conventions are applicable to all telemetry sources.
What would qualify as telemetry source when scraping metrics from a Kubernetes cluster? Is it the cluster component responsible for creating and providing the metrics (the "source" in a technical sense)? Or is it a deployment, node, or container (the entity a piece of telemetry data primarily refers to, in a logical sense)?
It's a good question, and probably a key aspect of the design of the scraping component. In the Prometheus model (commonly used to scrape metrics from k8s clusters), every endpoint scraped is an "instance", and a collection of replicated instances is a "job". These ideas map well to `service.name` (i.e. the Prometheus job) and `service.instance.id` (i.e. the Prometheus instance). The next question is whether the scraper can ascribe any other useful information describing or identifying the telemetry source. That is, does it know it's scraping information about a node, deployment, or pod, or does everything roll up into just a cluster?
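To make the mapping concrete, here is a minimal sketch of how a scraper could translate Prometheus target labels into `service.*` resource attributes. The function name `prometheus_labels_to_resource` and the sample label values are hypothetical, and this is not taken from any existing receiver implementation; it only illustrates the job/instance correspondence described above.

```python
def prometheus_labels_to_resource(labels: dict) -> dict:
    """Map Prometheus target labels onto service.* resource attributes.

    Hypothetical helper: "job" becomes service.name and "instance"
    becomes service.instance.id. Labels that are absent are skipped.
    """
    resource = {}
    if "job" in labels:
        resource["service.name"] = labels["job"]
    if "instance" in labels:
        resource["service.instance.id"] = labels["instance"]
    return resource


# Example target labels (made-up values for illustration).
target = {"job": "kubelet", "instance": "10.0.0.7:10250"}
print(prometheus_labels_to_resource(target))
```

Anything beyond these two attributes (node, deployment, pod) would have to come from extra knowledge the scraper has about the target, which is exactly the open question here.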
The idea of a common set of identifying resource attributes applicable across all telemetry sources works nicely with the #575 idea of adding a `service.type` attribute. With these in place, an OTLP receiver can reliably know the identity and type of telemetry producer.
The next question to answer is how correlation works: if all telemetry sources have the same identifying attributes, how do we indicate that a service is running on a host in a cluster? We have attributes for a host and cluster, but when everything uses `service.*` for identity, we need conventions about how a host and a cluster populate the `service.*` attributes.
For example, consider a setup with a service running on a host. Both the service and host are telemetry producers:
// Host resource
{ service.name: "jberg", service.instance.id: "abc123", service.type: "host" }
// Service resource
{ service.name: "my-service", service.instance.id: "def456", service.type: "web_service", host.name: "jberg", host.id: "abc123" }
If we have a convention that states that hosts set `service.name=${host.name}, service.instance.id=${host.id}, service.type=host`, then we can reliably know that "my-service" is running on the host because it has matching `host.name`, `host.id` resource attributes.
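As a rough sketch of how a backend could apply that convention, the snippet below looks up the host resource that a service resource runs on. It assumes the hypothetical convention stated above and the proposed (not yet accepted) `service.type` attribute from #575; the helper names `host_resource` and `find_host` are made up for illustration.

```python
from typing import Optional


def host_resource(host_name: str, host_id: str) -> dict:
    """Build a host's resource under the assumed convention:
    service.name=${host.name}, service.instance.id=${host.id}, service.type=host.
    """
    return {
        "service.name": host_name,
        "service.instance.id": host_id,
        "service.type": "host",  # proposed attribute, see #575
    }


def find_host(service: dict, resources: list) -> Optional[dict]:
    """Return the host resource whose identity matches the service's
    host.name / host.id attributes, or None if there is no match."""
    name, host_id = service.get("host.name"), service.get("host.id")
    if name is None or host_id is None:
        return None  # the service carries no host attributes to correlate on
    for candidate in resources:
        if (
            candidate.get("service.type") == "host"
            and candidate.get("service.name") == name
            and candidate.get("service.instance.id") == host_id
        ):
            return candidate
    return None


host = host_resource("jberg", "abc123")
service = {
    "service.name": "my-service",
    "service.instance.id": "def456",
    "service.type": "web_service",
    "host.name": "jberg",
    "host.id": "abc123",
}
print(find_host(service, [host, service]))
```

The lookup only works if every host producer follows the same convention, which is why the convention itself would need to be written down.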
I see, so `service.type` would identify the kind of entity that `service` attributes refer to.
However, it seems to me that with its current proposed definition it doesn't fully cover this use case:
> The `service.type` identifies the product that is deployed as the service.
A value like "host" doesn't match that description; the proposed definition of `service.type` would need to be generalized to account for that.
Correct. This PR doesn't solve all the problems associated with identifying and correlating telemetry sources. It aims to codify just one concept (which I assert we already imply) - that all telemetry sources use the same attributes for identity - which we can build off of in future steps.
-A service instance.
+A telemetry source. OpenTelemetry has adopted a broad interpretation such that every
+telemetry source is a service. Examples include, but are not limited to: web services,
+hosts, mobile applications, browser application, edge computing devices, functions as
-hosts, mobile applications, browser application, edge computing devices, functions as
+hosts, mobile applications, browser applications, edge computing devices, functions as
CC @tigrannajaryan @yurishkuro a new PR is here to continue the discussions.
It's the same as #630 because while there was disagreement, there were also 7 approvals. I like the solution you proposed:
But that doesn't seem possible given this language:
Heh, the wording is certainly sure of itself. But it doesn't bother me that much by itself; the question is what those "various locations" are and what the impact would be to evolve them in a backwards-compatible manner.
I do not think this should be merged because I don't think all telemetry sources are services. My position is that by default the telemetry emitted by Otel SDK is coming from a Service. However, I think there can be other telemetry sources that are not services.
As an example a Host has system metrics that are telemetry and can be reported and analysed. There is no Service in this picture unless we inject it just for the purpose of having one (e.g. the Collector that collects the metrics is the Service).
If all SDKs MUST include This is wrong - SDKs are not meant to be limited to use to the colloquial definition of a service (i.e. web service). We must find a way to not artificially limit the types of telemetry producers SDKs can represent.
We could:
I agree.
I am not sure I read it that way. Do you see it as a prohibition to have non-Service Resources? It seems to say "don't change attributes which are defined as part of env variables".
With a slight modification, such a solution could be fully backward compatible:
Which would give:
This is a pattern that's already used in semantic conventions, for example if As a downside, you don't have a single attribute that you can look at. To achieve that, you'd need to process data.
Earlier I said:
It appears I misunderstood the spec with respect to all SDKs needing to produce a resource with a service.name. To clarify, the resource spec says:
The note directly contradicts my conclusion. It's not only possible to produce an SDK without a However, as I also noted, the versioning and stability doc mentions:
This appears to reflect a misunderstanding of the resource spec. It should be considered a bug, as it came well after the stabilization of the resource spec. I still think that this PR represents a valid way to proceed, and is consistent with our rejection of proposals for alternative identifying attributes (e.g. rejection of app.name). For us to say that alternative sets of identifying attributes are possible, we need a litmus test for when a class of telemetry producers warrants its own identity.
That's my understanding as well.
The way I interpret that is: The SDK will only not add the attributes listed at Semantic Attributes with SDK-provided Default Value if the user creates a resource themselves via whatever ways, and associate that instead of the one created by default by the SDK. That makes total sense as it can happen and it's not easy to prevent it. The paragraph even says: (emphasis mine)
So my interpretation is like yours that SDKs must always fill the BTW the PR that introduced that for context: open-telemetry/opentelemetry-specification#1294
@jack-berg Do you think this deserves an OTEP or "stronger" location in OpenTelemetry? Semconv CANNOT dictate what the specification does/says. Given complexities around modelling services and SDKs, I think this may require an OTEP. cc @tigrannajaryan on whether this is something we should tackle as part of Entity / Resource-modelling efforts.
Not sure I understand. This PR doesn't appear to make any mandates on the spec.
I opened this because I was under the impression that the strong language in the spec around default resource attributes and our actions to reject attempts to define alternative sets of identifying resource attributes (i.e. such as
These are really important questions that will continue to frustrate / confuse users, maintainers, and vendors. If we reject this PR, then we as a community (and probably the TC in particular) really ought to make a concerted effort to answer these questions.
@jack-berg those are very good questions. @jsuereth and I are working on a potential OTEP that will likely answer some of these. Can we pause this PR for a bit and see if the OTEP helps?
Yes that's fine with me. I can help with reviews and any prototyping that might be required.
This is the second attempt at #630, which was reverted over concerns that it was merged too quickly.
Including the original PR description below because it still stands:
The question of whether the `service.*` conventions are applicable only to web services is an important one that has come up several times. If the answer is yes, then everything that isn't a web service needs its own version of `service.name`, `service.instance.id`, `service.namespace`, `service.version` to uniquely define the thing producing telemetry. We've discussed this at length several times and while we don't have anything written down yet in the spec / semantic-conventions, the actions we've taken (i.e. rejecting alternatives like `telemetry.source`, `app.name`, etc) confirm that the `service.*` attributes are applicable to all telemetry services, not some narrower subset that some people consider a web service.
This PR aims to clarify this to avoid repeating the same discussion.
Some PRs, issues that are related:
The conversation on #630 that took place after merging is relevant to all reviewers.