forked from open-telemetry/opentelemetry-specification
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Proposed scenarios and roadmap for messaging semantic conventions for…
… tracing (open-telemetry#173) [Semantic conventions for messaging systems for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md) are available, but are in an experimental state. A [workgroup focusing on messaging semantic conventions](open-telemetry/community#819) will work on bringing the existing semantic conventions for messaging to a stable state. The workgroup meets on **Thursdays at 8AM PST**. This documents proposes a scope for an initial stable version of messaging semantic conventions, as well as a roadmap. It should serve as a starting point for initial discussions in the workgroup and, once agreed on, define the further agenda of the workgroup.
- Loading branch information
1 parent
c861c17
commit c7e49fe
Showing
1 changed file
with
264 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,264 @@ | ||
# Scenarios for Tracing semantic conventions for messaging | ||
|
||
This document aims to capture scenarios and a road map, both of which will | ||
serve as a basis for [stabilizing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable) | ||
the [existing semantic conventions for messaging](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md), | ||
which are currently in an [experimental](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#experimental) | ||
state. The goal is to declare messaging semantic conventions stable before the | ||
end of 2021. | ||
|
||
## Motivation | ||
|
||
Many observability scenarios involve messaging systems, event streaming, or | ||
event-driven architectures. For Distributed Tracing to be useful across the | ||
entire scenario, having good observability for messaging or eventing operations | ||
is critical. To achieve this, OpenTelemetry must provide stable conventions and | ||
guidelines for instrumenting those operations. Popular messaging systems that | ||
should be supported include Kafka, RabbitMQ, Apache RocketMQ, Azure Event Hubs | ||
and Service Bus, Amazon SQS, SNS, and Kinesis. | ||
|
||
Bringing the existing experimental semantic conventions for messaging to a | ||
stable state is a crucial step for users and instrumentation authors, as it | ||
allows them to rely on [stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability), | ||
and thus to ship and use stable instrumentation. | ||
|
||
## Roadmap | ||
|
||
1. This OTEP, consisting of scenarios and a proposed roadmap, is approved and | ||
merged. | ||
2. [Stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability) | ||
for semantic conventions are approved and merged. This is not strictly related | ||
to semantic conventions for messaging but is a prerequisite for stabilizing any | ||
semantic conventions. | ||
3. OTEPs proposing guidance for general instrumentation problems that also | ||
pertain to messaging are approved and merged. Those general instrumentation | ||
problems include retries and instrumentation layers. | ||
4. An OTEP proposing a set of attributes and conventions covering the scenarios | ||
in this document is approved and merged. | ||
5. Proposed specification changes are verified by prototypes for the scenarios | ||
and examples below. | ||
6. The [specification for messaging semantic conventions for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md) | ||
are updated according to the OTEP mentioned above and are declared | ||
[stable](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable). | ||
|
||
The steps in the roadmap don't necessarily need to happen in the given order, | ||
some steps can be worked on in parallel. | ||
|
||
## Terminology | ||
|
||
The terminology used in this document is based on the [CloudEvents specification](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md). | ||
CloudEvents is hosted by the CNCF and provides a specification for describing | ||
event data in common formats to provide interoperability across services, | ||
platforms and systems. | ||
|
||
### Message | ||
|
||
A "message" is a transport envelope for the transfer of information. The | ||
information is a combination of a payload and metadata. Metadata can be | ||
directed at consumers or at intermediaries on the message path. Messages are | ||
transferred via one or more intermediaries. Messages are uniquely | ||
identifiable. | ||
|
||
In the strict sense, a _message_ is a payload that is sent to a specific | ||
destination, whereas an _event_ is a signal emitted by a component upon | ||
reaching a given state. This document is agnostic of those differences and uses | ||
the term "message" in a wider sense to cover both concepts. | ||
|
||
### Producer | ||
|
||
The "producer" is a specific instance, process or device that creates and | ||
publishes a message. "Publishing" is the process of sending a message or batch | ||
to the intermediary or consumer. | ||
|
||
### Consumer | ||
|
||
A "consumer" receives the message and acts upon it. It uses the context and | ||
data to execute some logic, which might lead to the occurrence of new events. | ||
|
||
The consumer receives, processes, and settles a message. "Receiving" is the | ||
process of obtaining a message from the intermediary, "processing" is the | ||
process of acting on the information a message contains, "settling" is the | ||
process of notifying an intermediary that a message was processed successfully. | ||
|
||
### Intermediary | ||
|
||
An "intermediary" receives a message to forward it to the next receiver, which | ||
might be another intermediary or a consumer. | ||
|
||
## Scenarios | ||
|
||
Producing and consuming a message involves five stages: | ||
|
||
``` | ||
PRODUCER | ||
Create | ||
| CONSUMER | ||
v +--------------+ | ||
Publish -> | INTERMEDIARY | -> Receive | ||
+--------------+ | | ||
^ v | ||
. Process | ||
. | | ||
. v | ||
. . . . . . Settle | ||
``` | ||
|
||
1. The producer creates a message. | ||
2. The producer publishes the message to an intermediary. | ||
3. The consumer receives the message from an intermediary. | ||
4. The consumer processes the message. | ||
5. The consumer settles the message by notifying the intermediary that the | ||
message was processed. In some cases (fire-and-forget), the settlement stage | ||
does not exist. | ||
|
||
The messaging semantic conventions need to define how to model those stages in | ||
traces, how to propagate context, and how to enrich traces with attributes. | ||
Failures and retries need to be handled in all stages that interface with the | ||
intermediary (publish, receive and settle) and will be covered by general | ||
instrumentation guidance. | ||
|
||
Based on this model, the following scenarios capture major requirements and | ||
can be used for prototyping, as examples, and as test cases. | ||
|
||
### Individual settlement | ||
|
||
Individual settlement systems imply independent logical message flows. A single | ||
message is created and published in the same context, and it's delivered, | ||
consumed, and settled as a single entity. Each message needs to be settled | ||
individually. Usually, settlement information is stored by the intermediary, not | ||
by the consumer. | ||
|
||
Transport batching can be treated as a special case: messages can be | ||
transported together as an optimization, but are produced and consumed | ||
individually. | ||
|
||
As the diagram below shows, each message can be settled individually, | ||
regardless of the position of the message in the stream or queue. In contrast | ||
to checkpoint-based settlement, settlement information is related to individual | ||
messages and not to the overall message stream. | ||
|
||
``` | ||
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+ | ||
|Message A| |Message B| |Message C| |Message D| |Message E| |Message F| | ||
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+ | ||
Settled Settled Settled | ||
``` | ||
|
||
#### Examples | ||
|
||
1. The following configurations should be instrumented and tested for RabbitMQ | ||
or a similar messaging system: | ||
|
||
* 1 producer, 1 queue, 2 consumers | ||
* 1 producer, fanout exchange to 2 queues, 2 consumers | ||
* 2 producers, fanout exchange to 2 queues, 2 consumers | ||
|
||
Each of the producers continuously produces messages. | ||
|
||
### Checkpoint-based settlement | ||
|
||
Messages are processed as a stream and settled by moving a checkpoint. A | ||
checkpoint points to a position of the stream up to which messages were | ||
processed and settled. Messages cannot be settled individually, instead, the | ||
checkpoint needs to be forwarded. Usually, the consumer is responsible for | ||
storing checkpointing information, not the intermediary. | ||
|
||
Checkpoint-based settlement systems are designed to efficiently receive and | ||
settle batches of messages. However, it is not possible to settle messages | ||
independent of their position in the stream (e. g., if message B is located at | ||
a later position in the stream than message A, then message B cannot be settled | ||
without also settling message A). | ||
|
||
As the diagram below shows, messages cannot be settled individually. Instead, | ||
settlement information is related to the overall ordered message stream. | ||
|
||
``` | ||
Checkpoint | ||
| | ||
v | ||
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+ | ||
|Message A| |Message B| |Message C| |Message D| |Message E| |Message F| | ||
+---------+ +---------+ +---------+ +---------+ +---------+ +---------+ | ||
<--- Settled | ||
``` | ||
|
||
#### Examples | ||
|
||
1. The following configurations should be instrumented and tested for Kafka or | ||
a similar messaging system: | ||
|
||
* 1 producer, 2 consumers in the same consumer group | ||
* 1 producer, 2 consumers in different consumer groups | ||
* 2 producers, 2 consumers in the same consumer group | ||
|
||
Each of the producers produces a continuous stream of messages. | ||
|
||
## Open questions | ||
|
||
The following areas are considered out-of-scope of a first stable release of | ||
semantic conventions for messaging. While not being explicitly considered for | ||
a first stable release, it is important to ensure that this first stable | ||
release can serve as a solid foundation for further improvements in these areas. | ||
|
||
### Sampling | ||
|
||
The current experimental semantic conventions rely heavily on span links as | ||
a way to correlate spans. This is necessary, as several traces are needed to | ||
model the complete path that a message takes through the system. With the currently | ||
available sampling capabilities of OpenTelemetry, it is not possible to ensure | ||
that a set of linked traces is sampled. As a result, it is unlikely to sample a | ||
set of traces that covers the complete path a message takes. | ||
|
||
Solving this problem requires a solution for sampling based on span links, | ||
which is not in scope for this OTEP. | ||
|
||
However, having a too high number of span links in a single trace or having too | ||
many traces linked together can make the visualization and analysis of traces | ||
inefficient. This problem is not related to sampling and needs to be addressed | ||
by the semantic conventions. | ||
|
||
### Instrumenting intermediaries | ||
|
||
Instrumenting intermediaries can be valuable for debugging configuration or | ||
performance issues, or for detecting specific intermediary failures. | ||
|
||
Stable semantic conventions for instrumenting intermediaries can be provided at | ||
a future point in time, but are not in scope for this OTEP. The messaging | ||
semantic conventions this document refers to need to provide instrumentation | ||
that works well without the need to have intermediaries instrumented. | ||
|
||
### Metrics | ||
|
||
Messaging semantic conventions for tracing and for metrics overlap and should | ||
be as consistent as possible. However, semantic conventions for metrics will be | ||
handled separately and are not in scope for this OTEP. | ||
|
||
### Asynchronous message passing in the wider sense | ||
|
||
Asynchronous message passing in the wider sense is a communication method | ||
wherein the system puts a message in a queue or channel and does not require an | ||
immediate response to continue processing. This can range from utilizing a | ||
simple queue implementation to a full-fledged messaging system. | ||
|
||
Messaging semantic conventions are intended for systems that fit into one of | ||
the [scenarios laid out in the previous section](#scenarios), which cover a | ||
significant part of asynchronous message passing applications. However, there | ||
are low-level patterns of asynchronous message passing that don't fit in any of | ||
those scenarios, e. g. channels in Go, or message passing in Erlang. Those | ||
might be covered by a different set of semantic conventions in the future. | ||
|
||
There also exist several frameworks for queuing and executing background jobs, | ||
often those frameworks utilize patterns of asynchronous message passing to | ||
queue jobs. Those frameworks might utilize messaging semantic conventions if | ||
they fit in any of the [scenarios laid out in the previous section](#scenarios), | ||
but otherwise targeting those various frameworks is not an explicit goal for | ||
these conventions. Those frameworks might be covered by [semantic conventions for "jobs"](https://github.com/open-telemetry/opentelemetry-specification/pull/1582) | ||
in the future. | ||
|
||
## Further reading | ||
|
||
* [CloudEvents](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md) | ||
* [Message-Driven (in contrast to Event-Driven)](https://www.reactivemanifesto.org/glossary#Message-Driven) | ||
* [Asynchronous message passing](https://en.wikipedia.org/wiki/Message_passing#Asynchronous_message_passing) | ||
* [Existing semantic conventions for messaging](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md) |