From 3eae9327618bab72ca4af0d51829e3c043fcad16 Mon Sep 17 00:00:00 2001
From: Yash Tibrewal
Date: Fri, 21 Jul 2023 00:44:16 +0000
Subject: [PATCH 01/30] OTel metrics proposal

---
 A66-otel-stats.md | 223 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)
 create mode 100644 A66-otel-stats.md

diff --git a/A66-otel-stats.md b/A66-otel-stats.md
new file mode 100644
index 000000000..dd16ff9f4
--- /dev/null
+++ b/A66-otel-stats.md
@@ -0,0 +1,223 @@
+OpenTelemetry Metrics
+----
+* Author: Yash Tibrewal (@yashykt)
+* Approver: Mark Roth (@markdroth)
+* Status: Draft
+* Implemented in:
+* Last updated: Jul 20, 2023
+* Discussion at: (filled after thread exists)
+
+## Abstract
+
+This doc proposes a data model for gRPC OpenTelemetry metrics.
+
+## Background
+
+There are a collection of [metrics](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md) proposed by OpenCensus for gRPC. OpenCensus is no longer being actively maintained and is being deprecated, with OpenTelemetry suggested as the successor framework.
+
+### Related Proposals:
+* [gRPC Retry Stats](A45-retry-stats.md)
+
+## Proposal
+
+### Units
+
+Following the [OpenTelemetry Metrics Semantic Conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/), the following units are used -
+* Latencies are measured in float64 seconds, `s`
+* Sizes are measured in bytes, `By`
+* Counts for number of calls are measured in `{call}`
+* Counts for number of attempts are measured in `{attempt}`
+
+Buckets for histograms in default views should be as follows -
+* Latency : 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100
+* Size : 0, 1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456, 1073741824, 4294967296
+* Count : 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
+
+### Attributes
+* `grpc.method` : Full gRPC method name, including package, service and method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow"
+* `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", "DEADLINE_EXCEEDED"
+* `grpc.target` : Target URI used when creating gRPC Channel, e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000"
+* `grpc.authority` : Authority used by the call/attempt, e.g. "pubsub.googleapis.com", "helloworld-gke"
+
+### Client Per-Attempt Instruments
+
+* **grpc.client.attempt.started**
+The total number of RPC attempts started, including those that have not completed.
+*Attributes*: grpc.method, grpc.target
+*Type*: Counter
+*Unit*: {attempt}
+* **grpc.client.attempt.duration**
+End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel.
+*Attributes*: grpc.method, grpc.target, grpc.status
+*Type*: Histogram (Latency Buckets)
+* **grpc.client.attempt.sent_total_compressed_message_size**
+Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
+*Attributes*: grpc.method, grpc.target, grpc.status
+*Type*: Histogram (Size Buckets)
+* **grpc.client.attempt.rcvd_total_compressed_message_size**
+Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
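As a minimal sketch of how a per-attempt size could be accumulated and then placed into the Size buckets listed under Units: the names `kSizeBounds`, `AttemptSizeRecorder` and `SizeBucketIndex` are hypothetical and are not part of gRPC or OpenTelemetry. The sketch assumes upper-inclusive bucket boundaries, as used by OpenTelemetry explicit-bucket histograms; in practice the OpenTelemetry SDK would perform the bucketing.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size bucket boundaries in bytes, copied from the Units section.
constexpr std::array<std::uint64_t, 14> kSizeBounds = {
    0,        1024,      2048,       4096,      16384,
    65536,    262144,    1048576,    4194304,   16777216,
    67108864, 268435456, 1073741824, 4294967296};

// Hypothetical helper: sums the compressed message bytes of one RPC attempt.
// Metadata and framing bytes are simply never passed in.
class AttemptSizeRecorder {
 public:
  void OnMessageSent(std::uint64_t compressed_bytes) {
    total_ += compressed_bytes;
  }
  std::uint64_t total() const { return total_; }

 private:
  std::uint64_t total_ = 0;
};

// Returns the index of the bucket that `bytes` falls into, assuming bucket i
// covers (kSizeBounds[i-1], kSizeBounds[i]]. Index kSizeBounds.size() is the
// overflow bucket for values above the last boundary.
std::size_t SizeBucketIndex(std::uint64_t bytes) {
  assert(std::is_sorted(kSizeBounds.begin(), kSizeBounds.end()));
  const auto it =
      std::lower_bound(kSizeBounds.begin(), kSizeBounds.end(), bytes);
  return static_cast<std::size_t>(it - kSizeBounds.begin());
}
```

Recording one total per attempt, rather than one value per message, matches the "per RPC attempt" wording of the size instruments above.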
+ +### Client Per-Call Instruments + +* **grpc.client.call.duration**
+This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective.
+Start timestamp - After the client application starts the RPC.
+End timestamp - Before the status of the RPC is delivered to the application.
+If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization would be included in the measurement.
+*Attributes*: grpc.method, grpc.target, grpc.status
+*Type*: Histogram (Latency Buckets)
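The start and end timestamps described for `grpc.client.call.duration` could be turned into a float64 seconds value roughly as follows. This is only a sketch: `CallTimer` is a hypothetical helper, not a gRPC type, and the choice of `std::chrono::steady_clock` is an assumption of this example.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of gRPC): captures a start timestamp and
// reports the elapsed time as a double in seconds, the unit proposed above.
class CallTimer {
 public:
  using Clock = std::chrono::steady_clock;

  // Start timestamp: taken when the timer is created, i.e. after the
  // application starts the RPC.
  CallTimer() : start_(Clock::now()) {}

  // Explicit start, useful for deterministic tests.
  explicit CallTimer(Clock::time_point start) : start_(start) {}

  // End timestamp supplied by the caller, e.g. just before the status is
  // delivered to the application.
  double ElapsedSeconds(Clock::time_point end) const {
    assert(end >= start_);
    return std::chrono::duration<double>(end - start_).count();
  }

  double ElapsedSeconds() const { return ElapsedSeconds(Clock::now()); }

 private:
  Clock::time_point start_;
};
```

A monotonic clock is used here so that wall-clock adjustments cannot produce negative durations.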
+ +### Server Instruments + +* **grpc.server.call.started**
+The total number of RPCs started, including those that have not completed.
+*Attributes*: grpc.method, grpc.authority
+*Type*: Counter
+*Unit*: {call}
+* **grpc.server.call.sent_total_compressed_message_size**
+Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
+*Attributes*: grpc.method, grpc.authority, grpc.status
+*Type*: Histogram (Size Buckets)
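+* **grpc.server.call.rcvd_total_compressed_message_size**
+Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
+*Attributes*: grpc.method, grpc.authority, grpc.status
+*Type*: Histogram (Size Buckets)
+* **grpc.server.call.duration**
+This metric aims to measure the end-to-end time an RPC takes from the server transport’s (HTTP2 / inproc / cronet) perspective.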
+Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
+End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
+*Attributes*: grpc.method, grpc.authority, grpc.status
+*Type*: Histogram (Latency Buckets)
+
+## Language Specifics
+
+Each language implementation will provide an API for registering an OpenTelemetry plugin. Overall, the APIs should have the following capabilities -
+* Allow installing multiple OpenTelemetry plugins.
+* Allow setting a [MeterProvider](https://opentelemetry.io/docs/specs/otel/metrics/api/#meterprovider) on individual plugins.
+* Optionally allow enabling/disabling metrics. This would allow optimizations to avoid computation and collection of expensive stats within the gRPC library. Note that even without this capability, users of OpenTelemetry would be able to customize the [views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) through the MeterProvider.
+* Optionally allow setting of a OpenTelemetry plugin for a specific channel or server, instead of setting it globally.
+
+Note that implementations of the gRPC OpenTelemetry plugin should take care to only depend on the OpenTelemetry API and not the OpenTelemetry SDK.
+
+### C++
+
+```C++
+class OpenTelemetryPluginBuilder {
+ public:
+  // Enables base set of metrics by default
+  OpenTelemetryPluginBuilder() = default;
+  // If `SetMeterProvider` is called, \a meter_provider will be used instead of the global default.
+  OpenTelemetryPluginBuilder& SetMeterProvider(std::shared_ptr<opentelemetry::metrics::MeterProvider> meter_provider);
+  // Enable metric \a metric_name.
+  OpenTelemetryPluginBuilder& EnableMetric(absl::string_view metric_name);
+  // Disable metric \a metric_name
+  OpenTelemetryPluginBuilder& DisableMetric(absl::string_view metric_name);
+  // Builds and registers an OpenTelemetry Plugin
+  void BuildAndRegisterGlobal();
+};
+
+```
+In the future, additional API might be provided to allow registering the plugin for a particular channel or server builder.
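To illustrate how the chained builder calls proposed above might read at a call site, here is a self-contained mock. It is not the gRPC implementation: `MockOpenTelemetryPluginBuilder`, `StubMeterProvider`, `IsEnabled` and `HasMeterProvider` are hypothetical names, `std::string_view` stands in for `absl::string_view`, and the contents of the "base set" (the two `started` counters) are an assumption made only for this sketch.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for an OpenTelemetry MeterProvider.
struct StubMeterProvider {};

// Mock that mirrors the method names of the proposed builder and only tracks
// which metric names are enabled. It registers nothing with gRPC.
class MockOpenTelemetryPluginBuilder {
 public:
  // Assumption of this sketch: the base set is the two "started" counters.
  MockOpenTelemetryPluginBuilder()
      : enabled_{"grpc.client.attempt.started", "grpc.server.call.started"} {}

  MockOpenTelemetryPluginBuilder& SetMeterProvider(
      std::shared_ptr<StubMeterProvider> meter_provider) {
    assert(meter_provider != nullptr);
    meter_provider_ = std::move(meter_provider);
    return *this;
  }

  MockOpenTelemetryPluginBuilder& EnableMetric(std::string_view metric_name) {
    enabled_.insert(std::string(metric_name));
    return *this;
  }

  MockOpenTelemetryPluginBuilder& DisableMetric(std::string_view metric_name) {
    enabled_.erase(std::string(metric_name));
    return *this;
  }

  // Inspection helpers that exist only in this mock.
  bool IsEnabled(std::string_view metric_name) const {
    return enabled_.count(std::string(metric_name)) > 0;
  }
  bool HasMeterProvider() const { return meter_provider_ != nullptr; }

 private:
  std::shared_ptr<StubMeterProvider> meter_provider_;
  std::set<std::string> enabled_;
};
```

Returning a reference to the builder from each setter is what allows calls such as `builder.SetMeterProvider(p).EnableMetric("grpc.client.call.duration")` to be chained.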
+
+### Java
+
+To be filled
+
+### Go
+
+To be filled
+
+### Python
+
+To be filled
+
+## Migration from OpenCensus
+
+The following sections show the differences between the gRPC OpenCensus spec and the proposed gRPC OpenTelemetry spec and the mapping of metrics between the two. It also presents metrics present in OpenCensus spec that do not have a mapping in the OpenTelemetry spec at present.
+Following the mapping, two migration strategies are proposed for customers who are satisfied with the coverage provided by the current OpenTelemetry spec.
+
+### Differences from gRPC OpenCensus Spec
+
+* OpenTelemetry instrument names don’t allow ‘/’ so we use ‘.’ as the separator. We also get rid of the “.io” suffix in “grpc.io” as it doesn’t seem to add any value and is consistent with other names in the metrics spec from OpenTelemetry.
+* We also use this opportunity to resolve ambiguities from the gRPC OpenCensus spec (detailed below).
+* OpenTelemetry has attributes similar to tags in OpenCensus, and the OpenCensus tag names already seem to match the OpenTelemetry spec - except for ‘_’ vs ‘.’ for namespaces. So we just replace the ‘_’ with ‘.’. Note that the 'client' and 'server' distinction has also been removed since it does not add any benefit.
+  * grpc_client_method -> grpc.method
+  * grpc_client_status -> grpc.status
+  * grpc_server_method -> grpc.method
+  * grpc_server_status -> grpc.status
+* Two new attributes have been added.
+  * grpc.target - Added on client metrics
+  * grpc.authority - Added on server metrics
+* Latency metrics in the OpenTelemetry spec use the recommended `s` unit instead of `ms`.
+
+### Metrics with Corresponding Equivalent
+
+The following OpenCensus metrics have an equivalent in the OpenTelemetry spec (with the above noted differences) allowing for receivers of the telemetry data to join the views from the two metrics for continuity.
+
+| gRPC OpenCensus | gRPC OpenTelemetry |
+| ----------------|--------------------|
+| grpc.io/client/started_rpcs | grpc.client.attempt.started |
+| grpc.io/client/completed_rpcs | (Derivable from grpc.client.attempt.duration) |
+| grpc.io/client/roundtrip_latency | grpc.client.attempt.duration |
+| grpc.io/server/started_rpcs | grpc.server.call.started |
+| grpc.io/server/completed_rpcs | (Derivable from grpc.server.call.duration) |
+| grpc.io/server/server_latency | grpc.server.call.duration |
+
+### Metrics with Nuanced Differences
+
+Unfortunately, the implementations of the gRPC OpenCensus spec in the various languages do not agree on the definition of the following size metrics. Go records uncompressed message bytes for the OpenCensus metric, while C++ and Java record the compressed message bytes. The OpenTelemetry spec proposed here calls for recording the compressed message bytes, resulting in an equivalence between the metrics definitions for C++ and Java, but not for Go.
+
+| gRPC OpenCensus | gRPC OpenTelemetry |
+| ----------------|--------------------|
+| grpc.io/client/sent_bytes_per_rpc | grpc.client.attempt.sent_total_compressed_message_size |
+| grpc.io/client/received_bytes_per_rpc | grpc.client.attempt.rcvd_total_compressed_message_size |
+| grpc.io/server/sent_bytes_per_rpc | grpc.server.call.sent_total_compressed_message_size |
+| grpc.io/server/received_bytes_per_rpc | grpc.server.call.rcvd_total_compressed_message_size |
+
+### Additional gRPC OpenCensus Metrics not supported in first iteration
+
+There are some additional metrics defined in the gRPC OpenCensus spec and retry stats which we will not be supporting in the first iteration of the OpenTelemetry plugin. Some of these will eventually be accepted into the OpenTelemetry spec with the appropriate changes.
+* Client Views
+  * grpc.io/client/sent_messages_per_rpc
+  * grpc.io/client/received_messages_per_rpc
+  * grpc.io/client/server_latency
+  * grpc.io/client/sent_messages_per_method
+  * grpc.io/client/received_messages_per_method
+  * grpc.io/client/sent_bytes_per_method
+  * grpc.io/client/received_bytes_per_method
+* Server Views
+  * grpc.io/server/sent_messages_per_rpc
+  * grpc.io/server/received_messages_per_rpc
+  * grpc.io/server/sent_messages_per_method
+  * grpc.io/server/received_messages_per_method
+  * grpc.io/server/sent_bytes_per_method
+  * grpc.io/server/received_bytes_per_method
+* Retry Views
+  * grpc.io/client/retries_per_call
+  * grpc.io/client/retries
+  * grpc.io/client/transparent_retries_per_call
+  * grpc.io/client/transparent_retries
+  * grpc.io/client/retry_delay_per_call
+
+### Migration Strategy 1
+
+* Update telemetry dashboards and alerts to join the results from the OpenCensus metrics and the OpenTelemetry metrics.
+* Roll out changes to client and server binaries to register the OpenTelemetry plugin instead of the OpenCensus plugin.
+* After 100% rollout and some duration (to maintain previous history), update telemetry dashboards and alerts to not query OpenCensus metrics.
+
+### Migration Strategy 2 - Supporting multiple stats plugins
+
+For this strategy, gRPC stacks need to support registration of both the OpenCensus and the OpenTelemetry plugins at the same time and allow both metrics to be exported. This allows users to experiment with OpenTelemetry before disabling the OpenCensus plugin.
+
+* Both plugins are registered to gRPC, resulting in both metrics being exported. (Note the cost of reporting stats from two plugins at the same time.)
+* Separate dashboards and alerts are created for the OpenTelemetry metrics. (No join is needed anymore.)
+* Remove registration of OpenCensus plugin when monitoring from OpenTelemetry plugin is deemed satisfactory.
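When joining OpenCensus and OpenTelemetry views as in Migration Strategy 1, a dashboard or receiver has to account for the tag renames and the `ms` to `s` unit change listed under "Differences from gRPC OpenCensus Spec". A minimal sketch of such a normalization step is shown here; `ToOtelAttribute` and `MillisToSeconds` are hypothetical helper names, and passing unknown tags through unchanged is an assumption of this example.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Maps an OpenCensus tag name to the OpenTelemetry attribute name proposed in
// this document. Tags without a listed rename are returned unchanged.
std::string ToOtelAttribute(std::string_view oc_tag) {
  if (oc_tag == "grpc_client_method" || oc_tag == "grpc_server_method") {
    return "grpc.method";
  }
  if (oc_tag == "grpc_client_status" || oc_tag == "grpc_server_status") {
    return "grpc.status";
  }
  return std::string(oc_tag);
}

// Converts an OpenCensus latency in milliseconds to seconds, so it can be
// compared with the OpenTelemetry latency instruments.
double MillisToSeconds(double millis) {
  assert(millis >= 0.0);
  return millis / 1000.0;
}
```

The `grpc.target` and `grpc.authority` attributes have no OpenCensus counterpart, so a join would need to aggregate them away on the OpenTelemetry side.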
+
+## Rationale
+
+OpenCensus is no longer being actively maintained and is being deprecated, with OpenTelemetry suggested as the successor framework. The OpenTelemetry spec aims to maintain compatibility with the gRPC OpenCensus spec wherever reasonable to allow for an easy migration path.
+
+## Implementation
+
+Implementations for the OpenTelemetry plugin are currently planned for C++, Java, Go and Python.
+
+* C++ - A basic stats functionality for OpenTelemetry (though still internal) has been implemented in https://github.com/grpc/grpc/pull/33650. This would be expanded on and moved to experimental status once the API is approved and implemented. Note that this PR only added bazel support for the plugin. CMake support will also be added shortly.
+* Java - TBD but assumed to be implemented by @DNVindhya.
+* Go - TBD but assumed to be implemented by @zasweq.
+* Python - TBD but assumed to be implemented by @XuanWang-Amos.

From 7dd3b2a3a3bc930fd60a395ad5292dc09ce70bb5 Mon Sep 17 00:00:00 2001
From: Yash Tibrewal
Date: Fri, 21 Jul 2023 10:27:33 +0000
Subject: [PATCH 02/30] mdformat

---
 A66-otel-stats.md | 344 ++++++++++++++++++++++++++++------------------
 1 file changed, 209 insertions(+), 135 deletions(-)

diff --git a/A66-otel-stats.md b/A66-otel-stats.md
index dd16ff9f4..9ea52d78f 100644
--- a/A66-otel-stats.md
+++ b/A66-otel-stats.md
@@ -1,103 +1,133 @@
-OpenTelemetry Metrics
-----
-* Author: Yash Tibrewal (@yashykt)
-* Approver: Mark Roth (@markdroth)
-* Status: Draft
-* Implemented in:
-* Last updated: Jul 20, 2023
-* Discussion at: (filled after thread exists)
+# OpenTelemetry Metrics
+
+* Author: Yash Tibrewal (@yashykt)
+* Approver: Mark Roth (@markdroth)
+* Status: Draft
+* Implemented in:
+* Last updated: Jul 20, 2023
+* Discussion at: (filled after thread exists)
 
 ## Abstract
 
-This doc proposes a data model for gRPC OpenTelemetry metrics.
+Propose a metrics data model for gRPC OpenTelemetry metrics.
## Background -There are a collection of [metrics](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md) proposed by OpenCensus for gRPC. OpenCensus is no longer being actively maintained and is being deprecated, with OpenTelemetry suggested as the successor framework. +There are a collection of +[metrics](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md) +proposed by OpenCensus for gRPC. OpenCensus is no longer being actively +maintained and is being deprecated, with OpenTelemetry suggested as the +successor framework. + +### Related Proposals: -### Related Proposals: -* [gRPC Retry Stats](A45-retry-stats.md) +* [gRPC Retry Stats](A45-retry-stats.md) ## Proposal ### Units -Following the [OpenTelemetry Metrics Semantic Conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/), the following units are used - -* Latencies are measured in float64 seconds, `s` -* Sizes are measured in bytes, `By` -* Counts for number of calls are measured in `{call}` -* Counts for number of attempts are measured in `{attempt}` +Following the +[OpenTelemetry Metrics Semantic Conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/), +the following units are used - + +* Latencies are measured in float64 seconds, `s` +* Sizes are measured in bytes, `By` +* Counts for number of calls are measured in `{call}` +* Counts for number of attempts are measured in `{attempt}` Buckets for histograms in default views should be as follows - -* Latency : 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, 0.8, 1, 2, 5, 10, 20, 50, 100 -* Size : 0, 1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216, 67108864, 268435456, 1073741824, 4294967296 -* Count : 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 
4096, 8192, 16384, 32768, 65536 + +* Latency : 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, + 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, + 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, + 0.8, 1, 2, 5, 10, 20, 50, 100 +* Size : 0, 1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, + 16777216, 67108864, 268435456, 1073741824, 4294967296 +* Count : 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, + 16384, 32768, 65536 ### Attributes -* `grpc.method` : Full gRPC method name, including package, service and method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow" -* `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", "DEADLINE_EXCEEDED" -* `grpc.target` : Target URI used when creating gRPC Channel, e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000" -* `grpc.authority` : Authority used by the call/attempt, e.g. "pubsub.googleapis.com", "helloworld-gke" + +* `grpc.method` : Full gRPC method name, including package, service and + method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow" +* `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", + "DEADLINE_EXCEEDED" +* `grpc.target` : Target URI used when creating gRPC Channel, e.g. + "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000" +* `grpc.authority` : Authority used by the call/attempt, e.g. + "pubsub.googleapis.com", "helloworld-gke" ### Client Per-Attempt Instruments -* **grpc.client.attempt.started**
-The total number of RPC attempts started, including those that have not completed.
-*Attributes*: grpc.method, grpc.target
-*Type*: Counter
-*Unit*: {attempt}
-* **grpc.client.attempt.duration**
-End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel.
-*Attributes*: grpc.method, grpc.target, grpc.status
-*Type*: Histogram (Latency Buckets)
-* **grpc.client.attempt.sent_total_compressed_message_size**
-Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
-*Attributes*: grpc.method, grpc.target, grpc.status
-*Type*: Histogram (Size Buckets)
-* **grpc.client.attempt.rcvd_total_compressed_message_size**
-Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
+* **grpc.client.attempt.started**
+ The total number of RPC attempts started, including those that have not completed.
+ *Attributes*: grpc.method, grpc.target
+ *Type*: Counter
+ *Unit*: {attempt}
+* **grpc.client.attempt.duration**
+ End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Latency Buckets)
+* **grpc.client.attempt.sent_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Size Buckets)
+* **grpc.client.attempt.rcvd_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
### Client Per-Call Instruments -* **grpc.client.call.duration**
-This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective.
-Start timestamp - After the client application starts the RPC.
-End timestamp - Before the status of the RPC is delivered to the application.
-If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization would be included in the measurement.
-*Attributes*: grpc.method, grpc.target, grpc.status
-*Type*: Histogram (Latency Buckets)
+* **grpc.client.call.duration**
+ This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective.
+ Start timestamp - After the client application starts the RPC.
+ End timestamp - Before the status of the RPC is delivered to the application.
+ If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization would be included in the measurement.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Latency Buckets)
### Server Instruments -* **grpc.server.call.started**
-The total number of RPCs started, including those that have not completed.
-*Attributes*: grpc.method, grpc.authority
-*Type*: Counter
-*Unit*: {call}
-* **grpc.server.call.sent_total_compressed_message_size**
-Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
-*Attributes*: grpc.method, grpc.authority, grpc.status
-*Type*: Histogram (Size Buckets)
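-* **grpc.server.call.rcvd_total_compressed_message_size**
-Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
-*Attributes*: grpc.method, grpc.authority, grpc.status
-*Type*: Histogram (Size Buckets)
-* **grpc.server.call.duration**
-This metric aims to measure the end-to-end time an RPC takes from the server transport’s (HTTP2 / inproc / cronet) perspective.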
-Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
-End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
-*Attributes*: grpc.method, grpc.authority, grpc.status -*Type*: Histogram (Latency Buckets) +* **grpc.server.call.started**
+ The total number of RPCs started, including those that have not completed.
+ *Attributes*: grpc.method, grpc.authority
+ *Type*: Counter
+ *Unit*: {call}
+* **grpc.server.call.sent_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.authority, grpc.status
+ *Type*: Histogram (Size Buckets)
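+* **grpc.server.call.rcvd_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.authority, grpc.status
+ *Type*: Histogram (Size Buckets)
+* **grpc.server.call.duration**
+ This metric aims to measure the end-to-end time an RPC takes from the server transport’s (HTTP2 / inproc / cronet) perspective.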
+ Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
+ End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
+ *Attributes*: grpc.method, grpc.authority, grpc.status + *Type*: Histogram (Latency Buckets) ## Language Specifics -Each language implementation will provide an API for registering an OpenTelemetry plugin. Overall, the APIs should have the following capabilities - -* Allow installing multiple OpenTelemetry plugins. -* Allow setting a [MeterProvider](https://opentelemetry.io/docs/specs/otel/metrics/api/#meterprovider) on individual plugins. -* Optionally allow enabling/disabling metrics. This would allow optimizations to avoid computation and collection of expensive stats within the gRPC library. Note that even without this capability, users of OpenTelemetry would be able to customize the [views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) through the MeterProvider. -* Optionally allow setting of a OpenTelemetry plugin for a specific channel or server, instead of setting it globally. - -Note that implementations of the gRPC OpenTelemetry plugin should take care to only depend on the OpenTelemetry API and not the OpenTelemetry SDK. +Each language implementation will provide an API for registering an +OpenTelemetry plugin. Overall, the APIs should have the following capabilities - + +* Allow installing multiple OpenTelemetry plugins. +* Allow setting a + [MeterProvider](https://opentelemetry.io/docs/specs/otel/metrics/api/#meterprovider) + on individual plugins. +* Optionally allow enabling/disabling metrics. This would allow optimizations + to avoid computation and collection of expensive stats within the gRPC + library. Note that even without this capability, users of OpenTelemetry + would be able to customize the + [views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) through + the MeterProvider. +* Optionally allow setting of a OpenTelemetry plugin for a specific channel or + server, instead of setting it globally. 
+
+Note that implementations of the gRPC OpenTelemetry plugin should take care to
+only depend on the OpenTelemetry API and not the OpenTelemetry SDK.
 
 ### C++
 
-```C++
+```c++
 class OpenTelemetryPluginBuilder {
  public:
   // Enables base set of metrics by default
@@ -113,7 +143,9 @@ class OpenTelemetryPluginBuilder {
 };
 
 ```
-In the future, additional API might be provided to allow registering the plugin for a particular channel or server builder.
+
+In the future, additional API might be provided to allow registering the plugin
+for a particular channel or server builder.
 
 ### Java
 
@@ -129,95 +161,137 @@ To be filled
 
 ## Migration from OpenCensus
 
-The following sections show the differences between the gRPC OpenCensus spec and the proposed gRPC OpenTelemetry spec and the mapping of metrics between the two. It also presents metrics present in OpenCensus spec that do not have a mapping in the OpenTelemetry spec at present.
-Following the mapping, two migration strategies are proposed for customers who are satisfied with the coverage provided by the current OpenTelemetry spec.
+The following sections show the differences between the gRPC OpenCensus spec and
+the proposed gRPC OpenTelemetry spec and the mapping of metrics between the two.
+It also presents metrics present in OpenCensus spec that do not have a mapping
+in the OpenTelemetry spec at present. Following the mapping, two migration
+strategies are proposed for customers who are satisfied with the coverage
+provided by the current OpenTelemetry spec.
 
 ### Differences from gRPC OpenCensus Spec
 
-* OpenTelemetry instrument names don’t allow ‘/’ so we use ‘.’ as the separator. We also get rid of the “.io” suffix in “grpc.io” as it doesn’t seem to add any value and is consistent with other names in the metrics spec from OpenTelemetry.
-* We also use this opportunity to resolve ambiguities from the gRPC OpenCensus spec (detailed below).
-* OpenTelemetry has attributes similar to tags in OpenCensus, and the OpenCensus tag names already seem to match the OpenTelemetry spec - except for ‘_’ vs ‘.’ for namespaces. So we just replace the ‘_’ with ‘.’. Note that the 'client' and 'server' distinction has also been removed since it does not add any benefit. - * grpc_client_method -> grpc.method - * grpc_client_status -> grpc.status - * grpc_server_method -> grpc.method - * grpc_server_status -> grpc.status -* Two new attributes have been added. - * grpc.target - Added on client metrics - * grpc.authority - Added on server metrics -* Latency metrics in the OpenTelemetry spec use the recommended `s` unit instead of `ms`. +* OpenTelemetry instrument names don’t allow ‘/’ so we use ‘.’ as the + separator. We also get rid of the “.io” suffix in “grpc.io” as it doesn’t + seem to add any value and is consistent with other names in the metrics spec + from OpenTelemetry. +* We also use this opportunity to resolve ambiguities from the gRPC OpenCensus + spec (detailed below). +* OpenTelemetry has attributes similar to tags in OpenCensus, and the + OpenCensus tag names already seem to match the OpenTelemetry spec - except + for ‘_’ vs ‘.’ for namespaces. So we just replace the ‘_’ with ‘.’. Note + that the 'client' and 'server' distinction has also been removed since it + does not add any benefit. + * grpc_client_method -> grpc.method + * grpc_client_status -> grpc.status + * grpc_server_method -> grpc.method + * grpc_server_status -> grpc.status +* Two new attributes have been added. + * grpc.target - Added on client metrics + * grpc.authority - Added on server metrics +* Latency metrics in the OpenTelemetry spec use the recommended `s` unit + instead of `ms`. ### Metrics with Corresponding Equivalent -The following OpenCensus metrics have an equivalent in the OpenTelemetry spec (with the above noted differences) allowing for receivers of the telemetry data to join the views from the two metrics for continuity. 
+The following OpenCensus metrics have an equivalent in the OpenTelemetry spec +(with the above noted differences) allowing for receivers of the telemetry data +to join the views from the two metrics for continuity. -| gRPC OpenCensus | gRPC OpenTelemetry | -| ----------------|--------------------| -| grpc.io/client/started_rpcs | grpc.client.attempt.started | -| grpc.io/client/completed_rpcs | (Derivable from grpc.client.attempt.duration) | -| grpc.io/client/roundtrip_latency | grpc.client.attempt.duration | -| grpc.io/server/started_rpcs | grpc.server.call.started | -| grpc.io/server/completed_rpcs | (Derivable from grpc.server.call.duration) | -| grpc.io/server/server_latency | grpc.server.call.duration | +gRPC OpenCensus | gRPC OpenTelemetry +-------------------------------- | --------------------------------------------- +grpc.io/client/started_rpcs | grpc.client.attempt.started +grpc.io/client/completed_rpcs | (Derivable from grpc.client.attempt.duration) +grpc.io/client/roundtrip_latency | grpc.client.attempt.duration +grpc.io/server/started_rpcs | grpc.server.call.started +grpc.io/server/completed_rpcs | (Derivable from grpc.server.call.duration) +grpc.io/server/server_latency | grpc.server.call.duration ### Metrics with Nuanced Differences -Unfortunately, the implementations of the gRPC OpenCensus spec in the various languages do not agree on the definition of the following size metrics. Go records uncompressed message bytes for the OpenCensus metric, while C++ and Java record the compressed message bytes. The OpenTelemetry spec proposed here calls for recording the compressed message bytes, resulting in an equivalence between the metrics definitions for C++ and Java, but not for Go. +Unfortunately, the implementations of the gRPC OpenCensus spec in the various +languages do not agree on the definition of the following size metrics. Go +records uncompressed message bytes for the OpenCensus metric, while C++ and Java +record the compressed message bytes. 
The OpenTelemetry spec proposed here calls +for recording the compressed message bytes, resulting in an equivalence between +the metrics definitions for C++ and Java, but not for Go. -| gRPC OpenCensus | gRPC OpenTelemetry | -| ----------------|--------------------| -| grpc.io/client/sent_bytes_per_rpc | grpc.client.attempt.sent_total_compressed_message_size | -| grpc.io/client/received_bytes_per_rpc | grpc.client.attempt.rcvd_total_compressed_message_size | -| grpc.io/server/sent_bytes_per_rpc | grpc.server.call.sent_total_compressed_message_size | -| grpc.io/server/received_bytes_per_rpc | grpc.server.call.rcvd_total_compressed_message_size | +gRPC OpenCensus | gRPC OpenTelemetry +------------------------------------- | ------------------ +grpc.io/client/sent_bytes_per_rpc | grpc.client.attempt.sent_total_compressed_message_size +grpc.io/client/received_bytes_per_rpc | grpc.client.attempt.rcvd_total_compressed_message_size +grpc.io/server/sent_bytes_per_rpc | grpc.server.call.sent_total_compressed_message_size +grpc.io/server/received_bytes_per_rpc | grpc.server.call.rcvd_total_compressed_message_size ### Additional gRPC OpenCensus Metrics not supported in first iteration -There are some additional metrics defined in the gRPC OpenCensus spec and retry stats which we will not be supporting in the first iteration of the OpenTelemetry plugin. Some of these will eventually be accepted into the OpenTelemetry spec with the appropriate changes. 
-* Client Views - * grpc.io/client/sent_messages_per_rpc - * grpc.io/client/received_messages_per_rpc - * grpc.io/client/server_latency - * grpc.io/client/sent_messages_per_method - * grpc.io/client/received_messages_per_method - * grpc.io/client/sent_bytes_per_method - * grpc.io/client/received_bytes_per_method -* Server Views - * grpc.io/server/sent_messages_per_rpc - * grpc.io/server/received_messages_per_rpc - * grpc.io/server/sent_messages_per_method - * grpc.io/server/received_messages_per_method - * grpc.io/server/sent_bytes_per_method - * grpc.io/server/received_bytes_per_method -* Retry Views - * grpc.io/client/retries_per_call - * grpc.io/client/retries - * grpc.io/client/transparent_retries_per_call - * grpc.io/client/transparent_retries - * grpc.io/client/retry_delay_per_call +There are some additional metrics defined in the gRPC OpenCensus spec and retry +stats which we will not be supporting in the first iteration of the +OpenTelemetry plugin. Some of these will eventually be accepted into the +OpenTelemetry spec with the appropriate changes. 
+ +* Client Views + * grpc.io/client/sent_messages_per_rpc + * grpc.io/client/received_messages_per_rpc + * grpc.io/client/server_latency + * grpc.io/client/sent_messages_per_method + * grpc.io/client/received_messages_per_method + * grpc.io/client/sent_bytes_per_method + * grpc.io/client/received_bytes_per_method +* Server Views + * grpc.io/server/sent_messages_per_rpc + * grpc.io/server/received_messages_per_rpc + * grpc.io/server/sent_messages_per_method + * grpc.io/server/received_messages_per_method + * grpc.io/server/sent_bytes_per_method + * grpc.io/server/received_bytes_per_method +* Retry Views + * grpc.io/client/retries_per_call + * grpc.io/client/retries + * grpc.io/client/transparent_retries_per_call + * grpc.io/client/transparent_retries + * grpc.io/client/retry_delay_per_call ### Migration Strategy 1 -* Update telemetry dashboards and alerts to join the results from the OpenCensus metrics and the OpenTelemetry metrics. -* Roll out changes to client and server binaries to register the OpenTelemetry plugin instead of the OpenCensus plugin. -* After 100% rollout and some duration (to maintain previous history), update telemetry dashboards and alerts to not query OpenCensus metrics. +* Update telemetry dashboards and alerts to join the results from the + OpenCensus metrics and the OpenTelemetry metrics. +* Roll out changes to client and server binaries to register the OpenTelemetry + plugin instead of the OpenCensus plugin. +* After 100% rollout and some duration (to maintain previous history), update + telemetry dashboards and alerts to not query OpenCensus metrics. ### Migration Strategy 2 - Supporting multiple stats plugins -For this strategy, gRPC stacks need to support registration of both the OpenCensus and the OpenTelemetry plugins at the same time and allow both metrics to be exported. This allows users to experiment with OpenTelemetry before disabling the OpenCensus plugin. 
+For this strategy, gRPC stacks need to support registration of both the +OpenCensus and the OpenTelemetry plugins at the same time and allow both metrics +to be exported. This allows users to experiment with OpenTelemetry before +disabling the OpenCensus plugin. -* Both plugins are registered to gRPC, resulting in both metrics being exported. (Note the cost of reporting stats from two plugins at the same time.) -* Separate dashboards and alerts are created for the OpenTelemetry metrics. (No join is needed anymore.) -* Remove registration of OpenCensus plugin when monitoring from OpenTelemetry plugin is deemed satisfactory. +* Both plugins are registered to gRPC, resulting in both metrics being + exported. (Note the cost of reporting stats from two plugins at the same + time.) +* Separate dashboards and alerts are created for the OpenTelemetry metrics. + (No join is needed anymore.) +* Remove registration of OpenCensus plugin when monitoring from OpenTelemetry + plugin is deemed satisfactory. ## Rationale -OpenCensus is no longer being actively maintained and is being deprecated, with OpenTelemetry suggested as the successor framework. The OpenTelemetry spec aim to maintain compatibility with the gRPC OpenCensus spec wherever reasonable to allow for an easy migration path. +OpenCensus is no longer being actively maintained and is being deprecated, with +OpenTelemetry suggested as the successor framework. The OpenTelemetry spec aim +to maintain compatibility with the gRPC OpenCensus spec wherever reasonable to +allow for an easy migration path. ## Implementation -Implementations for the OpenTelemetry plugin are currently planned for C++, Java, Go and Python. - -* C++ - A basic stats functionality for OpenTelemetry (though still internal) has been implemented in https://github.com/grpc/grpc/pull/33650. This would be expanded on and moved to experimental status once the API is approved and implemented. Note that this PR only added bazel support for the plugin. 
CMake support will also be added shortly. -* Java - TBD but assumed to be implemented by @DNVindhya. -* Go - TBD but assumed to be implemented by @zasweq. -* Python - TBD but assumed to be implemented by @XuanWang-Amos. +Implementations for the OpenTelemetry plugin are currently planned for C++, +Java, Go and Python. + +* C++ - A basic stats functionality for OpenTelemetry (though still internal) + has been implemented in https://github.com/grpc/grpc/pull/33650. This would + be expanded on and moved to experimental status once the API is approved and + implemented. Note that this PR only added bazel support for the plugin. + CMake support will also be added shortly. +* Java - TBD but assumed to be implemented by @DNVindhya. +* Go - TBD but assumed to be implemented by @zasweq. +* Python - TBD but assumed to be implemented by @XuanWang-Amos. From 6f67debdf037215b668d92b26cd9235a9254f32a Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 21 Jul 2023 10:31:14 +0000 Subject: [PATCH 03/30] Add link --- A66-otel-stats.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 9ea52d78f..fffec2b1a 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -16,7 +16,7 @@ Propose a metrics data model for gRPC OpenTelemetry metrics. There are a collection of [metrics](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md) proposed by OpenCensus for gRPC. OpenCensus is no longer being actively -maintained and is being deprecated, with OpenTelemetry suggested as the +maintained and is being [deprecated](https://opentelemetry.io/blog/2023/sunsetting-opencensus/#:~:text=Compatibility%20specification%204.-,What%20to%20Expect%20After%20July%2031st%2C%202023,found%20will%20not%20be%20patched.), with OpenTelemetry suggested as the successor framework. 
### Related Proposals: From 7730c4e63074aff953d1385657c790e4a4614a9e Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 21 Jul 2023 10:42:33 +0000 Subject: [PATCH 04/30] Add discussion thread --- A66-otel-stats.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index fffec2b1a..5c8f5d822 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -2,10 +2,10 @@ * Author: Yash Tibrewal (@yashykt) * Approver: Mark Roth (@markdroth) -* Status: Draft +* Status: In Review * Implemented in: * Last updated: Jul 20, 2023 -* Discussion at: (filled after thread exists) +* Discussion at: https://groups.google.com/g/grpc-io/c/po-deqYEQzE ## Abstract @@ -16,8 +16,9 @@ Propose a metrics data model for gRPC OpenTelemetry metrics. There are a collection of [metrics](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md) proposed by OpenCensus for gRPC. OpenCensus is no longer being actively -maintained and is being [deprecated](https://opentelemetry.io/blog/2023/sunsetting-opencensus/#:~:text=Compatibility%20specification%204.-,What%20to%20Expect%20After%20July%2031st%2C%202023,found%20will%20not%20be%20patched.), with OpenTelemetry suggested as the -successor framework. +maintained and is being +[deprecated](https://opentelemetry.io/blog/2023/sunsetting-opencensus/#:~:text=Compatibility%20specification%204.-,What%20to%20Expect%20After%20July%2031st%2C%202023,found%20will%20not%20be%20patched.), +with OpenTelemetry suggested as the successor framework. 
### Related Proposals: From dccfec3416aa4eff896e0b56abd53c667d663bca Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 21 Jul 2023 10:55:18 +0000 Subject: [PATCH 05/30] Fixes --- A66-otel-stats.md | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 5c8f5d822..c2a3e1af6 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -76,6 +76,8 @@ Buckets for histograms in default views should be as follows - Type: Histogram (Size Buckets)
* **grpc.client.attempt.rcvd_total_compressed_message_size**
Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Size Buckets)
### Client Per-Call Instruments @@ -83,7 +85,7 @@ Buckets for histograms in default views should be as follows - This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective.
Start timestamp - After the client application starts the RPC.
End timestamp - Before the status of the RPC is delivered to the application.
- If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization would be included in the measurement.
+ If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization should be included in the measurement.
*Attributes*: grpc.method, grpc.target, grpc.status
*Type*: Histogram (Latency Buckets)
@@ -99,11 +101,15 @@ Buckets for histograms in default views should be as follows - *Attributes*: grpc.method, grpc.authority, grpc.status
*Type*: Histogram (Size Buckets)
* **grpc.server.call.rcvd_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.authority, grpc.status
+ *Type*: Histogram (Size Buckets)
+* **grpc.server.call.duration**
This metric aims to measure the end2end time an RPC takes from the server transport’s (HTTP2/ inproc / cronet) perspective.
Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
- *Attributes*: grpc.method, grpc.authority, grpc.status - *Type*: Histogram (Latency Buckets) + *Attributes*: grpc.method, grpc.authority, grpc.status
+ *Type*: Histogram (Latency Buckets)
## Language Specifics @@ -117,14 +123,15 @@ OpenTelemetry plugin. Overall, the APIs should have the following capabilities - * Optionally allow enabling/disabling metrics. This would allow optimizations to avoid computation and collection of expensive stats within the gRPC library. Note that even without this capability, users of OpenTelemetry - would be able to customize the + would be able to customize [views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) through the MeterProvider. * Optionally allow setting of a OpenTelemetry plugin for a specific channel or server, instead of setting it globally. -Note that implementations of the gRPC OpenTelemetry plugin should take care to -only depend on the OpenTelemetry API and not the OpenTelemetry SDK. +Note that implementations of the gRPC OpenTelemetry plugin +[should prefer](https://opentelemetry.io/docs/specs/otel/overview/) to only +depend on the OpenTelemetry API and not the OpenTelemetry SDK. ### C++ @@ -164,10 +171,10 @@ To be filled The following sections show the differences between the gRPC OpenCensus spec and the proposed gRPC OpenTelemetry spec and the mapping of metrics between the two. -It also presents metrics present in OpenCensus spec that do not have a mapping -in the OpenTelemetry spec at present. Following the mapping, two migration -strategies are proposed for customers who are satisfied with the coverage -provided by the current OpenTelemetry spec. +It also presents metrics present in OpenCensus spec that do not map to a metric +in the OpenTelemetry spec at present. Two migration strategies are also proposed +for customers who are satisfied with the stats coverage provided by the current +OpenTelemetry spec. 
### Differences from gRPC OpenCensus Spec From 58ed16386f9cf7a76ae58a729816e98fc16ec8f0 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 21 Jul 2023 20:58:32 +0000 Subject: [PATCH 06/30] Resolve some reviewer comments --- A66-otel-stats.md | 48 +++++++++++++++++++++++++++++++++-------------- 1 file changed, 34 insertions(+), 14 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index c2a3e1af6..0a49bdb76 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -22,11 +22,13 @@ with OpenTelemetry suggested as the successor framework. ### Related Proposals: -* [gRPC Retry Stats](A45-retry-stats.md) +* [A45: Exposing OpenCensus Metrics and Tracing for gRPC retry](A45-retry-stats.md) ## Proposal -### Units +### Metrics Schema + +#### Units Following the [OpenTelemetry Metrics Semantic Conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/), @@ -48,7 +50,7 @@ Buckets for histograms in default views should be as follows - * Count : 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536 -### Attributes +#### Attributes * `grpc.method` : Full gRPC method name, including package, service and method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow" @@ -59,7 +61,7 @@ Buckets for histograms in default views should be as follows - * `grpc.authority` : Authority used by the call/attempt, e.g. "pubsub.googleapis.com", "helloworld-gke" -### Client Per-Attempt Instruments +#### Client Per-Attempt Instruments * **grpc.client.attempt.started**
The total number of RPC attempts started, including those that have not completed.
@@ -79,7 +81,7 @@ Buckets for histograms in default views should be as follows - *Attributes*: grpc.method, grpc.target, grpc.status
*Type*: Histogram (Size Buckets)
-### Client Per-Call Instruments +#### Client Per-Call Instruments * **grpc.client.call.duration**
This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective.
@@ -89,7 +91,7 @@ Buckets for histograms in default views should be as follows - *Attributes*: grpc.method, grpc.target, grpc.status
*Type*: Histogram (Latency Buckets)
-### Server Instruments +#### Server Instruments * **grpc.server.call.started**
The total number of RPCs started, including those that have not completed.
@@ -111,7 +113,21 @@ Buckets for histograms in default views should be as follows - *Attributes*: grpc.method, grpc.authority, grpc.status
*Type*: Histogram (Latency Buckets)
-## Language Specifics +## OpenTelemetry Plugin Architecture + +To-be-filled + +This section describes a CallTracer approach to collect the client and server +per-attempt/call metrics. A CallTracer is a class that is instantiated for every +call. This class has various methods that are invoked during the lifetime of the +call. On the client side, the CallTracer knows about multiple attempts on the +same call. Needs to support more than one CallTracer per call. + +The OT plugin will basically be a way of configuring a CallTracer factory on +gRPC clients and servers. On the client side, the CallTracer will be created by +an interceptor; on the server side, it will be a ServerCallTracerFactory. + +## Language-Specific Details Each language implementation will provide an API for registering an OpenTelemetry plugin. Overall, the APIs should have the following capabilities - @@ -176,7 +192,9 @@ in the OpenTelemetry spec at present. Two migration strategies are also proposed for customers who are satisfied with the stats coverage provided by the current OpenTelemetry spec. -### Differences from gRPC OpenCensus Spec +### Metric Schema Comparison + +#### Differences from gRPC OpenCensus Spec * OpenTelemetry instrument names don’t allow ‘/’ so we use ‘.’ as the separator. We also get rid of the “.io” suffix in “grpc.io” as it doesn’t @@ -186,7 +204,7 @@ OpenTelemetry spec. spec (detailed below). * OpenTelemetry has attributes similar to tags in OpenCensus, and the OpenCensus tag names already seem to match the OpenTelemetry spec - except - for ‘_’ vs ‘.’ for namespaces. So we just replace the ‘_’ with ‘.’. Note + for ‘\_’ vs ‘.’ for namespaces. So we just replace the ‘\_’ with ‘.’. Note that the 'client' and 'server' distinction has also been removed since it does not add any benefit. * grpc_client_method -> grpc.method @@ -199,7 +217,7 @@ OpenTelemetry spec. * Latency metrics in the OpenTelemetry spec use the recommended `s` unit instead of `ms`. 
-### Metrics with Corresponding Equivalent +#### Metrics with Corresponding Equivalent The following OpenCensus metrics have an equivalent in the OpenTelemetry spec (with the above noted differences) allowing for receivers of the telemetry data @@ -214,7 +232,7 @@ grpc.io/server/started_rpcs | grpc.server.call.started grpc.io/server/completed_rpcs | (Derivable from grpc.server.call.duration) grpc.io/server/server_latency | grpc.server.call.duration -### Metrics with Nuanced Differences +#### Metrics with Nuanced Differences Unfortunately, the implementations of the gRPC OpenCensus spec in the various languages do not agree on the definition of the following size metrics. Go @@ -230,7 +248,7 @@ grpc.io/client/received_bytes_per_rpc | grpc.client.attempt.rcvd_total_compresse grpc.io/server/sent_bytes_per_rpc | grpc.server.call.sent_total_compressed_message_size grpc.io/server/received_bytes_per_rpc | grpc.server.call.rcvd_total_compressed_message_size -### Additional gRPC OpenCensus Metrics not supported in first iteration +#### OpenCensus Metrics not Initially Supported in OpenTelemetry There are some additional metrics defined in the gRPC OpenCensus spec and retry stats which we will not be supporting in the first iteration of the @@ -259,7 +277,9 @@ OpenTelemetry spec with the appropriate changes. * grpc.io/client/transparent_retries * grpc.io/client/retry_delay_per_call -### Migration Strategy 1 +## Migration Strategies + +### Migrate on a Per-Client Basis * Update telemetry dashboards and alerts to join the results from the OpenCensus metrics and the OpenTelemetry metrics. @@ -268,7 +288,7 @@ OpenTelemetry spec with the appropriate changes. * After 100% rollout and some duration (to maintain previous history), update telemetry dashboards and alerts to not query OpenCensus metrics. 
-### Migration Strategy 2 - Supporting multiple stats plugins +### Duplicate Metrics During Migration For this strategy, gRPC stacks need to support registration of both the OpenCensus and the OpenTelemetry plugins at the same time and allow both metrics From be41d1d5c05a9003f37861ca0233160baa2312f0 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 24 Aug 2023 22:09:34 +0000 Subject: [PATCH 07/30] More details --- A66-otel-stats.md | 79 ++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 64 insertions(+), 15 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 0a49bdb76..a75fc5f54 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -50,16 +50,36 @@ Buckets for histograms in default views should be as follows - * Count : 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536 +These buckets were chosen to maintain compatibility with the gRPC OpenCensus +spec. The OpenTelemetry API has added an experimental feature for +[advice](https://opentelemetry.io/docs/specs/otel/metrics/api/#instrument-advice) +that would allow the gRPC library to provide these buckets as a hint. Since this +is still an experimental feature and not yet implemented in all languages, it is +upto the user to choose the right bucket boundaries. + +Also note that, as per an +[OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26) +though, changes to bucket boundaries might not be considered a breaking change. +Depending on the proposal, this recommendation would change to use +`ExponentialHistogram`s instead, which would allow for automatic adjustments of +the scale to better fit the data. + #### Attributes * `grpc.method` : Full gRPC method name, including package, service and - method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow" + method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow". 
Note that some + gRPC implementations allow servers to handle generic method names, i.e., not + registering method names in advance with the server. This allows clients to + send arbitrary method names that could potentially open up the server to + malicious attacks that result in metrics being stored with a high + cardinality. To prevent this, unregistered/generic method names should by + default be reported with the value "generic" instead. Implementations can + provide the option to override this behavior to allow recording generic + method names as well. * `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", "DEADLINE_EXCEEDED" * `grpc.target` : Target URI used when creating gRPC Channel, e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000" -* `grpc.authority` : Authority used by the call/attempt, e.g. - "pubsub.googleapis.com", "helloworld-gke" #### Client Per-Attempt Instruments @@ -115,17 +135,21 @@ Buckets for histograms in default views should be as follows - ## OpenTelemetry Plugin Architecture -To-be-filled - -This section describes a CallTracer approach to collect the client and server +This section describes a `CallTracer` approach to collect the client and server per-attempt/call metrics. A CallTracer is a class that is instantiated for every call. This class has various methods that are invoked during the lifetime of the call. On the client side, the CallTracer knows about multiple attempts on the -same call. Needs to support more than one CallTracer per call. +same call, and creates a `CallAttemptTracer` object for each attempt, and the +`CallAttemptTracer` gets invoked during the lifetime of the attempt. + +The OTel plugin will basically be a way of configuring CallTracer factories on +gRPC clients and servers. -The OT plugin will basically be a way of configuring a CallTracer factory on -gRPC clients and servers. 
On the client side, the CallTracer will be created by -an interceptor; on the server side, it will be a ServerCallTracerFactory. +Implementations should allow multiple call/attempt tracers to be registered to a +single call since there could be multiple plugins registered. For example, there +could be an OpenCensus and an OpenTelemetry stats plugin registered together. It +should also allow multiple OpenTelemetry plugins to be registered providing the +ability to configure the different plugins with different MeterProviders. ## Language-Specific Details @@ -135,7 +159,11 @@ OpenTelemetry plugin. Overall, the APIs should have the following capabilities - * Allow installing multiple OpenTelemetry plugins. * Allow setting a [MeterProvider](https://opentelemetry.io/docs/specs/otel/metrics/api/#meterprovider) - on individual plugins. + on individual plugins. Implementations should require a MeterProvider being + set. A MeterProvider not being set should either not be allowed, fail + registering of the plugin or result in a no-op. Some OpenTelemetry language + APIs have a global MeterProvider. gRPC implementations should *NOT* fallback + on this global. * Optionally allow enabling/disabling metrics. This would allow optimizations to avoid computation and collection of expensive stats within the gRPC library. Note that even without this capability, users of OpenTelemetry @@ -149,19 +177,24 @@ Note that implementations of the gRPC OpenTelemetry plugin [should prefer](https://opentelemetry.io/docs/specs/otel/overview/) to only depend on the OpenTelemetry API and not the OpenTelemetry SDK. +The [Meter](https://opentelemetry.io/docs/specs/otel/metrics/api/#get-a-meter) +creation should use a `name` that identifies the library, for example, +"grpc-c++", "grpc-java", "grpc-go". The `version` should be the same as the +release version of the gRPC library, for example, "1.57.1". 
+ ### C++ ```c++ Class OpenTelemetryPluginBuilder { public: // Enables base set of metrics by default - OpenTelemetryPluginBuilder() = default; - // If `SetMeterProvider` is called, \a meter_provider will be used instead of the global default. - OpenTelemetryPluginBuilder& SetMeterProvider(shared_ptr meter_provider); + OpenTelemetryPluginBuilder(shared_ptr meter_provider) = default; // Enable metric \a metric_name. OpenTelemetryPluginBuilder& EnableMetric(absl::string_view metric_name); // Disable metric \a metric_name OpenTelemetryPluginBuilder& DisableMetric(absl::string_view metric_name); + // If set, is invoked by gRPC when a generic method type RPC is seen. \a generic_method_filter should return true if the generic method name should be recorded. Returning false results in the method name being replaced with "generic" in the recorded metrics. + OpenTelemetryPluginBuilder& SetGenericMethodFilter(absl::AnyInvocable generic_method_filter); // Builds and registers a OpenTelemetry Plugin void BuildAndRegisterGlobal(); }; @@ -306,10 +339,26 @@ disabling the OpenCensus plugin. ## Rationale OpenCensus is no longer being actively maintained and is being deprecated, with -OpenTelemetry suggested as the successor framework. The OpenTelemetry spec aim +OpenTelemetry suggested as the successor framework. The OpenTelemetry spec aims to maintain compatibility with the gRPC OpenCensus spec wherever reasonable to allow for an easy migration path. +There is a +[General RPC conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/rpc-metrics.md) +doc that is currently in `experimental` status. Given the different nuances that +each RPC system has, it seems difficult to adopt one convention that would make +sense for all systems. For gRPC specifically, the following differences are +immediately obvious - + +* gRPC differentiates between the concept of a `call` and an `attempt`. 
Each + `call` can have multiple `attempts` with retries/hedging. +* The various gRPC implementations can record the compressed message lengths, + but not all implementations can get the uncompressed message length (as + recommended by OTel RPC conventions.) + +This gRFC, hence, intends to override the [General RPC conventions] for gRPC's +purposes. + ## Implementation Implementations for the OpenTelemetry plugin are currently planned for C++, From c05f4edaca215af3cdb9ce7a0d06bad509766215 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 24 Aug 2023 22:18:26 +0000 Subject: [PATCH 08/30] Sample implementation --- A66-otel-stats.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index a75fc5f54..cab37f047 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -151,6 +151,9 @@ could be an OpenCensus and an OpenTelemetry stats plugin registered together. It should also allow multiple OpenTelemetry plugins to be registered providing the ability to configure the different plugins with different MeterProviders. +A sample implementation of this approach is available in +[gRPC Core](https://github.com/grpc/grpc/blob/v1.57.x/src/core/lib/channel/call_tracer.h). + ## Language-Specific Details Each language implementation will provide an API for registering an From f66acdf11f2b6619b875ea1121402908d1760949 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 25 Aug 2023 21:52:12 +0000 Subject: [PATCH 09/30] Canonicalized name --- A66-otel-stats.md | 49 +++++++++++++++++++++++++++++------------------ 1 file changed, 30 insertions(+), 19 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index cab37f047..6f28855b7 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -78,8 +78,10 @@ the scale to better fit the data. method names as well. * `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", "DEADLINE_EXCEEDED" -* `grpc.target` : Target URI used when creating gRPC Channel, e.g. 
- "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000" +* `grpc.target` : Canonicalized target URI used when creating gRPC Channel, + e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000". + The canonicalized form always includes the scheme, even if the user + did not specify one. #### Client Per-Attempt Instruments @@ -115,22 +117,22 @@ the scale to better fit the data. * **grpc.server.call.started**
The total number of RPCs started, including those that have not completed.
- *Attributes*: grpc.method, grpc.authority
+ *Attributes*: grpc.method
*Type*: counter
*Unit*: {call}
* **grpc.server.call.sent_total_compressed_message_size**
Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
- *Attributes*: grpc.method, grpc.authority, grpc.status
+ *Attributes*: grpc.method, grpc.status
*Type*: Histogram (Size Buckets)
* **grpc.server.call.rcvd_total_compressed_message_size**
Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
- *Attributes*: grpc.method, grpc.authority, grpc.status
+ *Attributes*: grpc.method, grpc.status
*Type*: Histogram (Size Buckets)
* **grpc.server.call.duration**
This metric aims to measure the end-to-end time an RPC takes from the server transport’s (HTTP2 / inproc / cronet) perspective.
Start timestamp - After the transport knows that it has a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK decoding is left to the implementation.
End timestamp - The first point at which the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this does not necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
- *Attributes*: grpc.method, grpc.authority, grpc.status
+ *Attributes*: grpc.method, grpc.status
*Type*: Histogram (Latency Buckets)
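
As a rough illustration of how a `grpc.server.call.duration` measurement (end timestamp minus start timestamp, in seconds) would fall into the latency buckets listed under Units earlier in this document, the sketch below maps a duration to a bucket index. The names `kLatencyBoundsSeconds` and `LatencyBucketIndex` are hypothetical and are not part of any gRPC or OpenTelemetry API. The sketch assumes the usual explicit-bucket convention in which each boundary is an inclusive upper bound; an actual SDK performs this bucketing itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Latency bucket boundaries in seconds, copied from the Units section of this
// document. The array name is illustrative only.
constexpr double kLatencyBoundsSeconds[] = {
    0,    0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002,
    0.003, 0.004,  0.005,   0.006,  0.008,  0.01,   0.013,  0.016, 0.02,
    0.025, 0.03,   0.04,    0.05,   0.065,  0.08,   0.1,    0.13,  0.16,
    0.2,   0.25,   0.3,     0.4,    0.5,    0.65,   0.8,    1,     2,
    5,     10,     20,      50,     100};

// Hypothetical helper: returns the index of the bucket that `seconds` falls
// into, assuming bucket i covers (bounds[i-1], bounds[i]] and the last bucket
// covers everything above the final boundary.
inline std::size_t LatencyBucketIndex(double seconds) {
  const double* begin = std::begin(kLatencyBoundsSeconds);
  const double* end = std::end(kLatencyBoundsSeconds);
  // lower_bound finds the first boundary that is >= seconds, which is the
  // inclusive upper bound of the bucket we want.
  return static_cast<std::size_t>(std::lower_bound(begin, end, seconds) - begin);
}
```

With 41 boundaries this yields 42 buckets; a call that takes longer than 100 seconds lands in the final overflow bucket.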
## OpenTelemetry Plugin Architecture @@ -163,10 +165,9 @@ OpenTelemetry plugin. Overall, the APIs should have the following capabilities - * Allow setting a [MeterProvider](https://opentelemetry.io/docs/specs/otel/metrics/api/#meterprovider) on individual plugins. Implementations should require a MeterProvider being - set. A MeterProvider not being set should either not be allowed, fail - registering of the plugin or result in a no-op. Some OpenTelemetry language - APIs have a global MeterProvider. gRPC implementations should *NOT* fallback - on this global. + set. A MeterProvider not being set should result in a no-op. Some + OpenTelemetry language APIs have a global MeterProvider. gRPC + implementations should *NOT* fallback on this global. * Optionally allow enabling/disabling metrics. This would allow optimizations to avoid computation and collection of expensive stats within the gRPC library. Note that even without this capability, users of OpenTelemetry @@ -190,13 +191,24 @@ release version of the gRPC library, for example, "1.57.1". ```c++ Class OpenTelemetryPluginBuilder { public: - // Enables base set of metrics by default - OpenTelemetryPluginBuilder(shared_ptr meter_provider) = default; - // Enable metric \a metric_name. - OpenTelemetryPluginBuilder& EnableMetric(absl::string_view metric_name); - // Disable metric \a metric_name - OpenTelemetryPluginBuilder& DisableMetric(absl::string_view metric_name); - // If set, is invoked by gRPC when a generic method type RPC is seen. \a generic_method_filter should return true if the generic method name should be recorded. Returning false results in the method name being replaced with "generic" in the recorded metrics. + // If `SetMeterProvider()` is not called, the stats plugin is a no-op. + OpenTelemetryPluginBuilder& SetMeterProvider( + std::shared_ptr meter_provider); + // Enable metrics in \a metric_names. Only these metrics are recorded by gRPC. 
+ // Sample - + // OpenTelemetryPluginBuilder().EnableMetrics(BaseMetrics()).SetMeterProvider(mp).BuildAndRegisterGlobal(); + OpenTelemetryPluginBuilder& EnableMetrics( + const absl::flat_hash_set& metric_names); + // The base set of metrics - + // grpc.client.attempt.started + // grpc.client.attempt.duration + // grpc.client.attempt.sent_total_compressed_message_size + // grpc.client.attempt.rcvd_total_compressed_message_size + // grpc.server.call.started + // grpc.server.call.duration + // grpc.server.call.sent_total_compressed_message_size + // grpc.server.call.rcvd_total_compressed_message_size + static absl::flat_hash_set BaseMetrics(); OpenTelemetryPluginBuilder& SetGenericMethodFilter(absl::AnyInvocable generic_method_filter); // Builds and registers a OpenTelemetry Plugin void BuildAndRegisterGlobal(); @@ -247,9 +259,8 @@ OpenTelemetry spec. * grpc_client_status -> grpc.status * grpc_server_method -> grpc.method * grpc_server_status -> grpc.status -* Two new attributes have been added. +* One new attribute has been added. * grpc.target - Added on client metrics - * grpc.authority - Added on server metrics * Latency metrics in the OpenTelemetry spec use the recommended `s` unit instead of `ms`. From 3135189ee82717bb71c0d6a7922086a2c3bcd6e0 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 28 Aug 2023 03:46:43 +0000 Subject: [PATCH 10/30] C++ API changes and target considerations --- A66-otel-stats.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 6f28855b7..d38a8406f 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -73,7 +73,7 @@ the scale to better fit the data. send arbitrary method names that could potentially open up the server to malicious attacks that result in metrics being stored with a high cardinality. To prevent this, unregistered/generic method names should by - default be reported with "generic" value instead. 
Implementations can + default be reported with "other" value instead. Implementations should provide the option to override this behavior to allow recording generic method names as well. * `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", @@ -81,7 +81,11 @@ the scale to better fit the data. * `grpc.target` : Canonicalized target URI used when creating gRPC Channel, e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000". Canonicalized target URI is its form with the scheme if the user didn't - mention the scheme. + mention the scheme. For channels such as inprocess channels where a target + URI is not available, implementations can synthesize a target URI. It is + possible for some channels to use IP addresses as target strings and this + might again blow up the cardinality. Implementations should provide the + option to override recorded target names with "other". #### Client Per-Attempt Instruments @@ -209,7 +213,10 @@ Class OpenTelemetryPluginBuilder { // grpc.server.call.sent_total_compressed_message_size // grpc.server.call.rcvd_total_compressed_message_size static absl::flat_hash_set BaseMetrics(); + // If \a generic_method_filter returns true for a method_name, that method_name is recorded as is, otherwise it is recorded as "other". OpenTelemetryPluginBuilder& SetGenericMethodFilter(absl::AnyInvocable generic_method_filter); + // If \a target_filter returns true for a target, that target is recorded as is, otherwise it is recorded as "other". 
+ OpenTelemetryPluginBuilder& SetTargetFilter(absl::AnyInvocable target_filter); // Builds and registers a OpenTelemetry Plugin void BuildAndRegisterGlobal(); }; From e8ac5505c33afa8e8bacd00848b6dd05a286090b Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 8 Sep 2023 22:07:47 +0000 Subject: [PATCH 11/30] Fill out CallTracer details --- A66-otel-stats.md | 49 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 3 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index d38a8406f..c771d8e98 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -144,12 +144,53 @@ the scale to better fit the data. This section describes a `CallTracer` approach to collect the client and server per-attempt/call metrics. A CallTracer is a class that is instantiated for every call. This class has various methods that are invoked during the lifetime of the -call. On the client side, the CallTracer knows about multiple attempts on the +call. On the client-side, the CallTracer knows about multiple attempts on the same call, and creates a `CallAttemptTracer` object for each attempt, and the -`CallAttemptTracer` gets invoked during the lifetime of the attempt. +`CallAttemptTracer` gets invoked during the lifetime of the attempt. On the +server-side, we have an equivalent `ServerCallTracer`. (There is no concept of +an attempt on the server-side.) The OTel plugin will basically be a way of configuring CallTracer factories on -gRPC clients and servers. +gRPC channels and servers. + +A CallTracer needs to know the channel's target in the canonical form, and the +full qualified method name for filling in the attributes needed on the metrics. +Similarly on the server-side, the `ServerCallTracer` needs to know the method of +the incoming call. Depending on the implementation details, the method may be +propagated as part of the initial metadata. + +The following call-outs are needed on the `CallTracer` - + +* When the call has been created. 
This call-out should be before payload + serialization. +* When new attempts are created on the call along with information on whether + the attempt was a transparent retry or not. (Attempts are created after name + resolution but before the pick.) This is also when it's expected for the + `CallAttemptTracer` to be created. +* When an attempt ends. This will be needed future stats around retries and + hedging. This information can also be propagated through the + `CallAttemptTracer` if the `CallAttemptTracer` keeps a reference to the + parent `CallTracer` object. +* When the call ends. This along with the call creation call-out allows the + `CallTracer` to calculate the call duration. + +The following call-outs are needed on the `CallAttemptTracer` - + +* When a new message is sent/received. The message should be in its compressed + form. +* When the trailing metadata/status is received for the attempt. Receipt of + this indicates that the attempt has ended. Implementations may choose to + delegate the responsibility of notifying the `CallTracer` about the attempt + end to the `CallAttemptTracer`. + +The following call-outs are needed on the `ServerCallTracer` - + +* When initial metadata is received by the transport for a call. This + indicates the start time of a new call. +* When a new message is sent/received. The message should be in its compressed + form. +* When trailing metadata/status is sent. This call-out should be as close to + the transport as possible to be able to capture the total time of the call. Implementations should allow multiple call/attempt tracers to be registered to a single call since there could be multiple plugins registered. For example, there @@ -180,6 +221,8 @@ OpenTelemetry plugin. Overall, the APIs should have the following capabilities - the MeterProvider. * Optionally allow setting of a OpenTelemetry plugin for a specific channel or server, instead of setting it globally. 
+* Optionally allowing setting of a map of constant attributes that are + recorded on all metrics associated with that plugin. Note that implementations of the gRPC OpenTelemetry plugin [should prefer](https://opentelemetry.io/docs/specs/otel/overview/) to only From 353c884ec976afcfa5cae78b5dd206dc5f6fcea0 Mon Sep 17 00:00:00 2001 From: Vindhya Ningegowda Date: Mon, 11 Sep 2023 16:47:19 -0700 Subject: [PATCH 12/30] added Java API for OpenTelemetry metrics --- A66-otel-stats.md | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index c771d8e98..cb2a4d9b8 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -271,7 +271,37 @@ for a particular channel or server builder. ### Java -To be filled +``` +public static class OpenTelemetryModuleBuilder { + /** + * OpenTelemetry instance is used to configure metrics settings. + * + * Sample + * SdkMeterProvider sdkMeterProvider = SdkMeterProvider.builder() + * .registerMetricReader( + * PeriodicMetricReader.builder( + * OtlpGrpcMetricExporter.builder().build()).build()) + * .build(); + * + * OpenTelemetry openTelemetry = OpenTelemetrySdk.builder() + * .setMeterProvider(sdkMeterProvider) + * .build(); + * + * If MeterProvider is not configured, no-op meterProvider will be used by default. + * It provides meters which do not record or emit. + */ + public OpenTelemetryModuleBuilder openTelemetry(OpenTelemetry openTelemetry); + + /* Enable metrics for listed metrics. */ + public OpenTelmetryBuilder enableMetrics(Set metricNames); + + /* If targetFilter returns true for a target, target is recorded as is. + * Otherwise it will be recorded as "other". 
*/ + public OpenTelemetryBuilder targetFilter(Predicate targetFilter); + + public OpenTelemetryModule build(); +} +``` ### Go From 6fab0c18dfd18d423b870af7564b102ddffb07e0 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 18 Sep 2023 09:35:07 +0000 Subject: [PATCH 13/30] Reviewer comments --- A66-otel-stats.md | 88 ++++++++++++++++++++++++++++------------------- 1 file changed, 53 insertions(+), 35 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index cb2a4d9b8..530e01085 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -22,6 +22,7 @@ with OpenTelemetry suggested as the successor framework. ### Related Proposals: +* [A6: gRPC Retry Design](A6-client-retries.md) * [A45: Exposing OpenCensus Metrics and Tracing for gRPC retry](A45-retry-stats.md) ## Proposal @@ -139,13 +140,18 @@ the scale to better fit the data. *Attributes*: grpc.method, grpc.status
*Type*: Histogram (Latency Buckets)
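
The `grpc.method` attribute on the server instruments above is subject to the cardinality rule described under Attributes: unregistered (generic) method names are recorded as "other" unless the application opts in. A minimal sketch of that decision follows. `MethodAttributeValue` and its parameters are hypothetical names chosen for illustration, and how an implementation learns whether a method is registered is an assumption left outside the sketch.

```cpp
#include <functional>
#include <string>

// Hypothetical helper: chooses the value recorded for the grpc.method
// attribute. `is_registered` says whether the method was registered with the
// server in advance. `generic_method_filter` is an optional user-supplied
// predicate; when it is unset, generic method names default to "other".
inline std::string MethodAttributeValue(
    const std::string& method, bool is_registered,
    const std::function<bool(const std::string&)>& generic_method_filter) {
  if (is_registered) {
    return method;  // Registered methods are always recorded as is.
  }
  if (generic_method_filter && generic_method_filter(method)) {
    return method;  // The application explicitly allowed this generic name.
  }
  return "other";  // Default: collapse arbitrary names to bound cardinality.
}
```

Keeping the default at "other" means a client sending arbitrary method names cannot grow the number of time series unless the filter allows it.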
-## OpenTelemetry Plugin Architecture +### OpenTelemetry Plugin Architecture This section describes a `CallTracer` approach to collect the client and server -per-attempt/call metrics. A CallTracer is a class that is instantiated for every -call. This class has various methods that are invoked during the lifetime of the -call. On the client-side, the CallTracer knows about multiple attempts on the -same call, and creates a `CallAttemptTracer` object for each attempt, and the +per-attempt/call metrics. Implementations are free to choose different ways of +representing/naming the classes and methods described here. The implementation +can choose to not create a class either as long as the overall capabilities +remain equivalent. + +A CallTracer is a class that is instantiated for every call. This class has +various methods that are invoked during the lifetime of the call. On the +client-side, the CallTracer knows about multiple attempts on the same call, and +creates a `CallAttemptTracer` object for each attempt, and the `CallAttemptTracer` gets invoked during the lifetime of the attempt. On the server-side, we have an equivalent `ServerCallTracer`. (There is no concept of an attempt on the server-side.) @@ -165,10 +171,10 @@ The following call-outs are needed on the `CallTracer` - serialization. * When new attempts are created on the call along with information on whether the attempt was a transparent retry or not. (Attempts are created after name - resolution but before the pick.) This is also when it's expected for the + resolution but before the LB pick.) This is also when it's expected for the `CallAttemptTracer` to be created. -* When an attempt ends. This will be needed future stats around retries and - hedging. This information can also be propagated through the +* When an attempt ends. This will be needed for future stats around retries + and hedging. 
This information can also be propagated through the `CallAttemptTracer` if the `CallAttemptTracer` keeps a reference to the parent `CallTracer` object. * When the call ends. This along with the call creation call-out allows the @@ -201,7 +207,7 @@ ability to configure the different plugins with different MeterProviders. A sample implementation of this approach is available in [gRPC Core](https://github.com/grpc/grpc/blob/v1.57.x/src/core/lib/channel/call_tracer.h). -## Language-Specific Details +### Language-Specific Details Each language implementation will provide an API for registering an OpenTelemetry plugin. Overall, the APIs should have the following capabilities - @@ -233,20 +239,16 @@ creation should use a `name` that identifies the library, for example, "grpc-c++", "grpc-java", "grpc-go". The `version` should be the same as the release version of the gRPC library, for example, "1.57.1". -### C++ +#### C++ ```c++ Class OpenTelemetryPluginBuilder { public: - // If `SetMeterProvider()` is not called, the stats plugin is a no-op. + // If `SetMeterProvider()` is not called, no metrics are collected. OpenTelemetryPluginBuilder& SetMeterProvider( std::shared_ptr meter_provider); - // Enable metrics in \a metric_names. Only these metrics are recorded by gRPC. - // Sample - - // OpenTelemetryPluginBuilder().EnableMetrics(BaseMetrics()).SetMeterProvider(mp).BuildAndRegisterGlobal(); - OpenTelemetryPluginBuilder& EnableMetrics( - const absl::flat_hash_set& metric_names); - // The base set of metrics - + // Methods to manipulate which instruments are enabled in the OTel Stats + // Plugin. 
The default set of instruments are - // grpc.client.attempt.started // grpc.client.attempt.duration // grpc.client.attempt.sent_total_compressed_message_size @@ -255,12 +257,28 @@ Class OpenTelemetryPluginBuilder { // grpc.server.call.duration // grpc.server.call.sent_total_compressed_message_size // grpc.server.call.rcvd_total_compressed_message_size - static absl::flat_hash_set BaseMetrics(); - // If \a generic_method_filter returns true for a method_name, that method_name is recorded as is, otherwise it is recorded as "other". - OpenTelemetryPluginBuilder& SetGenericMethodFilter(absl::AnyInvocable generic_method_filter); - // If \a target_filter returns true for a target, that target is recorded as is, otherwise it is recorded as "other". - OpenTelemetryPluginBuilder& SetTargetFilter(absl::AnyInvocable target_filter); - // Builds and registers a OpenTelemetry Plugin + OpenTelemetryPluginBuilder& EnableMetric(absl::string_view metric_name); + OpenTelemetryPluginBuilder& DisableMetric(absl::string_view metric_name); + OpenTelemetryPluginBuilder& DisableAllMetrics(); + // Allows setting a labels injector on calls traced through this plugin. + OpenTelemetryPluginBuilder& SetLabelsInjector( + std::unique_ptr labels_injector); + // If set, \a target_selector is called per channel to decide whether to + // collect metrics on that target or not. + OpenTelemetryPluginBuilder& SetTargetSelector( + absl::AnyInvocable + target_selector); + // If set, \a target_attribute_filter is called per channel to decide whether + // to record the target attribute on client or to replace it with "other". + OpenTelemetryPluginBuilder& SetTargetAttributeFilter( + absl::AnyInvocable + target_attribute_filter); + // If set, \a generic_method_attribute_filter is called per call with a + // generic method name. 
If it returns true for \a generic_method_name, that + // method name is recorded as is, otherwise it is recorded as "other" + OpenTelemetryPluginBuilder& SetGenericMethodAttributeFilter( + absl::AnyInvocable + generic_method_attribute_filter); void BuildAndRegisterGlobal(); }; @@ -269,7 +287,7 @@ Class OpenTelemetryPluginBuilder { In the future, additional API might be provided to allow registering the plugin for a particular channel or server builder. -### Java +#### Java ``` public static class OpenTelemetryModuleBuilder { @@ -303,15 +321,15 @@ public static class OpenTelemetryModuleBuilder { } ``` -### Go +#### Go To be filled -### Python +#### Python To be filled -## Migration from OpenCensus +### Migration from OpenCensus The following sections show the differences between the gRPC OpenCensus spec and the proposed gRPC OpenTelemetry spec and the mapping of metrics between the two. @@ -320,9 +338,9 @@ in the OpenTelemetry spec at present. Two migration strategies are also proposed for customers who are satisfied with the stats coverage provided by the current OpenTelemetry spec. -### Metric Schema Comparison +#### Metric Schema Comparison -#### Differences from gRPC OpenCensus Spec +##### Differences from gRPC OpenCensus Spec * OpenTelemetry instrument names don’t allow ‘/’ so we use ‘.’ as the separator. We also get rid of the “.io” suffix in “grpc.io” as it doesn’t @@ -344,7 +362,7 @@ OpenTelemetry spec. * Latency metrics in the OpenTelemetry spec use the recommended `s` unit instead of `ms`. 
-#### Metrics with Corresponding Equivalent +##### Metrics with Corresponding Equivalent The following OpenCensus metrics have an equivalent in the OpenTelemetry spec (with the above noted differences) allowing for receivers of the telemetry data @@ -359,7 +377,7 @@ grpc.io/server/started_rpcs | grpc.server.call.started grpc.io/server/completed_rpcs | (Derivable from grpc.server.call.duration) grpc.io/server/server_latency | grpc.server.call.duration -#### Metrics with Nuanced Differences +##### Metrics with Nuanced Differences Unfortunately, the implementations of the gRPC OpenCensus spec in the various languages do not agree on the definition of the following size metrics. Go @@ -375,7 +393,7 @@ grpc.io/client/received_bytes_per_rpc | grpc.client.attempt.rcvd_total_compresse grpc.io/server/sent_bytes_per_rpc | grpc.server.call.sent_total_compressed_message_size grpc.io/server/received_bytes_per_rpc | grpc.server.call.rcvd_total_compressed_message_size -#### OpenCensus Metrics not Initially Supported in OpenTelemetry +##### OpenCensus Metrics not Initially Supported in OpenTelemetry There are some additional metrics defined in the gRPC OpenCensus spec and retry stats which we will not be supporting in the first iteration of the @@ -404,9 +422,9 @@ OpenTelemetry spec with the appropriate changes. * grpc.io/client/transparent_retries * grpc.io/client/retry_delay_per_call -## Migration Strategies +#### Migration Strategies -### Migrate on a Per-Client Basis +##### Migrate on a Per-Client Basis * Update telemetry dashboards and alerts to join the results from the OpenCensus metrics and the OpenTelemetry metrics. @@ -415,7 +433,7 @@ OpenTelemetry spec with the appropriate changes. * After 100% rollout and some duration (to maintain previous history), update telemetry dashboards and alerts to not query OpenCensus metrics. 
-### Duplicate Metrics During Migration +##### Duplicate Metrics During Migration For this strategy, gRPC stacks need to support registration of both the OpenCensus and the OpenTelemetry plugins at the same time and allow both metrics From 9227f9b23ef4072e3c6a0f2eb7980a1afff650be Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 18 Sep 2023 09:49:49 +0000 Subject: [PATCH 14/30] Reviewer comments --- A66-otel-stats.md | 84 ++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 69 insertions(+), 15 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 530e01085..16c8bc62f 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -242,8 +242,9 @@ release version of the gRPC library, for example, "1.57.1". #### C++ ```c++ -Class OpenTelemetryPluginBuilder { +class OpenTelemetryPluginBuilder { public: + OpenTelemetryPluginBuilder(); // If `SetMeterProvider()` is not called, no metrics are collected. OpenTelemetryPluginBuilder& SetMeterProvider( std::shared_ptr meter_provider); @@ -260,24 +261,21 @@ Class OpenTelemetryPluginBuilder { OpenTelemetryPluginBuilder& EnableMetric(absl::string_view metric_name); OpenTelemetryPluginBuilder& DisableMetric(absl::string_view metric_name); OpenTelemetryPluginBuilder& DisableAllMetrics(); - // Allows setting a labels injector on calls traced through this plugin. - OpenTelemetryPluginBuilder& SetLabelsInjector( - std::unique_ptr labels_injector); - // If set, \a target_selector is called per channel to decide whether to - // collect metrics on that target or not. - OpenTelemetryPluginBuilder& SetTargetSelector( - absl::AnyInvocable - target_selector); // If set, \a target_attribute_filter is called per channel to decide whether // to record the target attribute on client or to replace it with "other". 
+ // This helps reduce the cardinality on metrics in cases where many channels + // are created with different targets in the same binary (which might happen + // for example, if the channel target string uses IP addresses directly). OpenTelemetryPluginBuilder& SetTargetAttributeFilter( absl::AnyInvocable target_attribute_filter); // If set, \a generic_method_attribute_filter is called per call with a - // generic method name. If it returns true for \a generic_method_name, that - // method name is recorded as is, otherwise it is recorded as "other" + // generic method type to decide whether to record the method name or to + // replace it with "other". Non-generic or pre-registered methods remain + // unaffected. If not set, by default, generic method names are replaced with + // "other" when recording metrics. OpenTelemetryPluginBuilder& SetGenericMethodAttributeFilter( - absl::AnyInvocable + absl::AnyInvocable generic_method_attribute_filter); void BuildAndRegisterGlobal(); }; @@ -323,7 +321,62 @@ public static class OpenTelemetryModuleBuilder { #### Go -To be filled +``` +import ( + "go.opentelemetry.io/otel/attribute" + "go.opentelemetry.io/otel/metric" +) + + +package opentelemetry + +// MetricsOptions are the metrics options for OpenTelemetry instrumentation. +type MetricsOptions struct { + // MeterProvider is the MeterProvider instance that will be used for access + // to Named Meter instances to instrument an application. To enable metrics + // collection, set a meter provider. If unset, no metrics will be recorded. + // Any implementation knobs (i.e. views, bounds) set in the passed in object + // take precedence over the API calls from the interface in this component + // (i.e. it will create default views for unset views). + MeterProvider metric.MeterProvider + + // Metrics are the metrics to instrument. Will turn on the corresponding + // metric supported by the client and server instrumentation components if + // applicable. 
+ Metrics []string + + // Attributes are constant attributes applied to every recorded metric. + Attributes []attribute.KeyValue +} + +// DialOption returns a dial option which enables OpenTelemetry instrumentation +// code for a grpc.ClientConn. +// +// Client applications interested in instrumenting their grpc.ClientConn should +// pass the dial option returned from this function as a dial option to +// grpc.Dial(). +// +// For the metrics supported by this instrumentation code, a user needs to +// specify the client metrics to record in metrics options. A user also needs to +// provide an implementation of a MeterProvider. If the passed in Meter Provider +// does not have the view configured for an individual metric turned on, the API +// call in this component will create a default view for that metric. +func DialOption(mo MetricsOptions) grpc.DialOption {} + +// ServerOption returns a server option which enables OpenTelemetry +// instrumentation code for a grpc.Server. +// +// Server applications interested in instrumenting their grpc.Server should pass +// the server option returned from this function as an argument to +// grpc.NewServer(). +// +// For the metrics supported by this instrumentation code, a user needs to +// specify the client metrics to record in metrics options. A user also needs to +// provide an implementation of a MeterProvider. If the passed in Meter Provider +// does not have the view configured for an individual metric turned on, the API +// call in this component will create a default view for that metric. +func ServerOption(mo MetricsOptions) grpc.ServerOption {} +``` #### Python @@ -468,8 +521,9 @@ immediately obvious - but not all implementations can get the uncompressed message length (as recommended by OTel RPC conventions.) -This gRFC, hence, intends to override the [General RPC conventions] for gRPC's -purposes. 
+This gRFC, hence, intends to override the +[General RPC conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/rpc-metrics/) +for gRPC's purposes. ## Implementation From 185358b43a4abc92223f8ec3417ba3ecd4c52805 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 18 Sep 2023 10:03:48 +0000 Subject: [PATCH 15/30] Reviewer comments --- A66-otel-stats.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 16c8bc62f..3755437f1 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -23,6 +23,7 @@ with OpenTelemetry suggested as the successor framework. ### Related Proposals: * [A6: gRPC Retry Design](A6-client-retries.md) +* [A39: xDS HTTP Filter Support](A39-xds-http-filters.md) * [A45: Exposing OpenCensus Metrics and Tracing for gRPC retry](A45-retry-stats.md) ## Proposal @@ -144,9 +145,8 @@ the scale to better fit the data. This section describes a `CallTracer` approach to collect the client and server per-attempt/call metrics. Implementations are free to choose different ways of -representing/naming the classes and methods described here. The implementation -can choose to not create a class either as long as the overall capabilities -remain equivalent. +representing/naming the classes and methods described here as long as the +overall capabilities remain equivalent. A CallTracer is a class that is instantiated for every call. This class has various methods that are invoked during the lifetime of the call. On the @@ -171,7 +171,7 @@ The following call-outs are needed on the `CallTracer` - serialization. * When new attempts are created on the call along with information on whether the attempt was a transparent retry or not. (Attempts are created after name - resolution but before the LB pick.) This is also when it's expected for the + resolution and after any xDS HTTP filters but before the LB pick.) 
This is also when it's expected for the `CallAttemptTracer` to be created. * When an attempt ends. This will be needed for future stats around retries and hedging. This information can also be propagated through the From e349577da17c42e95533edff561d0f69beb6c4a8 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 18 Sep 2023 18:26:27 +0000 Subject: [PATCH 16/30] Reviewer comments --- A66-otel-stats.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 3755437f1..c56f1e25e 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -150,11 +150,11 @@ overall capabilities remain equivalent. A CallTracer is a class that is instantiated for every call. This class has various methods that are invoked during the lifetime of the call. On the -client-side, the CallTracer knows about multiple attempts on the same call, and -creates a `CallAttemptTracer` object for each attempt, and the -`CallAttemptTracer` gets invoked during the lifetime of the attempt. On the -server-side, we have an equivalent `ServerCallTracer`. (There is no concept of -an attempt on the server-side.) +client-side, the CallTracer knows about multiple attempts on the same call (due +to retries or hedging), and creates a `CallAttemptTracer` object for each +attempt, and the `CallAttemptTracer` gets invoked during the lifetime of the +attempt. On the server-side, we have an equivalent `ServerCallTracer`. (There is +no concept of an attempt on the server-side.) The OTel plugin will basically be a way of configuring CallTracer factories on gRPC channels and servers. @@ -171,8 +171,8 @@ The following call-outs are needed on the `CallTracer` - serialization. * When new attempts are created on the call along with information on whether the attempt was a transparent retry or not. (Attempts are created after name - resolution and after any xDS HTTP filters but before the LB pick.) 
This is also when it's expected for the - `CallAttemptTracer` to be created. + resolution and after any xDS HTTP filters but before the LB pick.) This is + also when it's expected for the `CallAttemptTracer` to be created. * When an attempt ends. This will be needed for future stats around retries and hedging. This information can also be propagated through the `CallAttemptTracer` if the `CallAttemptTracer` keeps a reference to the From 7190fdc5dc8ed43f51b85f920956cca77cb11445 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 18 Sep 2023 22:58:19 +0000 Subject: [PATCH 17/30] Reviewer comments --- A66-otel-stats.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index c56f1e25e..310317741 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -79,7 +79,8 @@ the scale to better fit the data. provide the option to override this behavior to allow recording generic method names as well. * `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", - "DEADLINE_EXCEEDED" + "DEADLINE_EXCEEDED". + [(Full list)](https://grpc.github.io/grpc/core/md_doc_statuscodes.html) * `grpc.target` : Canonicalized target URI used when creating gRPC Channel, e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000". Canonicalized target URI is its form with the scheme if the user didn't @@ -227,8 +228,8 @@ OpenTelemetry plugin. Overall, the APIs should have the following capabilities - the MeterProvider. * Optionally allow setting of a OpenTelemetry plugin for a specific channel or server, instead of setting it globally. -* Optionally allowing setting of a map of constant attributes that are - recorded on all metrics associated with that plugin. +* Optionally allow setting of a map of constant attributes that are recorded + on all metrics associated with that plugin. 
Note that implementations of the gRPC OpenTelemetry plugin [should prefer](https://opentelemetry.io/docs/specs/otel/overview/) to only @@ -277,6 +278,8 @@ class OpenTelemetryPluginBuilder { OpenTelemetryPluginBuilder& SetGenericMethodAttributeFilter( absl::AnyInvocable generic_method_attribute_filter); + // Registers a global plugin that acts on all channels and servers running on + // the process. void BuildAndRegisterGlobal(); }; @@ -388,8 +391,7 @@ The following sections show the differences between the gRPC OpenCensus spec and the proposed gRPC OpenTelemetry spec and the mapping of metrics between the two. It also presents metrics present in OpenCensus spec that do not map to a metric in the OpenTelemetry spec at present. Two migration strategies are also proposed -for customers who are satisfied with the stats coverage provided by the current -OpenTelemetry spec. +for customers who are satisfied with the stats coverage provided by this spec. #### Metric Schema Comparison From 7cb164490f402a371d05d51e7be1b7a74397c13c Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Fri, 22 Sep 2023 01:17:30 +0000 Subject: [PATCH 18/30] Reviewer comments --- A66-otel-stats.md | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 310317741..0446d1b0e 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -57,7 +57,8 @@ spec. The OpenTelemetry API has added an experimental feature for [advice](https://opentelemetry.io/docs/specs/otel/metrics/api/#instrument-advice) that would allow the gRPC library to provide these buckets as a hint. Since this is still an experimental feature and not yet implemented in all languages, it is -upto the user to choose the right bucket boundaries. +upto the user to choose the right bucket boundaries and set it through the +[OTel SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). 
Also note that, as per an [OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26) @@ -69,10 +70,12 @@ the scale to better fit the data. #### Attributes * `grpc.method` : Full gRPC method name, including package, service and - method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow". Note that some - gRPC implementations allow server to handle generic method names, i.e., not - registering method names in advance with the server. This allows clients to - send arbitrary method names that could potentially open up the server to + method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow". Note that gRPC + servers can receive arbitrary method names, i.e., method names that have not + been registered in advance with the server. This normally results in those + RPCs being rejected with an UNIMPLEMENTED status. Some gRPC implementations + allow servers to handle such generic method names. Since the stats plugin + would be recording all of these RPCs, this could open up the server to malicious attacks that result in metrics being stored with a high cardinality. To prevent this, unregistered/generic method names should by default be reported with "other" value instead. Implementations should @@ -96,19 +99,22 @@ the scale to better fit the data. The total number of RPC attempts started, including those that have not completed.
*Attributes*: grpc.method, grpc.target
*Type*: Counter
- *Unit*: {attempt}
+ *Unit*: `{attempt}`
* **grpc.client.attempt.duration**
End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel.
*Attributes*: grpc.method, grpc.target, grpc.status
*Type*: Histogram (Latency Buckets)
+ *Unit*: `s`
* **grpc.client.attempt.sent_total_compressed_message_size**
Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
Attributes: grpc.method, grpc.target, grpc.status
Type: Histogram (Size Buckets)
+ *Unit*: `By`
* **grpc.client.attempt.rcvd_total_compressed_message_size**
Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
*Attributes*: grpc.method, grpc.target, grpc.status
*Type*: Histogram (Size Buckets)
+ *Unit*: `By`
#### Client Per-Call Instruments @@ -119,6 +125,7 @@ the scale to better fit the data. If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization should be included in the measurement.
*Attributes*: grpc.method, grpc.target, grpc.status
*Type*: Histogram (Latency Buckets)
+ *Unit*: `s`
#### Server Instruments @@ -131,16 +138,19 @@ the scale to better fit the data. Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
*Attributes*: grpc.method, grpc.status
*Type*: Histogram (Size Buckets)
+ *Unit*: `By`
* **grpc.server.call.rcvd_total_compressed_message_size**
Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
*Attributes*: grpc.method, grpc.status
*Type*: Histogram (Size Buckets)
+ *Unit*: `By`
* **grpc.server.call.duration**
This metric aims to measure the end2end time an RPC takes from the server transport’s (HTTP2/ inproc / cronet) perspective.
Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
*Attributes*: grpc.method, grpc.status
*Type*: Histogram (Latency Buckets)
+ *Unit*: `s`
### OpenTelemetry Plugin Architecture @@ -157,8 +167,8 @@ attempt, and the `CallAttemptTracer` gets invoked during the lifetime of the attempt. On the server-side, we have an equivalent `ServerCallTracer`. (There is no concept of an attempt on the server-side.) -The OTel plugin will basically be a way of configuring CallTracer factories on -gRPC channels and servers. +The OTel plugin will configure CallTracer factories on gRPC channels and +servers. A CallTracer needs to know the channel's target in the canonical form, and the full qualified method name for filling in the attributes needed on the metrics. From 906e5db03c133a2ae5a071a45493a080f4bc89ff Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 25 Sep 2023 20:26:43 +0000 Subject: [PATCH 19/30] Reviewer comments --- A66-otel-stats.md | 69 +++++++++++++++-------------------------------- 1 file changed, 21 insertions(+), 48 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 0446d1b0e..f53b915b3 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -57,7 +57,7 @@ spec. The OpenTelemetry API has added an experimental feature for [advice](https://opentelemetry.io/docs/specs/otel/metrics/api/#instrument-advice) that would allow the gRPC library to provide these buckets as a hint. Since this is still an experimental feature and not yet implemented in all languages, it is -upto the user to choose the right bucket boundaries and set it through the +up to the user to choose the right bucket boundaries and set it through the [OTel SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). Also note that, as per an @@ -230,16 +230,6 @@ OpenTelemetry plugin. Overall, the APIs should have the following capabilities - set. A MeterProvider not being set should result in a no-op. Some OpenTelemetry language APIs have a global MeterProvider. gRPC implementations should *NOT* fallback on this global. -* Optionally allow enabling/disabling metrics. 
This would allow optimizations - to avoid computation and collection of expensive stats within the gRPC - library. Note that even without this capability, users of OpenTelemetry - would be able to customize - [views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) through - the MeterProvider. -* Optionally allow setting of a OpenTelemetry plugin for a specific channel or - server, instead of setting it globally. -* Optionally allow setting of a map of constant attributes that are recorded - on all metrics associated with that plugin. Note that implementations of the gRPC OpenTelemetry plugin [should prefer](https://opentelemetry.io/docs/specs/otel/overview/) to only @@ -248,7 +238,12 @@ depend on the OpenTelemetry API and not the OpenTelemetry SDK. The [Meter](https://opentelemetry.io/docs/specs/otel/metrics/api/#get-a-meter) creation should use a `name` that identifies the library, for example, "grpc-c++", "grpc-java", "grpc-go". The `version` should be the same as the -release version of the gRPC library, for example, "1.57.1". +release version of the gRPC library, for example, "1.57.1". The instruments +described above will be created from this meter. + +Users of the gRPC OpenTelemetry plugin will use the OTel SDK's MeterProvider to +[control the views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) +and customize the metrics that will be exported. #### C++ @@ -259,19 +254,6 @@ class OpenTelemetryPluginBuilder { // If `SetMeterProvider()` is not called, no metrics are collected. OpenTelemetryPluginBuilder& SetMeterProvider( std::shared_ptr meter_provider); - // Methods to manipulate which instruments are enabled in the OTel Stats - // Plugin. 
The default set of instruments are - - // grpc.client.attempt.started - // grpc.client.attempt.duration - // grpc.client.attempt.sent_total_compressed_message_size - // grpc.client.attempt.rcvd_total_compressed_message_size - // grpc.server.call.started - // grpc.server.call.duration - // grpc.server.call.sent_total_compressed_message_size - // grpc.server.call.rcvd_total_compressed_message_size - OpenTelemetryPluginBuilder& EnableMetric(absl::string_view metric_name); - OpenTelemetryPluginBuilder& DisableMetric(absl::string_view metric_name); - OpenTelemetryPluginBuilder& DisableAllMetrics(); // If set, \a target_attribute_filter is called per channel to decide whether // to record the target attribute on client or to replace it with "other". // This helps reduce the cardinality on metrics in cases where many channels @@ -290,6 +272,20 @@ class OpenTelemetryPluginBuilder { generic_method_attribute_filter); // Registers a global plugin that acts on all channels and servers running on // the process. + // The most common way to use this API is - + // + // OpenTelemetryPluginBuilder().SetMeterProvider(provider) + // .BuildAndRegisterGlobal(); + // + // The set of instruments available are - + // grpc.client.attempt.started + // grpc.client.attempt.duration + // grpc.client.attempt.sent_total_compressed_message_size + // grpc.client.attempt.rcvd_total_compressed_message_size + // grpc.server.call.started + // grpc.server.call.duration + // grpc.server.call.sent_total_compressed_message_size + // grpc.server.call.rcvd_total_compressed_message_size void BuildAndRegisterGlobal(); }; @@ -321,9 +317,6 @@ public static class OpenTelemetryModuleBuilder { */ public OpenTelemetryModuleBuilder openTelemetry(OpenTelemetry openTelemetry); - /* Enable metrics for listed metrics. */ - public OpenTelmetryBuilder enableMetrics(Set metricNames); - /* If targetFilter returns true for a target, target is recorded as is. * Otherwise it will be recorded as "other". 
*/ public OpenTelemetryBuilder targetFilter(Predicate targetFilter); @@ -352,14 +345,6 @@ type MetricsOptions struct { // take precedence over the API calls from the interface in this component // (i.e. it will create default views for unset views). MeterProvider metric.MeterProvider - - // Metrics are the metrics to instrument. Will turn on the corresponding - // metric supported by the client and server instrumentation components if - // applicable. - Metrics []string - - // Attributes are constant attributes applied to every recorded metric. - Attributes []attribute.KeyValue } // DialOption returns a dial option which enables OpenTelemetry instrumentation @@ -368,12 +353,6 @@ type MetricsOptions struct { // Client applications interested in instrumenting their grpc.ClientConn should // pass the dial option returned from this function as a dial option to // grpc.Dial(). -// -// For the metrics supported by this instrumentation code, a user needs to -// specify the client metrics to record in metrics options. A user also needs to -// provide an implementation of a MeterProvider. If the passed in Meter Provider -// does not have the view configured for an individual metric turned on, the API -// call in this component will create a default view for that metric. func DialOption(mo MetricsOptions) grpc.DialOption {} // ServerOption returns a server option which enables OpenTelemetry @@ -382,12 +361,6 @@ func DialOption(mo MetricsOptions) grpc.DialOption {} // Server applications interested in instrumenting their grpc.Server should pass // the server option returned from this function as an argument to // grpc.NewServer(). -// -// For the metrics supported by this instrumentation code, a user needs to -// specify the client metrics to record in metrics options. A user also needs to -// provide an implementation of a MeterProvider. 
If the passed in Meter Provider -// does not have the view configured for an individual metric turned on, the API -// call in this component will create a default view for that metric. func ServerOption(mo MetricsOptions) grpc.ServerOption {} ``` From 0c94c6411af735af84685a02b650802c660676cf Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 25 Sep 2023 21:29:54 +0000 Subject: [PATCH 20/30] Add Python API --- A66-otel-stats.md | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index f53b915b3..dddc25d86 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -57,7 +57,8 @@ spec. The OpenTelemetry API has added an experimental feature for [advice](https://opentelemetry.io/docs/specs/otel/metrics/api/#instrument-advice) that would allow the gRPC library to provide these buckets as a hint. Since this is still an experimental feature and not yet implemented in all languages, it is -up to the user to choose the right bucket boundaries and set it through the +up to the user of the gRPC OpenTelemetry plugin to choose the right bucket +boundaries and set it through the [OTel SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). Also note that, as per an @@ -296,7 +297,7 @@ for a particular channel or server builder. #### Java -``` +```java public static class OpenTelemetryModuleBuilder { /** * OpenTelemetry instance is used to configure metrics settings. @@ -327,7 +328,7 @@ public static class OpenTelemetryModuleBuilder { #### Go -``` +```go import ( "go.opentelemetry.io/otel/attribute" "go.opentelemetry.io/otel/metric" @@ -366,7 +367,25 @@ func ServerOption(mo MetricsOptions) grpc.ServerOption {} #### Python -To be filled +```python + + # This class is part of an EXPERIMENTAL API and subject to major changes. 
+class OpenTelemetryObservability: + def set_meter_provider(self, meter_provider: MeterProvider) -> None: + # If `set_meter_provider()` is not called, no metrics are collected. + pass + + def set_target_attribute_filter(self, filter: Callable[[str], bool]) -> None: + # If set, this filter will be called per channel to decide whether to + # record the target attribute on client or to replace it with "other". + pass + + def set_generic_method_attribute_filter(self, filter: Callable[[str], bool]) -> None: + # If set, this filter will be called per call with a generic method type + # to decide whether to record the method attribute or to replace + # it with "other". + pass +``` ### Migration from OpenCensus From 082d554c6eba6dfe140891b85b34d642c136641a Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Mon, 25 Sep 2023 22:17:48 +0000 Subject: [PATCH 21/30] Reviewer comments --- A66-otel-stats.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index dddc25d86..f44eb3d9b 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -62,8 +62,8 @@ boundaries and set it through the [OTel SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). Also note that, as per an -[OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26) -though, changes to bucket boundaries might not be considered a breaking change. +[OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26), +changes to bucket boundaries might not be considered a breaking change. Depending on the proposal, this recommendation would change to use `ExponentialHistogram`s instead, which would allow for automatic adjustments of the scale to better fit the data. @@ -92,7 +92,8 @@ the scale to better fit the data. URI is not available, implementations can synthesize a target URI. 
It is possible for some channels to use IP addresses as target strings and this might again blow up the cardinality. Implementations should provide the - option to override recorded target names with "other". + option to override recorded target names with "other". If no such override + is provided, the default behavior will be to record the target as is. #### Client Per-Attempt Instruments @@ -326,6 +327,13 @@ public static class OpenTelemetryModuleBuilder { } ``` +Note: For non-generated methods, method names are recorded as "other" for +`grpc.method` attribute. If you are interested in recording the method names for +these methods, set +[`isSampledToLocalTracing`](https://grpc.github.io/grpc-java/javadoc/io/grpc/MethodDescriptor.html#isSampledToLocalTracing\(\)) +to `true` while defining your methods in +[`HandlerRegistry`](https://grpc.github.io/grpc-java/javadoc/io/grpc/HandlerRegistry.html). + #### Go ```go From 7663a4ef6defb19dbb56ff9666d47d72a3daf8c6 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Tue, 26 Sep 2023 00:39:49 +0000 Subject: [PATCH 22/30] OTel Plugin Arch details for Java and GO --- A66-otel-stats.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index f44eb3d9b..91ea8c765 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -220,6 +220,23 @@ ability to configure the different plugins with different MeterProviders. A sample implementation of this approach is available in [gRPC Core](https://github.com/grpc/grpc/blob/v1.57.x/src/core/lib/channel/call_tracer.h). +In grpc-java, a client interceptor is provided by the gRPC OTel plugin. This +interceptor adds a `CallAttemptTracerFactory` to the client call. This factory +is equivalent to the `CallTracer`. For each attempt, this factory is invoked to +create a `ClientStreamTracer` analogous to `CallAttemptTracer` for each attempt. 
+On the server-side, a `ServerStreamTracer.Factory` is used to create tracers +analogous to `ServerCallTracer` for each incoming call. + +In grpc-go, similar to grpc-java, an interceptor is invoked per call. This +interceptor is registered when the OTel Dial Option is passed in to the channel, +and has access to a context scoped to the call. A `StatsHandler` object owned by +the channel gets call-outs for each event that happens during the lifetime of an +attempt. Along with each call-out, a context object scoped to the attempt +is passed in, making it equivalent to the functionality of the +`CallAttemptTracer`. On the server side, a `StatsHandler` object gets call-outs +similarly along with a server call scoped context object, to get +`ServerCallTracer` equivalent functionality. + ### Language-Specific Details Each language implementation will provide an API for registering an From b0c22c74050e5dd6bb491e1fa118c1fcb8f71ca9 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Tue, 26 Sep 2023 23:22:19 +0000 Subject: [PATCH 23/30] Fix hyperlink --- A66-otel-stats.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 91ea8c765..f6685f4c7 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -17,8 +17,8 @@ There are a collection of [metrics](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md) proposed by OpenCensus for gRPC. OpenCensus is no longer being actively maintained and is being -[deprecated](https://opentelemetry.io/blog/2023/sunsetting-opencensus/#:~:text=Compatibility%20specification%204.-,What%20to%20Expect%20After%20July%2031st%2C%202023,found%20will%20not%20be%20patched.), -with OpenTelemetry suggested as the successor framework. +[deprecated](https://opentelemetry.io/blog/2023/sunsetting-opencensus/), with +OpenTelemetry suggested as the successor framework. 
### Related Proposals: From 8751cf1529937254b0493d30a9cb955487eef325 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 27 Sep 2023 17:38:16 +0000 Subject: [PATCH 24/30] Reviewer comments --- A66-otel-stats.md | 44 ++++++++++++++++++++++---------------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index f6685f4c7..16eb6683b 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -61,12 +61,12 @@ up to the user of the gRPC OpenTelemetry plugin to choose the right bucket boundaries and set it through the [OTel SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). -Also note that, as per an +Note that, according to an [OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26), -changes to bucket boundaries might not be considered a breaking change. -Depending on the proposal, this recommendation would change to use -`ExponentialHistogram`s instead, which would allow for automatic adjustments of -the scale to better fit the data. +changes to bucket boundaries may not be considered as breaking. Depending on the +proposal, this recommendation would change to use `ExponentialHistogram`s +instead, which would allow for automatic adjustments of the scale to better fit +the data. #### Attributes @@ -87,13 +87,14 @@ the scale to better fit the data. [(Full list)](https://grpc.github.io/grpc/core/md_doc_statuscodes.html) * `grpc.target` : Canonicalized target URI used when creating gRPC Channel, e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000". - Canonicalized target URI is its form with the scheme if the user didn't - mention the scheme. For channels such as inprocess channels where a target - URI is not available, implementations can synthesize a target URI. It is - possible for some channels to use IP addresses as target strings and this - might again blow up the cardinality. 
Implementations should provide the - option to override recorded target names with "other". If no such override - is provided, the default behavior will be to record the target as is. + Canonicalized target URI is the form with the scheme included if the user + didn't mention the scheme ([scheme]:///[target]). For channels such as + inprocess channels where a target URI is not available, implementations can + synthesize a target URI. It is possible for some channels to use IP + addresses as target strings and this might again blow up the cardinality. + Implementations should provide the option to override recorded target names + with "other" instead of the actual target. If no such override is provided, + the default behavior will be to record the target as is. #### Client Per-Attempt Instruments @@ -173,7 +174,7 @@ The OTel plugin will configure CallTracer factories on gRPC channels and servers. A CallTracer needs to know the channel's target in the canonical form, and the -full qualified method name for filling in the attributes needed on the metrics. +fully qualified method name for filling in the attributes needed on the metrics. Similarly on the server-side, the `ServerCallTracer` needs to know the method of the incoming call. Depending on the implementation details, the method may be propagated as part of the initial metadata. @@ -243,11 +244,10 @@ Each language implementation will provide an API for registering an OpenTelemetry plugin. Overall, the APIs should have the following capabilities - * Allow installing multiple OpenTelemetry plugins. -* Allow setting a +* Implementations must provide an option to set [MeterProvider](https://opentelemetry.io/docs/specs/otel/metrics/api/#meterprovider) - on individual plugins. Implementations should require a MeterProvider being - set. A MeterProvider not being set should result in a no-op. Some - OpenTelemetry language APIs have a global MeterProvider. gRPC + on individual plugins. 
A MeterProvider not being set should result in a + no-op. Some OpenTelemetry language APIs have a global MeterProvider. gRPC implementations should *NOT* fallback on this global. Note that implementations of the gRPC OpenTelemetry plugin @@ -462,11 +462,11 @@ grpc.io/server/server_latency | grpc.server.call.duration ##### Metrics with Nuanced Differences Unfortunately, the implementations of the gRPC OpenCensus spec in the various -languages do not agree on the definition of the following size metrics. Go -records uncompressed message bytes for the OpenCensus metric, while C++ and Java -record the compressed message bytes. The OpenTelemetry spec proposed here calls -for recording the compressed message bytes, resulting in an equivalence between -the metrics definitions for C++ and Java, but not for Go. +languages do not agree on the definition of the following message size metrics. +Go records uncompressed message bytes for the OpenCensus metric, while C++ and +Java record the compressed message bytes. The OpenTelemetry spec proposed here +calls for recording the compressed message bytes, resulting in an equivalence +between the metrics definitions for C++ and Java, but not for Go. gRPC OpenCensus | gRPC OpenTelemetry ------------------------------------- | ------------------ From 368f45b441d634f10c02a99b341f8c075c6205b3 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 27 Sep 2023 18:12:22 +0000 Subject: [PATCH 25/30] Update go API doc --- A66-otel-stats.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index 16eb6683b..d2a05fee4 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -367,9 +367,6 @@ type MetricsOptions struct { // MeterProvider is the MeterProvider instance that will be used for access // to Named Meter instances to instrument an application. To enable metrics // collection, set a meter provider. If unset, no metrics will be recorded. - // Any implementation knobs (i.e. 
views, bounds) set in the passed in object - // take precedence over the API calls from the interface in this component - // (i.e. it will create default views for unset views). MeterProvider metric.MeterProvider } From 41d9e5c3bbaf11f4ba14c00fe48e8b71a05caa95 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 27 Sep 2023 20:16:15 +0000 Subject: [PATCH 26/30] Java API doc --- A66-otel-stats.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index d2a05fee4..c587cd207 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -336,8 +336,8 @@ public static class OpenTelemetryModuleBuilder { */ public OpenTelemetryModuleBuilder openTelemetry(OpenTelemetry openTelemetry); - /* If targetFilter returns true for a target, target is recorded as is. - * Otherwise it will be recorded as "other". */ + /* If targetFilter is set, and returns true for a target, target is recorded as is. Records "other" on false. + If targetFilter is not set, target is recorded as is. */ public OpenTelemetryBuilder targetFilter(Predicate targetFilter); public OpenTelemetryModule build(); From 8045b2203a31d24eeffc488fce81304c10cc0b07 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Wed, 27 Sep 2023 23:58:42 +0000 Subject: [PATCH 27/30] Reviewer comments --- A66-otel-stats.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index c587cd207..ad011058d 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -15,10 +15,9 @@ Propose a metrics data model for gRPC OpenTelemetry metrics. There are a collection of [metrics](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md) -proposed by OpenCensus for gRPC. OpenCensus is no longer being actively -maintained and is being -[deprecated](https://opentelemetry.io/blog/2023/sunsetting-opencensus/), with -OpenTelemetry suggested as the successor framework. +proposed by OpenCensus for gRPC. 
OpenCensus is +[no longer being actively maintained](https://opentelemetry.io/blog/2023/sunsetting-opencensus/), +with OpenTelemetry suggested as the successor framework. ### Related Proposals: @@ -148,7 +147,7 @@ the data. *Type*: Histogram (Size Buckets)
*Unit*: `By`
* **grpc.server.call.duration**
- This metric aims to measure the end2end time an RPC takes from the server transport’s (HTTP2/ inproc / cronet) perspective.
+ This metric aims to measure the end-to-end time an RPC takes from the server transport’s (HTTP2/inproc) perspective.
Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
*Attributes*: grpc.method, grpc.status
@@ -529,10 +528,10 @@ disabling the OpenCensus plugin. ## Rationale -OpenCensus is no longer being actively maintained and is being deprecated, with -OpenTelemetry suggested as the successor framework. The OpenTelemetry spec aims -to maintain compatibility with the gRPC OpenCensus spec wherever reasonable to -allow for an easy migration path. +OpenCensus is no longer being actively maintained, with OpenTelemetry suggested +as the successor framework. The OpenTelemetry spec aims to maintain +compatibility with the gRPC OpenCensus spec wherever reasonable to allow for an +easy migration path. There is a [General RPC conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/rpc-metrics.md) From 6dcb1b951b409da7ead652ec57d8e2a9dd6b457b Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 28 Sep 2023 00:01:57 +0000 Subject: [PATCH 28/30] s/OTel/OpenTelemetry --- A66-otel-stats.md | 33 +++++++++++++++++---------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index ad011058d..c9c2be741 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -58,7 +58,7 @@ that would allow the gRPC library to provide these buckets as a hint. Since this is still an experimental feature and not yet implemented in all languages, it is up to the user of the gRPC OpenTelemetry plugin to choose the right bucket boundaries and set it through the -[OTel SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). +[OpenTelemetry SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). Note that, according to an [OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26), @@ -169,8 +169,8 @@ attempt, and the `CallAttemptTracer` gets invoked during the lifetime of the attempt. On the server-side, we have an equivalent `ServerCallTracer`. 
(There is no concept of an attempt on the server-side.) -The OTel plugin will configure CallTracer factories on gRPC channels and -servers. +The OpenTelemetry plugin will configure CallTracer factories on gRPC channels +and servers. A CallTracer needs to know the channel's target in the canonical form, and the fully qualified method name for filling in the attributes needed on the metrics. @@ -220,19 +220,19 @@ ability to configure the different plugins with different MeterProviders. A sample implementation of this approach is available in [gRPC Core](https://github.com/grpc/grpc/blob/v1.57.x/src/core/lib/channel/call_tracer.h). -In grpc-java, a client interceptor is provided by the gRPC OTel plugin. This -interceptor adds a `CallAttemptTracerFactory` to the client call. This factory -is equivalent to the `CallTracer`. For each attempt, this factory is invoked to -create a `ClientStreamTracer` analogous to `CallAttemptTracer` for each attempt. -On the server-side, a `ServerStreamTracer.Factory` is used to create tracers -analogous to `ServerCallTracer` for each incoming call. +In grpc-java, a client interceptor is provided by the gRPC OpenTelemetry plugin. +This interceptor adds a `CallAttemptTracerFactory` to the client call. This +factory is equivalent to the `CallTracer`. For each attempt, this factory is +invoked to create a `ClientStreamTracer` analogous to `CallAttemptTracer` for +each attempt. On the server-side, a `ServerStreamTracer.Factory` is used to +create tracers analogous to `ServerCallTracer` for each incoming call. In grpc-go, similar to grpc-java, an interceptor is invoked per call. This -interceptor is registered when the OTel Dial Option is passed in to the channel, -and has access to a context scoped to the call. `StatsHandler` object owned by -the channel gets call-outs for each event that happens on the lifetime of an -attempt. 
Along with each call-out gets, a context object scoped to the attempt -is passed in, making it equivalent to the functionality of the +interceptor is registered when the OpenTelemetry Dial Option is passed in to the +channel, and has access to a context scoped to the call. A `StatsHandler` object +owned by the channel gets call-outs for each event that happens during the lifetime +of an attempt. Along with each call-out, a context object scoped to the +attempt is passed in, making it equivalent to the functionality of the `CallAttemptTracer`. On the server side, a `StatsHandler` object gets call-outs similarly along with a server call scoped context object, to get `ServerCallTracer` equivalent functionality. @@ -259,7 +259,8 @@ creation should use a `name` that identifies the library, for example, release version of the gRPC library, for example, "1.57.1". The instruments described above will be created from this meter. -Users of the gRPC OpenTelemetry plugin will use the OTel SDK's MeterProvider to +Users of the gRPC OpenTelemetry plugin will use the OpenTelemetry SDK's +MeterProvider to [control the views](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view) and customize the metrics that will be exported. @@ -544,7 +545,7 @@ immediately obvious - `call` can have multiple `attempts` with retries/hedging. * The various gRPC implementations can record the compressed message lengths, but not all implementations can get the uncompressed message length (as - recommended by OTel RPC conventions.) + recommended by OpenTelemetry RPC conventions.) 
This gRFC, hence, intends to override the [General RPC conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/rpc-metrics/) From d24ba7ec9ca30d8f7d346d13c27c5faab176673b Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 28 Sep 2023 17:37:27 +0000 Subject: [PATCH 29/30] Reviewer comments --- A66-otel-stats.md | 260 +++++++++++++++++++++++----------------------- 1 file changed, 131 insertions(+), 129 deletions(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index c9c2be741..fbbc94e3a 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -4,12 +4,14 @@ * Approver: Mark Roth (@markdroth) * Status: In Review * Implemented in: -* Last updated: Jul 20, 2023 +* Last updated: Sep 28, 2023 * Discussion at: https://groups.google.com/g/grpc-io/c/po-deqYEQzE ## Abstract -Propose a metrics data model for gRPC OpenTelemetry metrics. +Describe a cross-language plugin architecture for collecting OpenTelemetry +metrics in the various gRPC implementations and propose a data model for gRPC +OpenTelemetry metrics. ## Background @@ -27,133 +29,6 @@ with OpenTelemetry suggested as the successor framework. 
## Proposal -### Metrics Schema - -#### Units - -Following the -[OpenTelemetry Metrics Semantic Conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/), -the following units are used - - -* Latencies are measured in float64 seconds, `s` -* Sizes are measured in bytes, `By` -* Counts for number of calls are measured in `{call}` -* Counts for number of attempts are measured in `{attempt}` - -Buckets for histograms in default views should be as follows - - -* Latency : 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, - 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, - 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, - 0.8, 1, 2, 5, 10, 20, 50, 100 -* Size : 0, 1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, - 16777216, 67108864, 268435456, 1073741824, 4294967296 -* Count : 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, - 16384, 32768, 65536 - -These buckets were chosen to maintain compatibility with the gRPC OpenCensus -spec. The OpenTelemetry API has added an experimental feature for -[advice](https://opentelemetry.io/docs/specs/otel/metrics/api/#instrument-advice) -that would allow the gRPC library to provide these buckets as a hint. Since this -is still an experimental feature and not yet implemented in all languages, it is -up to the user of the gRPC OpenTelemetry plugin to choose the right bucket -boundaries and set it through the -[OpenTelemetry SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). - -Note that, according to an -[OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26), -changes to bucket boundaries may not be considered as breaking. Depending on the -proposal, this recommendation would change to use `ExponentialHistogram`s -instead, which would allow for automatic adjustments of the scale to better fit -the data. 
- -#### Attributes - -* `grpc.method` : Full gRPC method name, including package, service and - method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow". Note that gRPC - servers can receive arbitrary method names, i.e., method names that have not - been registered in advance with the server. This normally results in those - RPCs being rejected with an UNIMPLEMENTED status. Some gRPC implementations - allow servers to handle such generic method names. Since the stats plugin - would be recording all of these RPCs, this could open up the server to - malicious attacks that result in metrics being stored with a high - cardinality. To prevent this, unregistered/generic method names should by - default be reported with "other" value instead. Implementations should - provide the option to override this behavior to allow recording generic - method names as well. -* `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", - "DEADLINE_EXCEEDED". - [(Full list)](https://grpc.github.io/grpc/core/md_doc_statuscodes.html) -* `grpc.target` : Canonicalized target URI used when creating gRPC Channel, - e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000". - Canonicalized target URI is the form with the scheme included if the user - didn't mention the scheme ([scheme]:///[target]). For channels such as - inprocess channels where a target URI is not available, implementations can - synthesize a target URI. It is possible for some channels to use IP - addresses as target strings and this might again blow up the cardinality. - Implementations should provide the option to override recorded target names - with "other" instead of the actual target. If no such override is provided, - the default behavior will be to record the target as is. - -#### Client Per-Attempt Instruments - -* **grpc.client.attempt.started**
- The total number of RPC attempts started, including those that have not completed.
- *Attributes*: grpc.method, grpc.target
- *Type*: Counter
- *Unit*: `{attempt}`
-* **grpc.client.attempt.duration**
- End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel.
- *Attributes*: grpc.method, grpc.target, grpc.status
- *Type*: Histogram (Latency Buckets)
- *Unit*: `s`
-* **grpc.client.attempt.sent_total_compressed_message_size**
- Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
- Attributes: grpc.method, grpc.target, grpc.status
- Type: Histogram (Size Buckets)
- *Unit*: `By`
-* **grpc.client.attempt.rcvd_total_compressed_message_size**
- Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
- *Attributes*: grpc.method, grpc.target, grpc.status
- *Type*: Histogram (Size Buckets)
- *Unit*: `By`
- -#### Client Per-Call Instruments - -* **grpc.client.call.duration**
- This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective.
- Start timestamp - After the client application starts the RPC.
- End timestamp - Before the status of the RPC is delivered to the application.
- If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization should be included in the measurement.
- *Attributes*: grpc.method, grpc.target, grpc.status
- *Type*: Histogram (Latency Buckets)
- *Unit*: `s`
- -#### Server Instruments - -* **grpc.server.call.started**
- The total number of RPCs started, including those that have not completed.
- *Attributes*: grpc.method
- *Type*: counter
- *Unit*: {call}
-* **grpc.server.call.sent_total_compressed_message_size**
- Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
- *Attributes*: grpc.method, grpc.status
- *Type*: Histogram (Size Buckets)
- *Unit*: `By`
-* **grpc.server.call.rcvd_total_compressed_message_size**
- Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
- *Attributes*: grpc.method, grpc.status
- *Type*: Histogram (Size Buckets)
- *Unit*: `By`
-* **grpc.server.call.duration**
- This metric aims to measure the end2end time an RPC takes from the server transport’s (HTTP2/ inproc) perspective.
- Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
- End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
- *Attributes*: grpc.method, grpc.status
- *Type*: Histogram (Latency Buckets)
- *Unit*: `s`
- ### OpenTelemetry Plugin Architecture This section describes a `CallTracer` approach to collect the client and server @@ -409,6 +284,133 @@ class OpenTelemetryObservability: pass ``` +### Metrics Schema + +#### Units + +Following the +[OpenTelemetry Metrics Semantic Conventions](https://opentelemetry.io/docs/specs/otel/metrics/semantic_conventions/), +the following units are used - + +* Latencies are measured in float64 seconds, `s` +* Sizes are measured in bytes, `By` +* Counts for number of calls are measured in `{call}` +* Counts for number of attempts are measured in `{attempt}` + +Buckets for histograms in default views should be as follows - + +* Latency : 0, 0.00001, 0.00005, 0.0001, 0.0003, 0.0006, 0.0008, 0.001, 0.002, + 0.003, 0.004, 0.005, 0.006, 0.008, 0.01, 0.013, 0.016, 0.02, 0.025, 0.03, + 0.04, 0.05, 0.065, 0.08, 0.1, 0.13, 0.16, 0.2, 0.25, 0.3, 0.4, 0.5, 0.65, + 0.8, 1, 2, 5, 10, 20, 50, 100 +* Size : 0, 1024, 2048, 4096, 16384, 65536, 262144, 1048576, 4194304, + 16777216, 67108864, 268435456, 1073741824, 4294967296 +* Count : 0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, + 16384, 32768, 65536 + +These buckets were chosen to maintain compatibility with the gRPC OpenCensus +spec. The OpenTelemetry API has added an experimental feature for +[advice](https://opentelemetry.io/docs/specs/otel/metrics/api/#instrument-advice) +that would allow the gRPC library to provide these buckets as a hint. Since this +is still an experimental feature and not yet implemented in all languages, it is +up to the user of the gRPC OpenTelemetry plugin to choose the right bucket +boundaries and set it through the +[OpenTelemetry SDK](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). + +Note that, according to an +[OpenTelemetry proposal on stability](https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.dy1cg9doaq26), +changes to bucket boundaries may not be considered as breaking. 
Depending on the +proposal, this recommendation would change to use `ExponentialHistogram`s +instead, which would allow for automatic adjustments of the scale to better fit +the data. + +#### Attributes + +* `grpc.method` : Full gRPC method name, including package, service and + method, e.g. "google.bigtable.v2.Bigtable/CheckAndMutateRow". Note that gRPC + servers can receive arbitrary method names, i.e., method names that have not + been registered in advance with the server. This normally results in those + RPCs being rejected with an UNIMPLEMENTED status. Some gRPC implementations + allow servers to handle such generic method names. Since the stats plugin + would be recording all of these RPCs, this could open up the server to + malicious attacks that result in metrics being stored with a high + cardinality. To prevent this, unregistered/generic method names should by + default be reported with "other" value instead. Implementations should + provide the option to override this behavior to allow recording generic + method names as well. +* `grpc.status` : gRPC server status code received, e.g. "OK", "CANCELLED", + "DEADLINE_EXCEEDED". + [(Full list)](https://grpc.github.io/grpc/core/md_doc_statuscodes.html) +* `grpc.target` : Canonicalized target URI used when creating gRPC Channel, + e.g. "dns:///pubsub.googleapis.com:443", "xds:///helloworld-gke:8000". + Canonicalized target URI is the form with the scheme included if the user + didn't mention the scheme (`scheme://[authority]/path`). For channels such + as inprocess channels where a target URI is not available, implementations + can synthesize a target URI. It is possible for some channels to use IP + addresses as target strings and this might again blow up the cardinality. + Implementations should provide the option to override recorded target names + with "other" instead of the actual target. If no such override is provided, + the default behavior will be to record the target as is. 
+ +#### Client Per-Attempt Instruments + +* **grpc.client.attempt.started**
+ The total number of RPC attempts started, including those that have not completed.
+ *Attributes*: grpc.method, grpc.target
+ *Type*: Counter
+ *Unit*: `{attempt}`
+* **grpc.client.attempt.duration**
+ End-to-end time taken to complete an RPC attempt including the time it takes to pick a subchannel.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Latency Buckets)
+ *Unit*: `s`
+* **grpc.client.attempt.sent_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) sent across all request messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Size Buckets)
+ *Unit*: `By`
+* **grpc.client.attempt.rcvd_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) received across all response messages (metadata excluded) per RPC attempt; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Size Buckets)
+ *Unit*: `By`
+ +#### Client Per-Call Instruments + +* **grpc.client.call.duration**
+ This metric aims to measure the end-to-end time the gRPC library takes to complete an RPC from the application’s perspective.
+ Start timestamp - After the client application starts the RPC.
+ End timestamp - Before the status of the RPC is delivered to the application.
+ If the implementation uses an interceptor then the exact start and end timestamps would depend on the ordering of the interceptors. Non-interceptor implementations should record the timestamps as close as possible to the top of the gRPC stack, i.e., payload serialization should be included in the measurement.
+ *Attributes*: grpc.method, grpc.target, grpc.status
+ *Type*: Histogram (Latency Buckets)
+ *Unit*: `s`
+ +#### Server Instruments + +* **grpc.server.call.started**
+ The total number of RPCs started, including those that have not completed.
+ *Attributes*: grpc.method
+ *Type*: Counter
+ *Unit*: `{call}`
+* **grpc.server.call.sent_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) sent across all response messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.status
+ *Type*: Histogram (Size Buckets)
+ *Unit*: `By`
+* **grpc.server.call.rcvd_total_compressed_message_size**
+ Total bytes (compressed but not encrypted) received across all request messages (metadata excluded) per RPC; does not include grpc or transport framing bytes.
+ *Attributes*: grpc.method, grpc.status
+ *Type*: Histogram (Size Buckets)
+ *Unit*: `By`
+* **grpc.server.call.duration**
+ This metric aims to measure the end-to-end time an RPC takes from the server transport’s (HTTP2/inproc) perspective.
+ Start timestamp - After the transport knows that it's got a new stream. For HTTP2, this would be after the first header frame for the stream has been received and decoded. Whether the timestamp is recorded before or after HPACK is left to the implementation.
+ End timestamp - Ends at the first point where the transport considers the stream done. For HTTP2, this would be when scheduling a trailing header with END_STREAM to be written, or RST_STREAM, or a connection abort. Note that this wouldn’t necessarily mean that the bytes have also been immediately scheduled to be written by TCP.
+ *Attributes*: grpc.method, grpc.status
+ *Type*: Histogram (Latency Buckets)
+ *Unit*: `s`
+ ### Migration from OpenCensus The following sections show the differences between the gRPC OpenCensus spec and From 9b1bcd8fb09c502f4cc23ec6812ab4b2df33dc27 Mon Sep 17 00:00:00 2001 From: Yash Tibrewal Date: Thu, 28 Sep 2023 20:16:41 +0000 Subject: [PATCH 30/30] Moving gRFC to Final status --- A66-otel-stats.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A66-otel-stats.md b/A66-otel-stats.md index fbbc94e3a..9126ca3f3 100644 --- a/A66-otel-stats.md +++ b/A66-otel-stats.md @@ -2,7 +2,7 @@ * Author: Yash Tibrewal (@yashykt) * Approver: Mark Roth (@markdroth) -* Status: In Review +* Status: Final * Implemented in: * Last updated: Sep 28, 2023 * Discussion at: https://groups.google.com/g/grpc-io/c/po-deqYEQzE