[router][common] Multiple fixes in Opentelemetry #1483

m-nagarajan · 2025-01-30T00:20:57Z

Summary

1. Added the config otel.venice.metrics.export.interval.in.seconds to control the OpenTelemetry (OTel) metrics export interval. The default is 60 seconds, which matches the current behavior without this config.
2. Updated MetricEntityState to maintain a 1:1 relationship between OTel instruments and Tehuti sensors, rather than a 1:n relationship, to eliminate unnecessary lookups on the hot path.
3. Stopped emitting OTel metrics for the total store in the router. Aggregation will be done on the receiving side. This helps when creating pre-aggregates in the metrics processing systems, since they no longer need to filter with storeName != total.
4. Modified venice.response.status_code_category to use success/fail instead of healthy/unhealthy/tardy/throttled/bad_request, to keep it standard. Tardy/throttled/bad_request can be inferred from the response status.
5. ~~Updated the existing Tehuti metrics key_num and bad_request_key_num to record 1 for single GET requests as well, ensuring consistency.~~ This would churn more dimension objects for every single GET request, so it will be revisited after the dimensions are cached.
6. Introduced a new OTel metric, incoming_key_count, to measure data similar to the Tehuti metrics key_num and bad_request_key_num on the request handling path. This is in line with incoming_call_count.
7. Renamed the existing OTel metric call_key_count (which previously handled the functionality described in point 6) to key_count and converted it into a histogram. This metric now measures key counts on the response handling side, including success/fail details and response codes, similar to call_time, and provides a distribution of key counts.
8. Fixed a bug where the exponential histogram view was configured for only one metric.
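
As a rough illustration of the MetricEntityState change described above (not the actual Venice code), the sketch below holds one OTel-style instrument and one Tehuti-style sensor as direct fields, so recording a value needs no map lookup. The class name OneToOneMetricState and the use of LongConsumer as a stand-in for both metric types are assumptions made for this example.

```java
import java.util.function.LongConsumer;

// Hypothetical sketch, not the real MetricEntityState: exactly one OTel-style
// instrument is paired with exactly one Tehuti-style sensor.
final class OneToOneMetricState {
  private final LongConsumer otelInstrument; // stand-in for an OTel counter or histogram
  private final LongConsumer tehutiSensor;   // stand-in for a Tehuti sensor

  OneToOneMetricState(LongConsumer otelInstrument, LongConsumer tehutiSensor) {
    this.otelInstrument = otelInstrument;
    this.tehutiSensor = tehutiSensor;
  }

  // Hot path: two direct field reads, no lookup by metric or sensor name.
  void record(long value) {
    if (otelInstrument != null) {
      otelInstrument.accept(value);
    }
    if (tehutiSensor != null) {
      tehutiSensor.accept(value);
    }
  }

  public static void main(String[] args) {
    long[] totals = new long[2];
    OneToOneMetricState keyCount =
        new OneToOneMetricState(v -> totals[0] += v, v -> totals[1] += v);
    keyCount.record(10);
    keyCount.record(1);
    System.out.println(totals[0] + " " + totals[1]); // prints "11 11"
  }
}
```

Either side may be null in this sketch, which models running with only one of the two metric systems enabled.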

How was this PR tested?

GH CI. The log below, captured from integration tests, shows all the new changes:

2025-01-29 16:18:12 - [] INFO [VeniceOpenTelemetryMetricsRepository] [PeriodicMetricReader-1] Logging OpenTelemetry metrics for debug purpose: [ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.call_time, description=Latency based on all responses, unit=MILLISECOND, type=EXPONENTIAL_HISTOGRAM, data=ImmutableExponentialHistogramData{aggregationTemporality=DELTA, points=[ImmutableExponentialHistogramPointData{getStartEpochNanos=1738196287869358000, getEpochNanos=1738196292871917000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="single_get", venice.response.status_code_category="success", venice.store.name="store_153aff205485_2ccc527a"}, getScale=3, getSum=48.970623999999994, getCount=200, getZeroCount=0, hasMin=true, getMin=0.130709, hasMax=true, getMax=0.934542, getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: -24, counts: {-24=5,-23=7,-22=9,-21=12,-20=22,-19=20,-18=35,-17=14,-16=17,-15=19,-14=18,-13=10,-12=3,-11=3,-10=3,-9=1,-8=0,-7=0,-6=0,-5=0,-4=1,-3=0,-2=0,-1=1} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], totalCount=0}, getExemplars=[]}, ImmutableExponentialHistogramPointData{getStartEpochNanos=1738196287869358000, getEpochNanos=1738196292871917000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="multi_get_streaming", venice.response.status_code_category="success", venice.store.name="store_153aff205485_2ccc527a"}, getScale=3, getSum=8.768124999999998, getCount=20, getZeroCount=0, hasMin=true, getMin=0.31975, hasMax=true, getMax=0.644333, 
getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: -14, counts: {-14=3,-13=1,-12=4,-11=2,-10=1,-9=4,-8=2,-7=2,-6=1} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], totalCount=0}, getExemplars=[]}]}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.incoming_key_count, description=Count of keys in all requests, unit=NUMBER, type=HISTOGRAM, data=ImmutableHistogramData{aggregationTemporality=DELTA, points=[ImmutableHistogramPointData{getStartEpochNanos=1738196287869358000, getEpochNanos=1738196292871917000, getAttributes={venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="multi_get_streaming", venice.request.validation_outcome="valid", venice.store.name="store_153aff205485_2ccc527a"}, getSum=200.0, getCount=20, hasMin=true, getMin=10.0, hasMax=true, getMax=10.0, getBoundaries=[], getCounts=[20], getExemplars=[]}, ImmutableHistogramPointData{getStartEpochNanos=1738196287869358000, getEpochNanos=1738196292871917000, getAttributes={venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="single_get", venice.request.validation_outcome="valid", venice.store.name="store_153aff205485_2ccc527a"}, getSum=200.0, getCount=200, hasMin=true, getMin=1.0, hasMax=true, getMax=1.0, getBoundaries=[], getCounts=[200], getExemplars=[]}]}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.incoming_call_count, description=Count of all incoming requests, unit=NUMBER, type=LONG_SUM, data=ImmutableSumData{points=[ImmutableLongPointData{startEpochNanos=1738196287869358000, epochNanos=1738196292871917000, 
attributes={venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="multi_get_streaming", venice.store.name="store_153aff205485_2ccc527a"}, value=20, exemplars=[]}, ImmutableLongPointData{startEpochNanos=1738196287869358000, epochNanos=1738196292871917000, attributes={venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="single_get", venice.store.name="store_153aff205485_2ccc527a"}, value=200, exemplars=[]}], monotonic=true, aggregationTemporality=DELTA}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.key_count, description=Count of keys in all responses, unit=NUMBER, type=EXPONENTIAL_HISTOGRAM, data=ImmutableExponentialHistogramData{aggregationTemporality=DELTA, points=[ImmutableExponentialHistogramPointData{getStartEpochNanos=1738196287869358000, getEpochNanos=1738196292871917000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="single_get", venice.response.status_code_category="success", venice.store.name="store_153aff205485_2ccc527a"}, getScale=3, getSum=200.0, getCount=200, getZeroCount=0, hasMin=true, getMin=1.0, hasMax=true, getMax=1.0, getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: -1, counts: {-1=200} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], totalCount=0}, getExemplars=[]}, ImmutableExponentialHistogramPointData{getStartEpochNanos=1738196287869358000, getEpochNanos=1738196292871917000, getAttributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="multi_get_streaming", venice.response.status_code_category="success", 
venice.store.name="store_153aff205485_2ccc527a"}, getScale=3, getSum=200.0, getCount=20, getZeroCount=0, hasMin=true, getMin=10.0, hasMax=true, getMax=10.0, getPositiveBuckets=DoubleExponentialHistogramBuckets{scale: 3, offset: 26, counts: {26=20} }, getNegativeBuckets=EmptyExponentialHistogramBuckets{scale=3, offset=0, bucketCounts=[], totalCount=0}, getExemplars=[]}]}}, ImmutableMetricData{resource=Resource{schemaUrl=null, attributes={}}, instrumentationScopeInfo=InstrumentationScopeInfo{name=venice.router, version=null, schemaUrl=null, attributes={}}, name=venice.router.call_count, description=Count of all requests with response details, unit=NUMBER, type=LONG_SUM, data=ImmutableSumData{points=[ImmutableLongPointData{startEpochNanos=1738196287869358000, epochNanos=1738196292871917000, attributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="single_get", venice.response.status_code_category="success", venice.store.name="store_153aff205485_2ccc527a"}, value=200, exemplars=[]}, ImmutableLongPointData{startEpochNanos=1738196287869358000, epochNanos=1738196292871917000, attributes={http.response.status_code="200", http.response.status_code_category="2xx", venice.cluster.name="venice-cluster_151afd3f0e85_107697df", venice.request.method="multi_get_streaming", venice.response.status_code_category="success", venice.store.name="store_153aff205485_2ccc527a"}, value=20, exemplars=[]}], monotonic=true, aggregationTemporality=DELTA}}]
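
The bucket offsets in this log can be sanity-checked by hand. With scale 3 the histogram base is 2^(1/8), and under the OpenTelemetry exponential histogram mapping a positive value falls in bucket index ceil(log2(value) * 2^scale) - 1. The sketch below is a standalone reimplementation of that formula for illustration only, not the SDK's own indexer; the class and method names are made up, and the logarithm-based form can be off by one for values extremely close to a bucket boundary.

```java
// Illustrative only: computes the exponential-histogram bucket index for a
// positive value, where bucket i covers (base^i, base^(i+1)] and
// base = 2^(2^-scale).
final class ExponentialBucketIndex {
  static int bucketIndex(double value, int scale) {
    // scaleFactor = 2^scale / ln(2), so log(value) * scaleFactor = log2(value) * 2^scale
    double scaleFactor = Math.scalb(1.0, scale) / Math.log(2);
    return (int) Math.ceil(Math.log(value) * scaleFactor) - 1;
  }

  public static void main(String[] args) {
    System.out.println(bucketIndex(10.0, 3));     // 26
    System.out.println(bucketIndex(1.0, 3));      // -1
    System.out.println(bucketIndex(0.130709, 3)); // -24
  }
}
```

A key count of 10 maps to bucket 26 and a key count of 1 to bucket -1, matching offset: 26 and offset: -1 in the key_count points. The minimum latency 0.130709 maps to bucket -24, matching the offset of the first call_time point.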

Does this PR introduce any user-facing changes?

No. You can skip the rest of this section.
Yes. Make sure to explain your proposed changes and call out the behavior change.

Added the otel.venice.metrics.export.interval.in.seconds config, with a default value of 60 seconds.
The venice.response.status_code_category dimension now emits the values success/fail.
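
For readers who want to see the config semantics spelled out, here is a minimal sketch of resolving the export interval with its 60-second default. How Venice actually reads this property is not shown in this PR description; the class ExportIntervalConfig, the use of java.util.Properties, and the rejection of non-positive values are assumptions for illustration. Only the key name and the default come from the text above.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper: resolves the OTel export interval from a property bag.
final class ExportIntervalConfig {
  static final String KEY = "otel.venice.metrics.export.interval.in.seconds";
  static final long DEFAULT_SECONDS = 60; // default stated in the PR description

  static Duration exportInterval(Properties props) {
    String raw = props.getProperty(KEY);
    long seconds = (raw == null) ? DEFAULT_SECONDS : Long.parseLong(raw.trim());
    if (seconds <= 0) {
      // Assumption: a non-positive interval is treated as a config error.
      throw new IllegalArgumentException(KEY + " must be positive, got " + seconds);
    }
    return Duration.ofSeconds(seconds);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(exportInterval(props).getSeconds()); // 60
    props.setProperty(KEY, "5");
    System.out.println(exportInterval(props).getSeconds()); // 5
  }
}
```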

… default
2. Change venice.response.status_code_category to hold success/fail instead of healthy/unhealthy/tardy/throttled/bad_request
3. Change the existing Tehuti metrics key_num and bad_request_key_num to record metrics (1) for single gets as well, to keep things uniform
4. Introduce the OTel metric incoming_key_count, which will measure data similar to key_num and bad_request_key_num at the request handling path
5. Change the OTel metric call_key_count to key_count, which will now measure key counts on the response handling side with success/fail details as well as response codes

...-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceResponseStatusCategory.java

ZacAttack · 2025-02-06T20:14:13Z

services/venice-router/src/main/java/com/linkedin/venice/router/api/VenicePathParser.java

+
+      // record key num details for all types of requests to keep the metrics behavior uniform
+      keyNum = path.getPartitionKeys().size();
+      aggRouterHttpRequestStats.recordKeyNum(storeName, keyNum);


This looks maybe a touch heavy to me? I don't think we need to address it in this PR, but walking down this code path we're going to generate some objects for every single key lookup. We should monitor GC behavior/latency when enabling OTel after this change.

m-nagarajan · 2025-02-06T21:59:47Z

Do you mean within recordKeyNum()? That is a good point.
One of the changes I will be working on next is to precreate/cache the Attributes objects, so we don't have to generate them every time. I will take a look at this.

ZacAttack · 2025-02-07

If we have a safety measure coming up, I think we should either put the above tweak behind a config that we remove once your cache changes are ready, or just not include this portion in this PR. My fear is that we go to the site with something that is detrimental to performance.

m-nagarajan · 2025-02-08

Agreed, and I reverted this change in this PR. I will add it as part of the next PR with the safety measure. Thanks.
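
To make the caching idea from this thread concrete, here is a minimal sketch of building a dimension set once per (store, request method) pair and reusing it on later requests. It is not the planned Venice change: DimensionCache and dimensionsFor are hypothetical names, and a plain unmodifiable Map stands in for an OTel Attributes object. Only the dimension keys are taken from the log in the PR description.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: caches one immutable dimension map per (store, method).
final class DimensionCache {
  private final Map<String, Map<String, Map<String, String>>> byStore = new ConcurrentHashMap<>();

  Map<String, String> dimensionsFor(String storeName, String requestMethod) {
    return byStore
        .computeIfAbsent(storeName, s -> new ConcurrentHashMap<>())
        .computeIfAbsent(requestMethod, m -> build(storeName, m));
  }

  // Runs only on the first request for a given (store, method) pair.
  private static Map<String, String> build(String storeName, String requestMethod) {
    Map<String, String> dims = new LinkedHashMap<>();
    dims.put("venice.store.name", storeName);
    dims.put("venice.request.method", requestMethod);
    return Collections.unmodifiableMap(dims);
  }

  public static void main(String[] args) {
    DimensionCache cache = new DimensionCache();
    Map<String, String> first = cache.dimensionsFor("store_a", "single_get");
    Map<String, String> second = cache.dimensionsFor("store_a", "single_get");
    System.out.println(first == second); // true: the same instance is reused
  }
}
```

In this sketch only the dimension map is reused; whether the lookup itself is allocation-free depends on the JVM, which is one reason the GC monitoring suggested above would still be worthwhile.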

ZacAttack · 2025-02-06T20:29:21Z

...ces/venice-router/src/test/java/com/linkedin/venice/router/stats/RouterMetricEntityTest.java

                VeniceMetricsDimensions.HTTP_RESPONSE_STATUS_CODE_CATEGORY,
                VeniceMetricsDimensions.VENICE_RESPONSE_STATUS_CODE_CATEGORY)));
    expectedMetrics.put(
-        RouterMetricEntity.CALL_KEY_COUNT,
+        RouterMetricEntity.INCOMING_KEY_COUNT,


Can you explain this change? What semantic is this name change trying to convey?

m-nagarajan · 2025-02-06T21:48:20Z

For counting requests we have:

incoming_call_count at the request handling path

call_count at the response handling path

This helps us correlate incoming requests with handled responses. I was not inclined to be optimistic here and measure these only at the response handling path.
This change makes key counting work the same way (i.e., incoming_key_count and key_count), so that we can get similar insights from the response handling path, such as pass/fail.

ZacAttack · 2025-02-07

I see. Let's document that, maybe even with a bit of javadoc. I'm not sure I would understand this when looking at a dashboard, and I'm not sure I'm clever enough to suggest a rename that would capture this nuance.

m-nagarajan · 2025-02-08

MetricEntity has a description field that carries this info. I made it more explicit and added javadoc as well.
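
The distinction discussed in this thread can be sketched as two recording points: one when the request is parsed and one when the response is handled. The code below is an illustrative stand-in, not the router's implementation. KeyCountRecorder and its methods are hypothetical, metrics are collected as strings instead of OTel data points, and the "invalid" outcome value is assumed (the log in the PR description only shows "valid"). The success flag is passed in by the caller because the rule for deciding success versus fail is not spelled out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of request-path vs response-path key counting.
final class KeyCountRecorder {
  final List<String> emitted = new ArrayList<>();

  // "200" -> "2xx", "429" -> "4xx", as seen in http.response.status_code_category.
  static String httpCategory(int statusCode) {
    return (statusCode / 100) + "xx";
  }

  // Request handling path: no response details are known yet.
  void onRequestParsed(String method, int keyCount, boolean valid) {
    emitted.add("incoming_key_count{method=" + method
        + ",validation_outcome=" + (valid ? "valid" : "invalid") + "} " + keyCount);
  }

  // Response handling path: status code and success/fail are attached.
  void onResponse(String method, int keyCount, int statusCode, boolean success) {
    emitted.add("key_count{method=" + method
        + ",status=" + statusCode
        + ",http_category=" + httpCategory(statusCode)
        + ",venice_category=" + (success ? "success" : "fail") + "} " + keyCount);
  }

  public static void main(String[] args) {
    KeyCountRecorder recorder = new KeyCountRecorder();
    recorder.onRequestParsed("multi_get_streaming", 10, true);
    recorder.onResponse("multi_get_streaming", 10, 200, true);
    for (String line : recorder.emitted) {
      System.out.println(line);
    }
  }
}
```

Comparing the two series per store and method shows how many incoming keys never reached the response path, which is the correlation described above.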

ZacAttack

These changes overall look good, I'm giving a provisional ship it pending some of the things I've asked. I think overall I have some latent concern about the enum tweak and the potential new overhead for single key lookup, but aside from that I don't have any major objections. Thanks!

m-nagarajan

Thanks @ZacAttack for the review. Replied to your comments.

...-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceResponseStatusCategory.java

m-nagarajan · 2025-02-06T21:48:20Z

...ces/venice-router/src/test/java/com/linkedin/venice/router/stats/RouterMetricEntityTest.java

m-nagarajan · 2025-02-06T21:59:47Z

services/venice-router/src/main/java/com/linkedin/venice/router/api/VenicePathParser.java


lluwm

Looks good and it makes sense to me!

m-nagarajan added 8 commits January 27, 2025 16:44

make MetricEntityState 1 otel instrument : 1 tehuti sensor

79e53fc

Fix unit test failures due to missing return for mock

34216c5

Stop emitting otel metrics for total store in router

1a3e7eb

make key_count metric to histogram to capture the distribution

3b53473

fix jdk 8 failure

d335a60

add tests

8f2b70b

Fix a bug where exponential histogram view was set only for only 1 me…

e525805

…tric

ZacAttack reviewed Feb 6, 2025

...-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceResponseStatusCategory.java (resolved)

ZacAttack reviewed Feb 6, 2025

ZacAttack previously approved these changes Feb 6, 2025

m-nagarajan commented Feb 6, 2025

add javadoc for new otel metrics and updated desc

6b17988

m-nagarajan dismissed ZacAttack’s stale review via 6b17988 February 8, 2025 00:13

revert recording key num for single get pending optimization for attr…

ced945c

…ibute generation

lluwm approved these changes Feb 8, 2025
