[router][common] Multiple fixes in Opentelemetry #1483
base: main
… default
2. Change venice.response.status_code_category to hold success/fail instead of healthy/unhealthy/tardy/throttled/bad_request.
3. Change the existing Tehuti metrics key_num and bad_request_key_num to record 1 for single gets as well, to keep things uniform.
4. Introduce the OTel metric incoming_key_count, which measures data similar to key_num and bad_request_key_num at the request handling path.
5. Change the OTel metric call_key_count to key_count, which now measures key counts on the response handling side with success/fail details as well as response codes.
...-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceResponseStatusCategory.java
// record key num details for all types of requests to keep the metrics behavior uniform
keyNum = path.getPartitionKeys().size();
aggRouterHttpRequestStats.recordKeyNum(storeName, keyNum);
This looks maybe a touch heavy to me? I don't think we need to address it in this PR, but walking down this code path we're gonna generate some objects for every single key lookup. We should monitor GC behavior/latency when enabling otel after this change.
Do you mean within recordKeyNum()? That is a good point.
One of the changes I will be working on next is to precreate/cache the Attributes objects, so we don't have to generate them every time. I will take a look at this.
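The caching idea described in this reply could look roughly like the sketch below. It is only an illustration: Dimensions is a hypothetical stand-in for the OTel Attributes object, DimensionsCache and the dimension key venice.store.name are made-up names, and the actual follow-up change may be structured quite differently.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an immutable OTel Attributes object. */
final class Dimensions {
  private final Map<String, String> values;

  Dimensions(Map<String, String> values) {
    // Defensive copy so the cached instance can be shared safely.
    this.values = Collections.unmodifiableMap(new LinkedHashMap<>(values));
  }

  Map<String, String> asMap() {
    return values;
  }
}

/** Hypothetical cache: builds the Dimensions for a store once and reuses it on the hot path. */
final class DimensionsCache {
  private final ConcurrentHashMap<String, Dimensions> cache = new ConcurrentHashMap<>();
  private final AtomicInteger created = new AtomicInteger();

  Dimensions forStore(String storeName) {
    // computeIfAbsent runs the factory at most once per key, even under contention.
    return cache.computeIfAbsent(storeName, name -> {
      created.incrementAndGet();
      Map<String, String> values = new LinkedHashMap<>();
      values.put("venice.store.name", name); // illustrative dimension key, not from the PR
      return new Dimensions(values);
    });
  }

  /** Number of Dimensions objects actually allocated; useful for checking the cache works. */
  int createdCount() {
    return created.get();
  }
}

final class DimensionsCacheDemo {
  public static void main(String[] args) {
    DimensionsCache cache = new DimensionsCache();
    for (int i = 0; i < 1000; i++) {
      cache.forStore("store_a"); // simulates 1000 single-key lookups for one store
    }
    System.out.println("objects created: " + cache.createdCount()); // objects created: 1
  }
}
```

With a cache like this, the per-request cost on the single get path is one map lookup rather than one new dimensions object, which is the GC concern raised above.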
If we have a safety measure coming up I think we should either throw the above tweak behind a config that we remove once your cache changes are ready, or, we just don't include this portion in this PR? My fear is we go to the site with something that is detrimental to performance.
Agreed, and reverted this change in this PR. Will add it as part of the next PR with the safety measure. Thanks.
VeniceMetricsDimensions.HTTP_RESPONSE_STATUS_CODE_CATEGORY,
VeniceMetricsDimensions.VENICE_RESPONSE_STATUS_CODE_CATEGORY)));
expectedMetrics.put(
RouterMetricEntity.CALL_KEY_COUNT,
RouterMetricEntity.INCOMING_KEY_COUNT,
Can you explain this change actually? What is the semantic that's trying to be conveyed in this name change?
For counting requests we have:
incoming_call_count at the request handling path
call_count at the response handling path
This helps us correlate incoming requests with the ones handled on the response path. I was not inclined to be optimistic here and measure these only at the response handling path.
This change makes key counting work the same way (i.e. incoming_key_count and key_count), so that we can get similar insights from the response handling path, like pass/fail, etc.
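To make the two recording points concrete, here is a small sketch. KeyCountRecorder and its methods are hypothetical names used only for illustration, plain counters stand in for the OTel instruments (in the PR key_count is a histogram, not a sum), and encoding the success/fail dimension as a name suffix is purely a simplification. Only the metric names themselves come from this thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical recorder showing where each metric would be recorded. */
final class KeyCountRecorder {
  private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

  private void add(String metric, long value) {
    counters.computeIfAbsent(metric, m -> new LongAdder()).add(value);
  }

  /** Request handling path: recorded when the request arrives, before the outcome is known. */
  void onRequest(int keyCount) {
    add("incoming_call_count", 1);
    add("incoming_key_count", keyCount);
  }

  /** Response handling path: recorded once the outcome is known, so it can carry success/fail. */
  void onResponse(int keyCount, boolean success) {
    String category = success ? "success" : "fail";
    add("call_count." + category, 1);
    add("key_count." + category, keyCount);
  }

  long get(String metric) {
    LongAdder adder = counters.get(metric);
    return adder == null ? 0 : adder.sum();
  }
}

final class KeyCountRecorderDemo {
  public static void main(String[] args) {
    KeyCountRecorder recorder = new KeyCountRecorder();
    recorder.onRequest(10);
    recorder.onResponse(10, true);
    recorder.onRequest(5); // this request never reaches the response path
    System.out.println(recorder.get("incoming_key_count")); // 15
    System.out.println(recorder.get("key_count.success"));  // 10
  }
}
```

The gap between the incoming_* totals and the response-side totals is what lets a dashboard spot requests that arrived but were never accounted for on the response path.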
I see. Let's document that, maybe even with a bit of javadoc. I'm not sure I would understand this when looking at a dashboard, and I'm not sure I'm clever enough to suggest a rename that would capture this nuance.
MetricEntity has a description field which has this info. Made it more explicit and added Javadoc as well.
These changes overall look good, I'm giving a provisional ship it pending some of the things I've asked. I think overall I have some latent concern about the enum tweak and the potential new overhead for single key lookup, but aside from that I don't have any major objections. Thanks!
Thanks @ZacAttack for the review. Replied to your comments.
Looks good and it makes sense to me!
Summary
1. … otel.venice.metrics.export.interval.in.seconds for OpenTelemetry (OTel) metrics export, with a default value of 60 seconds, which matches the current behavior without this config.
2. … MetricEntityState to maintain a 1:1 relationship between OTel instruments and Tehuti sensors, rather than a 1:n relationship, to eliminate unnecessary lookups during the hot path.
3. … total store in the router. The aggregation will be done on the receiving side. This helps when creating pre-aggregates in the metrics processing systems, since they no longer have to filter on storeName != total.
4. Changed venice.response.status_code_category to use success/fail instead of healthy/unhealthy/tardy/throttled/bad_request, to keep it standard. Tardy/throttled/bad_request can be inferred from the response status.
5. Updated the existing Tehuti metrics key_num and bad_request_key_num to record 1 for single GET requests as well, ensuring consistency. Note: this will churn more dimension objects for all single GET requests, so will revisit this after caching the dimensions.
6. Introduced the OTel metric incoming_key_count, to measure data similar to the Tehuti metrics key_num and bad_request_key_num at the request handling path. This is in line with incoming_call_count.
7. Changed call_key_count (which previously handled the functionality described in point 6) to key_count and converted it into a histogram. This metric now measures key counts on the response handling side, including success/fail details and response codes, similar to call_time, and provides a distribution for key counts.
How was this PR tested?
GH CI and the log below from integration tests show all the new changes.
Does this PR introduce any user-facing changes?
otel.venice.metrics.export.interval.in.seconds config with a default value of 60 seconds.
venice.response.status_code_category dimension will emit values success/fail.
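As a rough illustration of the success/fail dimension, the sketch below uses a hypothetical enum, ResponseStatusCategorySketch, which is not the actual VeniceResponseStatusCategory class. The rule that any 2xx status counts as success is an assumption made for the example, and the router's real classification may differ. The detail helper relies only on the standard HTTP meanings of 400 (Bad Request) and 429 (Too Many Requests) to show how finer detail could be read off the status code rather than a dedicated category.

```java
/** Hypothetical two-valued category; the dimension values success/fail come from the PR summary. */
enum ResponseStatusCategorySketch {
  SUCCESS("success"),
  FAIL("fail");

  private final String dimensionValue;

  ResponseStatusCategorySketch(String dimensionValue) {
    this.dimensionValue = dimensionValue;
  }

  String getDimensionValue() {
    return dimensionValue;
  }

  /** Assumption for illustration only: treat any 2xx status as success, everything else as fail. */
  static ResponseStatusCategorySketch fromHttpStatus(int statusCode) {
    return statusCode >= 200 && statusCode < 300 ? SUCCESS : FAIL;
  }
}

final class StatusCategoryDemo {
  /** Finer-grained detail derived from the status code instead of a dedicated category value. */
  static String detail(int statusCode) {
    switch (statusCode) {
      case 400:
        return "bad_request";
      case 429:
        return "throttled";
      default:
        return "other";
    }
  }

  public static void main(String[] args) {
    int[] statuses = {200, 400, 429};
    for (int status : statuses) {
      System.out.println(status + " -> "
          + ResponseStatusCategorySketch.fromHttpStatus(status).getDimensionValue()
          + " (" + detail(status) + ")");
    }
  }
}
```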