Emit cloudprovider and throttling metrics #806
cc @ayberk @AndyXiangLi. We can simply add these metrics, but better yet I would like us to solve #393. There are 3 cloud provider abstractions: our cloud.go, v1/legacy, and v2 (https://github.com/kubernetes/cloud-provider-aws/tree/6c10f4b1ba2d0377665cbaa7a2cbe6730591ec64/pkg/providers/v2), and I want to avoid duplication. I think we should design the v2 interface to replace our cloud.go and then use it as a library. There is also an argument for using v1: for migration purposes, if we want to ensure 1:1 compatibility and avoid AWS-related bugs like handling of eventual consistency, then we should use v1.
Also, if we use v1 then we get these metrics for free, I think, but it's not that hard to add them ourselves, so I don't buy that argument.
/assign
The throttling metrics in particular are hard to emit unless we copy the retry handler and SDK setup from the cloud provider: https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L100. Our custom cloud.go is much simpler. I am still trying to weigh what is easier: copy/pasting what we need from the old cloud provider, or refactoring to depend on it completely...
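For illustration only, one possible way to observe throttle events without copying the whole cloud-provider retry handler is to hook the SDK's request handlers. This is a sketch under that assumption; the metric name, handler name, and error-code list are illustrative, not what the driver actually does:

```go
package metrics

import (
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative metric name; the in-tree cloudprovider uses its own names.
var throttledRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "cloudprovider_aws_api_throttled_requests_total",
		Help: "Number of AWS API calls that were throttled.",
	},
	[]string{"operation_name"},
)

func init() {
	prometheus.MustRegister(throttledRequests)
}

// InstrumentThrottles is a hypothetical helper: it pushes a handler that runs
// after each retry decision and increments the counter for throttle-type errors.
func InstrumentThrottles(sess *session.Session) {
	sess.Handlers.AfterRetry.PushFrontNamed(request.NamedHandler{
		Name: "ebs-csi/throttle-metrics",
		Fn: func(r *request.Request) {
			aerr, ok := r.Error.(awserr.Error)
			if !ok {
				return
			}
			switch aerr.Code() {
			case "Throttling", "ThrottlingException", "RequestLimitExceeded":
				throttledRequests.WithLabelValues(r.Operation.Name).Inc()
			}
		},
	})
}
```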
One bummer when I last investigated the retry/throttling issue: ideally AWS should return a header (I forget what it's called) indicating when it is safe to retry a given request, and clients are supposed to respect it. But in my testing I found that header to be entirely missing, which is why k8s implements its own retry/throttling. It might be nice to offload some of this code to the aws-sdk if we can (though I know that is outside this project's scope).
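As a minimal sketch of what leaning on the SDK's own retry handling could look like (assuming aws-sdk-go v1's client.DefaultRetryer; the retry count and delay floors below are illustrative, not the driver's actual settings):

```go
package cloud

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// newEC2Client builds an EC2 client whose built-in retryer enforces minimum
// backoff delays, so throttled calls still back off even when the API returns
// no Retry-After hint.
func newEC2Client() *ec2.EC2 {
	sess := session.Must(session.NewSession(&aws.Config{
		Retryer: client.DefaultRetryer{
			NumMaxRetries:    8,
			MinRetryDelay:    500 * time.Millisecond,
			MinThrottleDelay: 1 * time.Second,
			MaxRetryDelay:    20 * time.Second,
			MaxThrottleDelay: 60 * time.Second,
		},
	}))
	return ec2.New(sess)
}
```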
Yeah, I agree we should ideally rely on the aws-sdk for retry handling as much as possible. We can configure a min delay so that even if that Retry-After header is missing, we shouldn't need the custom k8s retryer: https://github.com/aws/aws-sdk-go/blob/e2d6cb448883e4f4fcc5246650f89bde349041ec/aws/client/default_retryer.go#L95. (In fact we recently started tuning the SDK built-in retryer already to mitigate an issue with throttling causing leaked volumes: #769. My thinking now is that we should continue tuning it instead of going back and copying the cloudprovider retryer.) So what I will do is copy the metric part [1] from the v1 cloudprovider retryer but not the actual retry logic part [2]. I guess the k8s implementation is already outdated anyway, because the SDK now lets us not only configure the default retryer but provide a totally custom one.

[1] https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L90

In terms of the wider "cloudprovider dependency"/"duplicating code in 2/3 places" issue I was rambling about, what I am picturing now is this:
Another major difference is called out in this comment: https://github.com/kubernetes/cloud-provider-aws/blob/5f394ba297bf280ceb3edfc38922630b4bd83f46/pkg/providers/v1/retry_handler.go#L38
The in-tree cloudprovider emits cloudprovider API metrics and throttling metrics.
These metrics are important for debugging: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/aws/aws_metrics.go
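For a rough idea of what those API metrics could look like on the driver side, here is a sketch of per-call duration and error counters recorded from an SDK completion handler; the metric and handler names are illustrative and may not match the in-tree aws_metrics.go exactly:

```go
package metrics

import (
	"time"

	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/prometheus/client_golang/prometheus"
)

var (
	apiDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "cloudprovider_aws_api_request_duration_seconds",
			Help: "Latency of AWS API calls, by operation.",
		},
		[]string{"request"},
	)
	apiErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "cloudprovider_aws_api_request_errors",
			Help: "Number of AWS API calls that returned an error, by operation.",
		},
		[]string{"request"},
	)
)

func init() {
	prometheus.MustRegister(apiDuration, apiErrors)
}

// InstrumentAPICalls is a hypothetical helper: for each completed SDK request
// it records how long the last attempt took and counts any error, keyed by
// operation name.
func InstrumentAPICalls(sess *session.Session) {
	sess.Handlers.Complete.PushBackNamed(request.NamedHandler{
		Name: "ebs-csi/api-metrics",
		Fn: func(r *request.Request) {
			op := r.Operation.Name
			apiDuration.WithLabelValues(op).Observe(time.Since(r.AttemptTime).Seconds())
			if r.Error != nil {
				apiErrors.WithLabelValues(op).Inc()
			}
		},
	})
}
```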