Add Distributed Tracing using OpenTelemetry #12460
/cc @ptabor @wenjiaswe |
Today, tracing in etcd is implemented in a similar fashion to Kubernetes (https://github.com/kubernetes/utils/tree/master/trace). /cc original author for etcd tracing: @YoyinZyc |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Hey, @dashpole 👋 I am always on board when it comes to adding more observability signals into any project and tracing is something that is very useful! With that being said, I have two concerns with adding OpenTelemetry:
Thank you! |
For latency of creating a span, the numbers are currently within 0.01 milliseconds:
I'll ask about memory usage and get back. The first RC is expected to be cut in 2-3 weeks, and 1.0 for the go client is a prerequisite for beta in kubernetes. I'd expect a 1.0 in 2 months or so, to be conservative. But it's OSS, so anything can happen. |
@dashpole Thank you for that! I think having one trace context propagated from apiserver to etcd would be amazing for insight! But code freeze for etcd is mid-May (#12330 (comment)), so I am wondering whether we will even be able to make that deadline, given that the OpenTelemetry 1.0 Go client is 2 weeks or so from being stable. Happy to help drive this along! |
I suspect a lot of requests will be served directly from the watch-cache, so we probably shouldn't expect all requests to trace back to etcd. Paginated lists will, watches won't, and I can't remember whether we are serving strongly consistent reads from the cache (there was a KEP proposing it at one point). |
I think we should definitely still do it; I'm just trying to temper expectations, that's all. |
Following-up from the earlier comment, there isn't a fixed-size buffer. It looks like this is a case where we will just have to try it to find out how much cpu/memory it takes... The timing would probably be too tight to make it into the next etcd release. |
@logicalhan Makes sense, thank you for that!
@dashpole Thanks! I will play around a bit and do a POC locally to see where the CPU/memory usage is now. I know it was fairly high in the past, but let's see the current state, and I will report back. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Reopening to track graduation of distributed tracing flags as proposed in #13775 |
Can I upgrade to the latest version? open-telemetry/opentelemetry-go#2676 |
Hi. We had some long apply times in our etcd cluster used by OKD, and enabled tracing to help debug this. I was able to set it up with We checked a few traces, and a few elements are missing for it to be really usable. For RPCs received and sent, only the gRPC service and method names are included, along with status and uncompressed_sizes info, but no details about the request and response. Here are some events as logged by
I believe it would be good to include more gRPC data in spans. For large requests, truncate some big fields if needed (e.g. body content on create / get response / put request, or the list of all keys in a list response), but other fields should be included (most notably the key in get / put / watch requests). I checked https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc and, as far as I can see, there is no transparent way of doing this automatically, but maybe adding a few manual attributes (e.g. key / key range) in It also looks like only the public gRPC API server is intercepted and instrumented, while gRPC clients and the internal (between replicas) gRPC server are not instrumented, which again reduces usability, especially for things like request forwarding from a non-master to the master (leader), and sending proposals (e.g. for Put and Txn) to other cluster members. |
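To make the suggestion above concrete, here is a minimal, language-neutral sketch in Python of the attribute-building logic: always include the key (and range end), clip oversized fields, and record only the size of the value body. This is an illustration, not etcd's or otelgrpc's actual code. The function name `span_attributes`, the `etcd.*` attribute names, and the 64-character default limit are all assumptions made for the example.

```python
def span_attributes(method, key, range_end=b"", value=b"", max_len=64):
    """Build a small attribute dict for one RPC span.

    Hypothetical helper: the "etcd.*" attribute names and the default
    max_len are illustrative assumptions, not an existing etcd API.
    """

    def clip(raw):
        # Keys are bytes in etcd; decode leniently so odd bytes cannot raise.
        text = raw.decode("utf-8", errors="replace")
        if len(text) <= max_len:
            return text
        # Keep a bounded prefix and mark the field as truncated.
        return text[:max_len] + "...(truncated)"

    attrs = {"rpc.method": method, "etcd.key": clip(key)}
    if range_end:
        attrs["etcd.range_end"] = clip(range_end)
    if value:
        # Record only the size of the body, never the body itself.
        attrs["etcd.value_size"] = len(value)
    return attrs
```

For example, `span_attributes("Put", b"/registry/pods/default/nginx", value=b"x" * 1000)` yields the method, the full key, and a value size of 1000, while a 100-byte key with `max_len=10` is cut to its first 10 characters plus the truncation marker.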
Discussed during sig-etcd triage meeting. This is potentially a candidate for graduation via the new design for etcd server feature flags in future. We could still add an e2e test, assigned to @vivekpatani |
@jmhbnz: GitHub didn't allow me to assign the following users: vivekpatani. Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. |
Prior issues/PRs related to distributed tracing: #11166, #9242, #5425.
In the Kubernetes instrumentation SIG, we are working to add Distributed Tracing support using OpenTelemetry to the kube-apiserver: kubernetes/enhancements/keps/sig-instrumentation/647-apiserver-tracing. We plan to have Alpha (disabled by default) support for this feature in Kubernetes v1.22. Progress can be tracked in kubernetes/enhancements#647. This will generate spans for incoming and outgoing requests, including requests to etcd. It will also propagate the W3C trace context and W3C baggage along with outgoing requests.
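For readers unfamiliar with the W3C trace context mentioned above: it travels in a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below, in Python using only the standard library, shows how a receiver could parse that header and how a caller could build the header for an outgoing request in the same trace. It is a simplified illustration that accepts only the version `00` layout, not the OpenTelemetry implementation. The function names `parse_traceparent` and `child_traceparent` are assumptions made for this example.

```python
import re

# Layout of a version-00 traceparent header:
#   2 hex (version) - 32 hex (trace id) - 16 hex (parent span id) - 2 hex (flags)
_TRACEPARENT_RE = re.compile(
    r"(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})"
)


def parse_traceparent(header):
    """Return the parsed trace context as a dict, or None if the header is invalid."""
    m = _TRACEPARENT_RE.fullmatch(header.strip())
    if m is None or m["version"] != "00":
        return None  # simplified: only the version 00 layout is handled
    if m["trace_id"] == "0" * 32 or m["parent_id"] == "0" * 16:
        return None  # all-zero ids are invalid per the specification
    return {
        "trace_id": m["trace_id"],
        "parent_id": m["parent_id"],
        # The lowest bit of the flags byte is the "sampled" flag.
        "sampled": bool(int(m["flags"], 16) & 0x01),
    }


def child_traceparent(parent, new_span_id):
    """Build the header for an outgoing call: same trace id, new span id."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"
```

The point for this proposal is that the trace id stays the same across hops, so a span created inside etcd would attach to the same trace as the kube-apiserver span that issued the request.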
This presents an opportunity to make etcd easier to debug when using Kubernetes. Since distributed tracing provides a unified view of a single request across multiple components, a trace including the API Server and etcd would show both the API call to the API Server and the resulting etcd call, providing better context on the call to etcd. Based on #5425 (comment), there is already interest in distributed tracing, so I won't elaborate further.
Previous attempts at distributed tracing looked into OpenTracing, OpenCensus, and Jaeger formats and libraries. Thankfully, these have since unified behind the OpenTelemetry project, which makes a potential evaluation considerably easier. OpenTelemetry will be GA soon, so now seems like the right time to open a discussion around adding distributed tracing to etcd.
ccing people who have been involved with or interested in this topic.
@logicalhan @tedsuo @ehashman @brancz @bg451 @bhs @gyuho @jingyih @serathius