Add comprehensive tracing #8578
@guseggert: a couple of quick questions on this:
@guseggert: thanks! With local barebones traces, what kind of metrics will we get?
We can get timing info and some basic request metadata for BlockStore, BlockService, DAGService, PeerStore, Pinner, and IPNS, possibly others. As a simple example, here's a request I recorded locally showing what's happening as…
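For illustration only (this is a rough sketch, not code from the PR): recording that kind of timing and metadata with the OpenTelemetry Go API just means wrapping the operation in a span. The `traced` helper, the operation name, and the `ipfs.cid` attribute key below are all hypothetical placeholders.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// traced wraps an arbitrary operation in a span so the exported trace
// carries its duration plus a few low-cardinality attributes.
func traced(ctx context.Context, op string, attrs []attribute.KeyValue, fn func(context.Context) error) error {
	ctx, span := otel.Tracer("go-ipfs-example").Start(ctx, op, trace.WithAttributes(attrs...))
	defer span.End()
	return fn(ctx)
}

func main() {
	ctx := context.Background()
	// e.g. a BlockService lookup; the op name and attribute are illustrative only.
	_ = traced(ctx, "BlockService.GetBlock",
		[]attribute.KeyValue{attribute.String("ipfs.cid", "bafy...")},
		func(ctx context.Context) error {
			// ...call into the datastore / exchange here...
			return nil
		})
}
```

Without an SDK configured this uses the no-op tracer, so it can be sprinkled through libraries safely; the application decides whether spans are actually exported.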
cc: @gmasgras for visibility
Draft PR for gateway tracing: #8595
Netops has a summit this week that they are busy with, so things will be slow on the infra side for now. @thattommyhall is working on code changes to install a collector on the public gateway hosts.
@thattommyhall Could you share the current state of this implementation?
FYI I gathered some feedback around HTTP-request metrics in #8441.
2022-03-24 conversation: @guseggert is going to define the done criteria for the current pass at this. He'll also list potential follow-ups. ipfs/go-datastore#188 is a related PR.
I would suggest decoupling the technical implementation from the infrastructure deployment, since having tracing capability earlier is very valuable. I reviewed #8595 and left a few minor comments, but based on my testing I consider that PR essentially done. It's enough to start gathering traces at a small scale, and threading tracing code through the lower-level libraries is going to be extremely valuable.

However, getting useful traces at scale on our infrastructure needs more work. I don't think it's feasible or desirable to gather traces from all requests (sample rate 1) due to the volume, and the utility of sampling at lower levels is limited if it doesn't include the most interesting samples. I'd like to see some more sampling options added so we can restrict by top-level API operation (e.g. Gateway.Request) and by trace time (e.g. only record traces for requests with a duration greater than 5s).

I used the go-ipfs branch and added tracing to go-namesys, go-path and go-blockservice to experiment a bit. The results are really useful: sample trace bcaeefd4fd6df40ca4f93d8ffaf83a5a
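As a rough sketch of the operation-based sampling mentioned above (not something the OpenTelemetry Go SDK ships out of the box), a custom sampler can record every span whose name is in an allow-list and defer everything else to a low-ratio sampler. The `opSampler` type and the 1% fallback rate are illustrative; duration-based selection ("only traces slower than 5s") would additionally need tail-based sampling in the collector, since duration isn't known when a span starts.

```go
package main

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// opSampler records every root span whose name is in allowed
// (e.g. "Gateway.Request") and defers all other spans to a fallback sampler.
type opSampler struct {
	allowed  map[string]bool
	fallback sdktrace.Sampler
}

func (s opSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	if s.allowed[p.Name] {
		return sdktrace.SamplingResult{
			Decision:   sdktrace.RecordAndSample,
			Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
		}
	}
	return s.fallback.ShouldSample(p)
}

func (s opSampler) Description() string { return "opSampler" }

func main() {
	// ParentBased keeps children consistent with whatever the root decided.
	sampler := sdktrace.ParentBased(opSampler{
		allowed:  map[string]bool{"Gateway.Request": true},
		fallback: sdktrace.TraceIDRatioBased(0.01), // 1% of everything else
	})
	otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler)))
}
```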
I forgot to mention that I can contribute time to getting this issue moved forward.
Reopening this one since it is a larger tracking issue. #8595 was the initial push.
I'm working on adding tracing to go-blockservice and go-path (noting this here to avoid duplication of effort).
High-level go-bitswap spans: ipfs/go-bitswap#562
go-blockservice: ipfs/go-blockservice#91
go-path: ipfs/go-path#59
Maybe add the peer ID in those tags. If tracing multiple nodes, this way we can tell which node is having problems.
This might cause cardinality problems with systems that are ingesting the traces. There are a lot of possible values for peer id. |
Absolutely, there are already some warnings in the Grafana cardinality dashboard re: the hydras labelling peer ID (they are at 2k), so care is needed.
That said, I think being able to tell the origin of a trace is quite useful for finding nodes that are having problems. Any alternative?
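One possibility worth noting (purely a sketch, not a decision for this issue): OpenTelemetry lets a process attach its identity once as a resource on the tracer provider, so every exported span carries the node's peer ID without it becoming a per-span tag or metrics label. The `ipfs.peer_id` key below is an illustrative name, not an agreed convention.

```go
package main

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// newTracerProvider tags every span emitted by this process with the node's
// peer ID via the resource, rather than as a per-span attribute.
func newTracerProvider(peerID string) (*sdktrace.TracerProvider, error) {
	res, err := resource.Merge(resource.Default(),
		resource.NewWithAttributes(semconv.SchemaURL,
			semconv.ServiceNameKey.String("go-ipfs"),
			attribute.String("ipfs.peer_id", peerID), // illustrative key
		))
	if err != nil {
		return nil, err
	}
	return sdktrace.NewTracerProvider(sdktrace.WithResource(res)), nil
}

func main() {
	tp, err := newTracerProvider("12D3KooW...") // hypothetical peer ID
	if err != nil {
		panic(err)
	}
	otel.SetTracerProvider(tp)
}
```

Whether backends then index that resource attribute (and hit the same cardinality concerns) is a separate question for the ingesting system.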
Master issue for adding comprehensive tracing throughout the go-ipfs codebase.
Some requirements:
I plan to implement this against OpenTelemetry. Tactically I want to start at a high level (e.g. gateway reqs) and iteratively work down into the guts, driven by trying to answer questions about real requests so that we are instrumenting the right things.
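As an example of what the highest-level spans could look like, wrapping the gateway's HTTP handler with the otelhttp contrib instrumentation is enough to get one span per request, with lower-level spans (blockservice, namesys, …) hanging off the request context. This is a sketch using a placeholder mux, not the real gateway handler; the "Gateway.Request" operation name is also just illustrative.

```go
package main

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// Placeholder for the real gateway handler/mux.
	mux := http.NewServeMux()
	mux.HandleFunc("/ipfs/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("..."))
	})

	// Every request gets a top-level "Gateway.Request" span; spans started
	// further down the call stack become its children via the request context.
	handler := otelhttp.NewHandler(mux, "Gateway.Request")
	http.ListenAndServe(":8080", handler)
}
```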
Some related issues:
Todos: