
Use tracing for improved diagnostics #1533

Closed · thomaseizinger opened this issue Mar 31, 2020 · 6 comments · Fixed by #4282
@thomaseizinger (Contributor) commented Mar 31, 2020

Tracing: A scoped, structured logging and diagnostics system.

rust-libp2p is designed as an async, event-based system. This makes it fairly hard to add rich logs in user land because peer information is not always available. For example, the InboundUpgrade trait doesn't know which peer it actually connected to.

It might be worth considering adding a dependency on the tracing crate and creating Spans around the calls to these methods that set certain variables in the context.

A downstream crate that also uses tracing would then automatically pick those scopes up, and messages printed to the log file would include this context. For example, a log message like "Upgrading inbound substream to /foo/bar/1.0.0" would show up as:

Mar 31 15:38:36.158 INFO libp2p{peer_id=QmbhXbFML9JbBHr96eHKXWbc7t2fWCLdnh6aNbzFAoz2rw} app::network: Upgrading inbound substream to /foo/bar/1.0.0

In this example, libp2p is a tracing scope and peer_id is a variable inside that scope.

Scopes also have log levels: a "trace"-level scope will not show up as part of an "info"-level message, etc.
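For illustration, here is a minimal sketch of how such a scope could be created with the tracing crate (the span name, field, and `handle_inbound` function are made up for this example, not rust-libp2p API; it assumes the `tracing` and `tracing-subscriber` crates as dependencies):

```rust
use tracing::{info, info_span};

fn handle_inbound(peer_id: &str) {
    // Open an info-level span named "libp2p" with `peer_id` as a field.
    // A trace-level span would be created with `trace_span!` instead.
    let span = info_span!("libp2p", peer_id = %peer_id);
    let _guard = span.enter();

    // While the guard is alive, every event is emitted inside the span
    // and therefore carries the `peer_id` context automatically.
    info!("Upgrading inbound substream to /foo/bar/1.0.0");
}

fn main() {
    tracing_subscriber::fmt::init();
    handle_inbound("QmbhXbFML9JbBHr96eHKXWbc7t2fWCLdnh6aNbzFAoz2rw");
}
```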

Note: tracing is maintained by the folks from tokio but works entirely independently of the tokio runtime.

@tomaka (Member) commented Apr 29, 2020

I'm personally not a fan of forcing this kind of addition into the code. For example, I already strongly dislike the fact that we're using the log crate.

But what I dislike is the "forcing" part. I'm in favour of adding (optional) wrappers around ProtocolsHandler and other traits.
In other words, the user has to explicitly opt in to these measurements.

@thomaseizinger (Contributor, Author) commented May 2, 2020

> In other words, the user has to explicitly opt in to these measurements.

Would a disabled-by-default cargo feature also fulfill this requirement? (Assuming it doesn't add too much of a maintenance burden to the codebase.)
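For illustration, such an opt-in gate could look like this (the feature name `diagnostics` and the `record_upgrade` function are hypothetical, not actual rust-libp2p API):

```rust
// With the hypothetical `diagnostics` feature enabled, `tracing` would
// be compiled in as an optional dependency and spans get recorded.
#[cfg(feature = "diagnostics")]
pub fn record_upgrade(peer_id: &str, protocol: &str) {
    let span = tracing::info_span!("upgrade", peer_id = %peer_id);
    let _guard = span.enter();
    tracing::info!(%protocol, "upgrading inbound substream");
}

// With the feature disabled (the default), this compiles to a no-op
// and adds no overhead or extra dependency.
#[cfg(not(feature = "diagnostics"))]
pub fn record_upgrade(_peer_id: &str, _protocol: &str) {}
```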

> I'm in favour of adding (optional) wrappers around ProtocolsHandler and other traits.

From what I understand, most of the tracing spans and events would have to be added in libp2p-core, as it is the part of the code that polls the other components and has the necessary contextual information. I am not sure wrappers can achieve the same thing.

I'll try to put up a draft PR to show what I mean.

@folex (Contributor) commented Nov 26, 2020

It would be totally awesome if I could somehow mark a whole execution flow with some generated ID and then figure out which request triggered what, and how long it spent where. But I imagine that would require a big overhaul of the libp2p-core codebase, which is unfortunate at best.
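For what it's worth, tracing's span mechanism can express exactly that pattern. A minimal sketch (the `handle_request` function and the id are made up, not libp2p code; the `futures` crate is assumed for the executor):

```rust
use tracing::Instrument;

// Tag an entire async execution flow with a generated id by wrapping
// it in a span; every event emitted inside, across await points,
// inherits the id.
async fn handle_request(request_id: u64) {
    async {
        tracing::info!("request started");
        // ... do the actual work; nested calls stay inside the span ...
        tracing::info!("request finished");
    }
    .instrument(tracing::info_span!("request", id = request_id))
    .await;
}

fn main() {
    tracing_subscriber::fmt::init();
    futures::executor::block_on(handle_request(42));
}
```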

@thomaseizinger what do you have (or had? :)) in mind: mark all tracing events with the peer_id from inject_event, or what kind of info would you add?

@thomaseizinger (Contributor, Author) commented:

> @thomaseizinger what do you have (or had? :)) in mind: mark all tracing events with the peer_id from inject_event, or what kind of info would you add?

Back when I created this ticket, I was primarily interested in the PeerId being included in a Span for all the code within libp2p that is implicitly scoped to a single peer: ProtocolsHandlers and upgrades. Once you reach the NetworkBehaviour abstraction level, the PeerId is provided to you as a parameter anyway, so this would not be necessary there.

Unfortunately, I never got around to following up with a draft PR 😅

I think it might be possible to achieve this with wrappers around the components that are passed to the SwarmBuilder. However, I am not familiar enough with the internals of libp2p to point out exactly where we would have to hook in.

Conceptually, what we need is:

  • Access to the point at which we have already learned the PeerId of the other party. I think this implies that we are past the authentication stage. Maybe a custom upgrade that wraps any other authentication upgrade is enough?
  • Once we have the PeerId + connection, we would need to keep passing on wrappers as the connection flows through libp2p-core and swarm and create the necessary spans (see the sketch below).
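A minimal sketch of that last idea, using tracing's Instrument adapter (the function names and the String peer id are placeholders, not rust-libp2p types):

```rust
use tracing::Instrument;

// Once authentication has yielded the remote's identity, wrap the
// future that drives the rest of the connection in a span; everything
// it logs then carries the peer id.
async fn spawn_connection(peer_id: String) {
    let span = tracing::debug_span!("connection", peer_id = %peer_id);
    drive_connection().instrument(span).await;
}

async fn drive_connection() {
    // Any event emitted here is embedded in the "connection" span.
    tracing::debug!("polling connection");
}
```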

@eserilev (Contributor) commented Aug 1, 2023

Hello, I listened in on the maintainers call today and heard you mention this issue. From what I understood, adding tracing should be relatively straightforward and can be done in chunks. I think it was also mentioned that it would be nice to have this tracing specifically for Kademlia, so maybe that's a good place to start? Thomas also provided firezone/firezone#1741 as an example of how he implemented tracing in a different repo, so it could serve as a nice blueprint to work from.

I'd really like to work on this feature. I have virtually no experience with the inner workings of libp2p, but maybe a task like this would be a good introduction?

@thomaseizinger (Contributor, Author) commented:

> I'd really like to work on this feature.

Great! Thank you :)

I think a useful place to start is here: https://github.com/libp2p/rust-libp2p/blob/master/swarm/src/connection.rs

This is the main state machine that drives each connection. It has the context of which peer we are connected to and which connection ID we assigned to it. I would start by capturing this in an "error"-level span.

Following from that, you can change one of the protocols from log to tracing and modify an example or test to initialize a tracing subscriber instead of a log subscriber. You should then see those scopes in the logs.
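To make that last step concrete, here is a minimal sketch of initializing a tracing subscriber in an example or test (assumes the tracing-subscriber crate with its "env-filter" feature enabled; this is not code from the eventual PR):

```rust
fn main() {
    // Replaces e.g. a log-based `env_logger::init()`; honours RUST_LOG
    // directives for per-module filtering.
    tracing_subscriber::fmt()
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .init();

    // Events emitted inside this span show up with its fields attached.
    let span = tracing::error_span!("connection", id = 1);
    let _guard = span.enter();
    tracing::info!("this event carries the connection span's context");
}
```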

eserilev mentioned this issue on Aug 1, 2023
thomaseizinger changed the title from "Consider using tracing for improved diagnostics" to "Use tracing for improved diagnostics" on Nov 1, 2023
mergify bot closed this as completed in #4282 on Nov 2, 2023
mergify bot pushed a commit that referenced this issue Nov 2, 2023
We replace `log` with `tracing` across the codebase. Where possible, we now make use of structured logging instead of templating strings.

`tracing` also offers the ability to record "spans". A span lasts until it is dropped and describes the entire duration for which it is active. All logs (in `tracing` terms, "events") are hierarchically embedded in their parent spans.

We introduce several spans:

- On debug level: One for `new_outgoing_connection`, `new_incoming_connection` and `new_established_connection`
- On debug level: `Connection::poll`, `Swarm::poll` and `Pool::poll`
- On trace level: `NetworkBehaviour::poll` for each implementation of `NetworkBehaviour`
- On trace level: `ConnectionHandler::poll` for each (protocol) `ConnectionHandler` implementation

The idea here is that logging on debug level gives you a decent overview of what the system is doing. You get spans for the duration of connections and how often each connection gets polled.

Dropping down to trace level gives you an extremely detailed view of how long each individual `ConnectionHandler` was executed as part of `poll`, which could be used for a detailed analysis of how busy certain handlers are.

Most importantly, simply logging on `info` does not give you any spans. We consider `info` to be a good default that should be reasonably quiet.

Resolves #1533.

Pull-Request: #4282.
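As an illustration of the level-based behaviour described in the commit message (not part of the PR itself), an EnvFilter can keep the quiet `info` default while opting specific crates into the span-producing levels:

```rust
fn main() {
    // Everything at info (no spans), connection/swarm spans at debug,
    // per-handler polling spans at trace for Kademlia.
    let filter = tracing_subscriber::EnvFilter::new(
        "info,libp2p_swarm=debug,libp2p_kad=trace",
    );
    tracing_subscriber::fmt().with_env_filter(filter).init();
}
```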
umgefahren pushed a commit to umgefahren/rust-libp2p that referenced this issue on Mar 8, 2024