Clearer Kademlia client lifecycle and internals #2122

Closed
izolyomi opened this issue Jul 6, 2021 · 10 comments

Comments

@izolyomi
Contributor

izolyomi commented Jul 6, 2021

I've been experimenting with a sample application that discovers its peers with Kademlia; nodes start with a PeerId to connect to for encrypted "chatting". Though the libp2p variant of Kademlia is documented here, I've found only a minimal tutorial and a blog article on working with the Rust library, and nearly nothing on its Rust implementation details, so I might have misconceptions about how it works.

I'm using the TCP transport with noise and a custom NetworkBehaviour composing the identify protocol (to report external addresses), Kademlia (for peer discovery) and, initially, the request-response protocol (for ping-pong style sample messaging). Though I managed to get the "chat" working, I still don't have the whole picture of the observed behaviour. As far as I understand, sending a request with request-response to a specific PeerId always triggers dialing the peer if it is not connected yet.
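
Roughly, the composition looks like the sketch below, written against the 0.39-era derive API; `PingPongCodec` and its request/response types are placeholders for my custom request-response codec, and the handlers are trimmed to what's relevant here:

```rust
use libp2p::identify::{Identify, IdentifyEvent};
use libp2p::kad::{record::store::MemoryStore, Kademlia, KademliaEvent};
use libp2p::request_response::{RequestResponse, RequestResponseEvent};
use libp2p::swarm::NetworkBehaviourEventProcess;
use libp2p::NetworkBehaviour;

// Placeholder types for my request-response codec (deriving Debug); any
// RequestResponseCodec implementation fits here.
use crate::codec::{PingPongCodec, PingPongRequest, PingPongResponse};

// On 0.39 the derive defaults to "event process" mode, so the composed behaviour
// needs a NetworkBehaviourEventProcess impl per sub-behaviour event type.
#[derive(NetworkBehaviour)]
struct ChatBehaviour {
    identify: Identify,
    kademlia: Kademlia<MemoryStore>,
    request_response: RequestResponse<PingPongCodec>,
}

impl NetworkBehaviourEventProcess<IdentifyEvent> for ChatBehaviour {
    fn inject_event(&mut self, event: IdentifyEvent) {
        if let IdentifyEvent::Received { peer_id, info } = event {
            // Feed the addresses reported by identify into Kademlia's routing table.
            for addr in info.listen_addrs {
                self.kademlia.add_address(&peer_id, addr);
            }
        }
    }
}

impl NetworkBehaviourEventProcess<KademliaEvent> for ChatBehaviour {
    fn inject_event(&mut self, event: KademliaEvent) {
        log::debug!("kademlia event: {:?}", event);
    }
}

impl NetworkBehaviourEventProcess<RequestResponseEvent<PingPongRequest, PingPongResponse>>
    for ChatBehaviour
{
    fn inject_event(&mut self, event: RequestResponseEvent<PingPongRequest, PingPongResponse>) {
        log::debug!("request-response event: {:?}", event);
    }
}
```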

Though I always drive the event loop by polling swarm.next_event(), dialing a peer always fails immediately with DialFailure unless some Kademlia query such as get_providers() is actively being executed. Not all queries help, however; e.g. running bootstrap() is not enough, and dial attempts still fail. By the way, tutorials simply ignore bootstrapping Kademlia, so I'm not completely sure when it is needed at all.

Is this expected behaviour in any way? Is there any documentation how I should drive the client to work properly? Is there something I can do for easier debugging?

I know the description above is quite vague. I'm not yet sure how to provide a usable minimal codebase for reproducing the behaviour; e.g. I'm testing in a real-world environment and not using circuit relays yet, so currently the peer to connect to must not be behind NAT. Is there any way I can help with more details to make my issue clearer?

@mxinden
Member

mxinden commented Jul 6, 2021

Thanks for reaching out @izolyomi.

Though the libp2p variant of Kademlia is documented here, I've found only a minimal tutorial and a blog article on working with the Rust library, and nearly nothing on its Rust implementation details, so I might have misconceptions about how it works.

I wish I had more time to work on this. Contributions in any form, especially on documentation and tutorials, are very much appreciated.

dialing a peer always fails immediately with DialFailure

When spawning a request to a PeerId, is the libp2p-request-response or any other running NetworkBehaviour aware of the Multiaddr of the remote peer?

/// > **Note**: In order for such a dialing attempt to succeed,
/// > the `RequestResponse` protocol must either be embedded
/// > in another `NetworkBehaviour` that provides peer and
/// > address discovery, or known addresses of peers must be
/// > managed via [`RequestResponse::add_address`] and
/// > [`RequestResponse::remove_address`].

Btw, tutorials simply ignore bootstrapping Kademlia, so I'm not completely sure when it is needed at all.

Depends on your use-case. In tiny setups (e.g. /examples) bootstrapping is not needed, given that peers already know each other. When connecting to larger DHTs, I would do it on a regular basis (see https://github.com/mxinden/rust-libp2p-server/blob/c046927214d44b2f1f7917aa82735c93909fbf1b/src/main.rs#L31).
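
For example, something along the lines of the sketch below, assuming a tokio runtime and a composed behaviour exposing a `kademlia` field (the interval is arbitrary):

```rust
use std::time::Duration;

// Sketch: periodically re-run the bootstrap query while driving the swarm.
let mut bootstrap_interval = tokio::time::interval(Duration::from_secs(5 * 60));
loop {
    tokio::select! {
        _ = bootstrap_interval.tick() => {
            // Fails with NoKnownPeers until at least one address is in the routing table.
            if let Err(e) = swarm.behaviour_mut().kademlia.bootstrap() {
                log::debug!("bootstrap error: {:?}", e);
            }
        }
        event = swarm.next_event() => {
            log::debug!("swarm event: {:?}", event);
        }
    }
}
```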

Is there any way I can help with more details to make my issue clearer?

Can you run with RUST_LOG=debug and post the output up until you receive the DialFailure? Make sure to add env_logger::init(); to the top of your fn main.
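
That is, roughly:

```rust
fn main() {
    // Initialize logging first so `RUST_LOG=debug cargo run` captures the libp2p internals.
    env_logger::init();

    // ... build the transport, behaviour and swarm, then drive the event loop ...
}
```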

@izolyomi
Contributor Author

izolyomi commented Jul 7, 2021

@mxinden thanks for the answer.

Investigating some more, I can confirm that all problems with sending requests are caused by having no known addresses for the target peer. Meanwhile I've managed to upgrade to the unreleased 0.39; it seems to be much more stable in practice with the Kademlia debug_assert problem fixed here. Fortunately, 0.39 also seems to discover addresses better.

Unfortunately, I still experience behaviour that I think is a deterministic bug somewhere around Kademlia. Whenever a Kademlia query (typically get_closest_peers(peer_id)) finishes and emits an event with its final result, all previously discovered addresses of the target peer stored in the NetworkBehaviour are suddenly cleared and lost. The client is then forced to rediscover the addresses, which succeeds, until the next get_closest_peers query returns a final result, the addresses are cleared again, and everything starts over.

I'll try to prepare and attach some usable logs here showing this pattern as you asked.

@izolyomi
Contributor Author

izolyomi commented Jul 7, 2021

address_cleanedup.log

Please find the logs attached; there are quite a lot of messages, about 5 MB in total. My sample codebase logs under the libp2p_sandbox target with only a few messages; everything else is logged by internal libp2p crates.

@mxinden
Member

mxinden commented Jul 8, 2021

Unfortunately, I still experience behaviour that I think is a deterministic bug somewhere around Kademlia. Whenever a Kademlia query (typically get_closest_peers(peer_id)) finishes and emits an event with its final result, all previously discovered addresses of the target peer stored in the NetworkBehaviour are suddenly cleared and lost. The client is then forced to rediscover the addresses, which succeeds, until the next get_closest_peers query returns a final result, the addresses are cleared again, and everything starts over.

I am not quite sure I follow. I am having difficulties deducing this behavior from the logs. Do I understand correctly that the addresses you add via RequestResponse::add_address are removed each time Kademlia emits an OutboundQueryCompleted event?

@izolyomi
Contributor Author

izolyomi commented Jul 8, 2021

Sorry for the unclear explanation. Your understanding above is mostly right, but the whole point of my experimentation is to automate address discovery, i.e. to contact a peer without knowing its address in advance. Hence I don't manually add addresses but leave address discovery to identify + Kademlia, which typically does the job quite well.

However, once the external address is discovered and the peer is connected, request/response messages are sent and everything seems to work as expected. But as soon as the get_closest_peers query finishes and triggers an event with its final result, all previously discovered addresses of the peer are somehow cleared and requests are no longer delivered via the existing connection. I do not receive an event about the connection being closed, but the client has either silently closed it or cannot reuse it, and is thus forced to rediscover the address and reconnect.

This behaviour is deterministic, even with the latest 0.39 patched with the debug_assert fix. I've double-checked that the connection is not prematurely closed for being idle by also adding a Ping protocol to the NetworkBehaviour, but it didn't change anything.

We are about to release new versions of some open-source crates I'm using, so I'll probably be able to share sources for reproducing this more easily if needed. Until then, I'll try to provide the relevant parts of the log, though they lack the internal debug details that you'd probably need for a fix.

@izolyomi
Contributor Author

izolyomi commented Jul 8, 2021

The remote peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA currently just echoes back the request contents with an Echo ... prefix. My relevant custom log lines showing the behaviour I'm describing are below; I hope they help in understanding the issue:

[2021-07-08T13:45:14Z INFO  libp2p_sandbox::config] Libp2p local peer id: PeerId("12D3KooWAYFzVvNfpu8j6HZ8cS6tXngRe8kLXCrwEeXBw86JG9h6")
[2021-07-08T13:45:14Z INFO  libp2p_sandbox::swarm] Bootstrapping Kademlia network

... Sending request containing "send some request" ...

[2021-07-08T13:47:20Z INFO  libp2p_sandbox::swarm] Address lookup for peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA resulted: []
[2021-07-08T13:47:20Z INFO  libp2p_sandbox] Dialing 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA
[2021-07-08T13:47:25Z INFO  libp2p_sandbox] Failed to reach 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA via "/ip4/10.156.0.36/tcp/2077/p2p/12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA", remaining 3 attempts, cause: Transport(Other(Custom { kind: Other, error: Timeout }))
[2021-07-08T13:47:25Z INFO  libp2p_sandbox] Connected to 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA via Dialer { address: "/ip4/35.198.135.131/tcp/2077/p2p/12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA" } times 1
[2021-07-08T13:47:25Z INFO  libp2p_sandbox::swarm] Exchanged peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA addresses IdentifyInfo { public_key: Ed25519(PublicKey(compressed): bf93df5b99f68593f398ade3a86bf81c18e033492bbebf5875c32c8062443), protocol_version: "/iop/mercury/1.0.0", agent_version: "rust-libp2p/0.30.0", listen_addrs: ["/ip4/35.198.135.131/tcp/2077", "/ip4/178.132.7.102/tcp/2077", "/ip4/10.156.0.36/tcp/2077", "/ip4/127.0.0.1/tcp/2077"], protocols: ["/ipfs/ping/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/ipfs/kad/1.0.0", "/iop/mercury/1.0.0"], observed_addr: "/ip4/81.182.9.72/tcp/51722" }

... Sending request containing "does it work?" ... 

[2021-07-08T13:47:49Z INFO  libp2p_sandbox::swarm] Address lookup for peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA resulted: ["/ip4/35.198.135.131/tcp/2077", "/ip4/10.156.0.36/tcp/2077", "/ip4/127.0.0.1/tcp/2077"]
[2021-07-08T13:47:49Z INFO  libp2p_sandbox::swarm] Peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA sent response to request 2: "Echo does it work?"

... Sending request containing "yes it does" ...

[2021-07-08T13:48:05Z INFO  libp2p_sandbox::swarm] Address lookup for peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA resulted: ["/ip4/35.198.135.131/tcp/2077", "/ip4/10.156.0.36/tcp/2077", "/ip4/127.0.0.1/tcp/2077"]
[2021-07-08T13:48:05Z INFO  libp2p_sandbox::swarm] Peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA sent response to request 3: "Echo yes it does"
[2021-07-08T13:48:18Z INFO  libp2p_sandbox::swarm] Got closest peers: [PeerId("12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA"), PeerId("QmRrwBYiqw4dRCz6q1QiC8oALnC2Xwnn6hMXAC3H751F3F"), PeerId("QmXLRDid6aqEqAhm4oysZbATZmnUKRwpYbsJPy2LkSho9x"), PeerId("QmfCZs3N6EFTaVLw3TftPidxeVfc1yaRLncip2eiSjzSsP"), PeerId("12D3KooWKkP21nnErYFbXdbetWoTP7JNRinVSaaNzEgaoGC56Yae"), PeerId("QmQDNpKbQHwQMZVuxKKpUC53BJhGc9NakwbEbvHqBY7Bin"), PeerId("12D3KooWKiBpHPZw3A5ngB9Bo1YAiDcNgWq2dJfRfbKhzADnE3oc"), PeerId("12D3KooWExH2CGBuSdu9mWLiwZ3BA7EFdpWd39ZEgQEVWfa92nM8"), PeerId("QmebdvScs8nYAKYDmPK8fNtH5BnqPe9LAeWyxUFSbhDPwN"), PeerId("Qma1HgJ3AcAMsDe4MvawEpttTByjjkNGSQqkrFQ7qNJTVK"), PeerId("12D3KooWBqXURGrk6gtc2Vsdi2DnJDxgpWJopDTCRjV3RADtnfy2"), PeerId("12D3KooWDRsCKTf2tCuGJ1nfz8CnBWrhiooL1UrHrXLxoW59Nffs"), PeerId("QmT7KNY9zNw5xnm1r1mDGVWKxHGZ4xJhxvhXXdFNSwtEPr"), PeerId("QmWXUfNE7TNvTsdWobfegHSKBJP7TPmbgV5U5Z9sh8QaN5"), PeerId("12D3KooWAFDYqX2ejMP9ywb8KpwFAv6MnjsxyMAwARFCJL88Na6W"), PeerId("QmUHvRgW8z4qDCy7cA9v7wcGevemsetCeT9SVE3djCvh89"), PeerId("QmSwvcZNQjXkvERqpA7iEtxqjrx3a9q6YXYhTMyRY9swwe"), PeerId("QmNoxRtxV9m7rjYMSjf1o19i3VgZyfnfUSUADaas7htLJL"), PeerId("12D3KooWMmRtmMMBmhC3EMJcySfnybEmbqWuosLWv8G5hYiPze8w"), PeerId("QmPsW3FRHmAgwCeiabsGizADUuALKJrYw7S5LrVAU8jBnz")]

... Sending request containing "but it does not work anymore" ...

[2021-07-08T13:48:35Z INFO  libp2p_sandbox::swarm] Address lookup for peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA resulted: []
[2021-07-08T13:48:35Z INFO  libp2p_sandbox::swarm] Searching for peer 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA

... but manually exiting the peer is immediately detected by emitting a connection closed event ...

[2021-07-08T13:49:17Z INFO  libp2p_sandbox] Connection closed to 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA via Dialer { address: "/ip4/35.198.135.131/tcp/2077/p2p/12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA" }, remaining 0 connections, cause: Some(IO(Custom { kind: Other, error: Closed }))

@mxinden
Member

mxinden commented Jul 14, 2021

Thank you @izolyomi for the distilled logs! Indeed, this is surprising.


Soon we are about to release new versions of some open source crates I'm using, so I'll probably be able to share sources for reproducing this a bit easier if needed.

That would be great.


Some vague conjecture: the get_closest_peers query finds peers that push 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA out of its k-bucket. In other words, Kademlia discovers other peers which it deems better suited to keep in its k-buckets and thus evicts 12D3KooWNiCpQtkSmKHm9FtVCfZUswe4oFzbMyuGQb9ZzuTsMLLA along with its addresses.

In order to confirm or refute the above conjecture, could you log all of the Kademlia events? In particular, we would need KademliaEvent::RoutingUpdated.
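
Something along these lines would do (a sketch; the `..` in the pattern keeps it compatible if the variant has additional fields in your version):

```rust
use libp2p::kad::KademliaEvent;

// Wherever Kademlia events are handled, log routing-table updates explicitly.
fn log_kademlia_event(event: &KademliaEvent) {
    match event {
        KademliaEvent::RoutingUpdated { peer, addresses, old_peer, .. } => {
            log::info!(
                "routing updated: peer={} addresses={:?} evicted={:?}",
                peer, addresses, old_peer
            );
        }
        other => log::debug!("kademlia event: {:?}", other),
    }
}
```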

As a more general thought, when using libp2p-kad on a large DHT, e.g. the IPFS DHT, one cannot depend on libp2p-kad to keep all peers and all their addresses in its k-buckets at all times. I would imagine the following process to be better suited for you (a rough sketch follows the list):

  1. Do a lookup for peer ID X via libp2p-kad get_closest_peers to get the addresses of X.
  2. Add the addresses of X to libp2p-request-response via add_address.
  3. Trigger the request via send_request.
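
In code, the idea is roughly the sketch below, collapsed into one function for brevity; `ChatBehaviour` and `PingPongRequest` stand in for your own composed behaviour and request type, and in a real event loop steps 2 and 3 only run once the lookup and identify have produced addresses:

```rust
use libp2p::{Multiaddr, PeerId};

fn contact_peer(
    swarm: &mut libp2p::Swarm<ChatBehaviour>,
    target: PeerId,
    discovered_addrs: Vec<Multiaddr>,
) {
    // 1. Look up the peer. The final query result only carries PeerIds; the
    //    addresses are collected from identify/Kademlia events while it runs.
    swarm.behaviour_mut().kademlia.get_closest_peers(target.clone());

    // 2. Hand every address learned for the target to the request-response behaviour.
    for addr in discovered_addrs {
        swarm.behaviour_mut().request_response.add_address(&target, addr);
    }

    // 3. Sending the request now dials one of the known addresses if needed.
    let _request_id = swarm
        .behaviour_mut()
        .request_response
        .send_request(&target, PingPongRequest("ping".into()));
}
```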

Also, take a look at #2133, which I would expect would simplify your use-case once implemented.

@izolyomi
Contributor Author

izolyomi commented Jul 29, 2021

@mxinden thank you for the detailed description. Based on your feedback, I did some more investigation.

  1. Do a lookup for peer ID X via libp2p-kad get_closest_peers to get the addresses of X.
  2. Add the addresses of X to libp2p-request-response via add_address.
  3. Trigger the request via send_request.

The get_closest_peers function reports only PeerIds, without addresses. But if I also use the identify protocol in my NetworkBehaviour and watch for its address-exchange events about the target peer, the method you described above works nicely. Addresses no longer disappear, so my original problem is fixed with this approach.
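
Concretely, the piece that made it work was handing the identify-reported addresses to the request-response behaviour as well, roughly as below (same placeholder ChatBehaviour naming as in my earlier sketch):

```rust
use libp2p::identify::IdentifyEvent;
use libp2p::swarm::NetworkBehaviourEventProcess;

impl NetworkBehaviourEventProcess<IdentifyEvent> for ChatBehaviour {
    fn inject_event(&mut self, event: IdentifyEvent) {
        if let IdentifyEvent::Received { peer_id, info } = event {
            for addr in info.listen_addrs {
                // Keep Kademlia's routing table updated as before...
                self.kademlia.add_address(&peer_id, addr.clone());
                // ...but also remember the address in the request-response behaviour,
                // so send_request can still dial after Kademlia evicts the peer.
                self.request_response.add_address(&peer_id, addr);
            }
        }
    }
}
```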

I've also checked the topic of #2133 and I agree that it might make address handling more intuitive. Besides manual address-book handling, I'd expect a swarm/behaviour configuration option that automatically fills and updates the address book with all newly discovered addresses and changes. But I agree that the collected data might consume a lot of memory and the automation might add implementation complexity, so it makes sense to skip this feature.

I've also tried creating a smaller code example from my experiments showing the same symptoms. Doing so, I've found that after dropping the start_providing and get_providers calls I'm using for a single "swarmId hash", addresses were still removed from Kademlia, but the request/response protocol somehow did not "lose" the existing connection and could continue to use it to send messages. Previously the same connection somehow became "invisible" to the req/resp protocol while it was still alive; e.g. closing the remote peer properly ended with a ConnectionClosed event, but requests always failed.

I'm unsure whether this is worth more investigation; if so, I can now share the codebase and maybe we can hunt down the root cause.

@mxinden
Member

mxinden commented Jul 30, 2021

Thank you for the detailed comment.

I'm unsure whether this is worth more investigation; if so, I can now share the codebase and maybe we can hunt down the root cause.

Unless you or anyone else is still facing the issues, I would not dig deeper.

@thomaseizinger
Contributor

Closing as resolved. Please comment if there is still something actionable here.
