DSN sync can get stuck indefinitely #2729

Closed
nazar-pc opened this issue Apr 30, 2024 · 7 comments
Labels
bug Something isn't working networking Subspace networking (DSN) node Node (service library/node app)

Comments

@nazar-pc
Member

nazar-pc commented Apr 30, 2024

@shamil-gadelshin, not sure how to debug this, but there must still be a bug in libp2p that sometimes causes requests to get stuck.
I just had my node rebooted after a few minutes of being offline, and it was not able to finish DSN sync in 30 minutes.
After a restart it synced successfully in a few minutes.

Users report the same thing from time to time and we should look for a way to:

  1. Detect this
  2. Work around it and log something useful to facilitate debugging

While a restart helps, it is a suboptimal experience, and for Space Acres users who don't read logs all the time it is even more confusing.

@nazar-pc nazar-pc added bug Something isn't working networking Subspace networking (DSN) node Node (service library/node app) labels Apr 30, 2024
@nazar-pc nazar-pc added this to the Protocol UX Improvements milestone Apr 30, 2024
@nazar-pc
Member Author

I think we should start by tracking all libp2p queries and when they started, then periodically print those that have not completed for a long time. By knowing which kind of query didn't finish and how long it has been pending, we can narrow down the problem more accurately.
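The tracking idea could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual Subspace or libp2p code: `QueryTracker`, `QueryKind` and its variants are hypothetical names, and the caller is assumed to invoke `start` and `finish` around each query and to call `stuck` from a periodic task that logs the result.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical query kinds; the real set would mirror whatever the
/// networking layer actually issues.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QueryKind {
    KademliaBootstrap,
    GetClosestPeers,
    PieceRequest,
}

/// Remembers when each in-flight query started (hypothetical helper).
#[derive(Default)]
struct QueryTracker {
    next_id: u64,
    in_flight: HashMap<u64, (QueryKind, Instant)>,
}

impl QueryTracker {
    /// Registers a query that started at `now` and returns its tracking id.
    fn start(&mut self, kind: QueryKind, now: Instant) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.in_flight.insert(id, (kind, now));
        id
    }

    /// Marks a query as completed (successfully or not).
    fn finish(&mut self, id: u64) {
        self.in_flight.remove(&id);
    }

    /// Returns queries pending for at least `threshold`, ordered by id,
    /// together with how long each has been pending.
    fn stuck(&self, now: Instant, threshold: Duration) -> Vec<(u64, QueryKind, Duration)> {
        let mut stuck: Vec<_> = self
            .in_flight
            .iter()
            .filter_map(|(&id, &(kind, started))| {
                let age = now.saturating_duration_since(started);
                (age >= threshold).then_some((id, kind, age))
            })
            .collect();
        stuck.sort_by_key(|&(id, _, _)| id);
        stuck
    }
}
```

Passing `now` explicitly keeps the tracker easy to test; a periodic task would call `tracker.stuck(Instant::now(), threshold)` and log each returned kind and age.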

As for DSN sync itself, I think we can add generous timeouts. Assuming the networking stack doesn't get stuck completely, we should be able to simply restart DSN sync, and it will hopefully succeed on the second attempt.
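A rough sketch of the timeout-and-restart idea, under loud assumptions: `run_with_retries` is a hypothetical helper, the sync attempt is modeled as a plain blocking closure, and standard-library threads stand in for the async runtime the node actually uses. A stuck attempt is abandoned here rather than cancelled.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Runs `attempt` up to `max_attempts` times, allowing each run at most
/// `timeout`. Returns the first result that arrives in time, if any.
/// Hypothetical helper, not part of the Subspace code base.
fn run_with_retries<T, F>(max_attempts: u32, timeout: Duration, attempt: F) -> Option<T>
where
    T: Send + 'static,
    F: Fn(u32) -> T + Clone + Send + 'static,
{
    for n in 1..=max_attempts {
        let (tx, rx) = mpsc::channel();
        let attempt = attempt.clone();
        // The worker is detached: if it is stuck we stop waiting for it,
        // but we do not (and cannot) forcibly stop it.
        thread::spawn(move || {
            // Sending fails only if the receiver already gave up; ignore that.
            let _ = tx.send(attempt(n));
        });
        match rx.recv_timeout(timeout) {
            Ok(value) => return Some(value),
            Err(RecvTimeoutError::Timeout) => {
                eprintln!("attempt {n} did not finish within {timeout:?}, restarting");
            }
            Err(RecvTimeoutError::Disconnected) => {
                eprintln!("attempt {n} ended without a result, restarting");
            }
        }
    }
    None
}
```

In an async code base the same shape would more likely be a loop around the runtime's own timeout combinator, which also drops the stuck future instead of leaving a thread behind.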

@shamil-gadelshin
Member

Researching existing timeouts makes sense. I also suspect that the PeerId::random() search (used for segment header requests) behaves differently from the piece_index search, likely because of timeouts as well.

@nazar-pc nazar-pc moved this from Todo to In Progress in Subspace core (node, farmer, etc.) May 30, 2024
@nazar-pc
Member Author

@shamil-gadelshin I narrowed it down to at least Kademlia bootstrap getting stuck sometimes: libp2p/rust-libp2p#5432

It is possible that other requests might get stuck too, but that is the only one I reproduced so far.

@lionsoul2014

info.log

@nazar-pc these are the logs since the node was created; I copied the db from an old snapshot.

@nazar-pc
Member Author

nazar-pc commented Jun 2, 2024

If you start the node with the RUST_LOG=info,subspace_service=trace,sync=debug environment variable, it might give us more information about why it doesn't want to sync. Something is off, but the logs are not sufficient to understand what.

@lionsoul2014

> If you start node with RUST_LOG=info,subspace_service=trace,sync=debug environment variable it might give us more information about why it doesn't want to sync. Something is off, but logs are not sufficient to understand what.

I've already overwritten the db with a synced one.

@nazar-pc
Member Author

nazar-pc commented Jun 6, 2024

I'll close this for now. A workaround and a fix are already included in main and will be part of the next release. The full fix will likely come once we upgrade to a release that includes libp2p/rust-libp2p#5349, at which point we'll be able to remove the workaround for good.
