
Bridge Node Stuck After Archive RPC Restart #4090

Open
kj89 opened this issue Feb 5, 2025 · 6 comments · May be fixed by #4093
Assignees
Labels
bug Something isn't working external Issues created by non node team members

Comments

kj89 commented Feb 5, 2025

Celestia Node version

v0.21.3-mocha

Celestia Consensus Node version

v3.3.0-mocha

OS

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

Steps to reproduce it

Restart the archive RPC while the bridge node is running

Expected result

The bridge node should be able to recover from an archive RPC restart without requiring a manual service restart.

Actual result

The bridge node became unresponsive. It repeatedly logged fetcher and listener errors. The issue persisted until a manual restart was performed.

Relevant log output

2025-02-05T07:28:54.515Z        INFO    header/store    store/store.go:367      new head        {"height": 4529796, "hash": "4881C54F1993FBC1BEDC574CBBA71A798210B88C17572DBE970B0673E83FC110"}
2025-02-05T07:28:59.391Z        INFO    header/store    store/store.go:367      new head        {"height": 4529797, "hash": "7953D784D1982641504D365BF30BF8AF3ADE797EB722B87996296C87FAB85BF1"}
2025-02-05T07:29:04.389Z        INFO    header/store    store/store.go:367      new head        {"height": 4529798, "hash": "3A798993AD62B133BB73EA63FEBDF29B1D06B21AA650335B9ADAEB05A38CF754"}
2025-02-05T07:29:09.274Z        INFO    header/store    store/store.go:367      new head        {"height": 4529799, "hash": "B6F80329D1DA76F3602C1D5FB91B27C04835E4AA142F6221E7A175A01015E55B"}
2025-02-05T07:29:14.243Z        INFO    header/store    store/store.go:367      new head        {"height": 4529800, "hash": "BCE184F68E3C165FF60C2BB569CA1DB37C78F2356B34039FFD4261557A049DAF"}
2025-02-05T07:29:17.963Z        INFO    bitswap/server/decision decision/engine.go:733  handling wantlist overflow      {"local": "12D3KooWKdtKbb6KHYnnEjxdMgBT3eVKmuvT8ykjgbajcSKz89LP", "from": "12D3KooWDHfTaYvQE1z4DBKRtH5rKnJc5RrBpjrYhM1EFYttyCkt", "wantlistSize": 4093, "overflowSize": 3}
2025-02-05T07:29:19.217Z        INFO    header/store    store/store.go:367      new head        {"height": 4529801, "hash": "ACA69B7C5EFD9F98372B8D1E18255FBD30B77AD7D76F4299C486B3391278E01C"}
2025-02-05T07:29:24.286Z        INFO    header/store    store/store.go:367      new head        {"height": 4529802, "hash": "91B489EC15D4F1D5CFC0D7B1BD8AC8DF37D5A6FA6159836FB9F09A8E2F8C91D0"}
2025-02-05T07:29:29.497Z        INFO    header/store    store/store.go:367      new head        {"height": 4529803, "hash": "3530B64936561099C82A9E3DF4E4BDBF6359BEC8D196C299A9A155755453ED88"}
2025-02-05T07:29:31.733Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:29:31.734Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:29:31.734Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:29:31.734Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:29:31.734Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
...
2025-02-05T07:30:09.864Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:30:09.864Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:30:09.864Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:30:09.864Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
2025-02-05T07:30:09.864Z        ERROR   core    core/fetcher.go:191     fetcher: error receiving new height     {"err": "rpc error: code = Unavailable desc = error reading from server: EOF"}
Suppressed 1363430 messages from celestia-bridge.service
2025-02-05T07:30:39.499Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T07:30:44.499Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T07:30:47.994Z        INFO    bitswap/server/decision decision/engine.go:733  handling wantlist overflow      {"local": "12D3KooWKdtKbb6KHYnnEjxdMgBT3eVKmuvT8ykjgbajcSKz89LP", "from": "12D3KooWDHfTaYvQE1z4DBKRtH5rKnJc5RrBpjrYhM1EFYttyCkt", "wantlistSize": 4094, "overflowSize": 2}
2025-02-05T07:30:49.498Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T07:30:54.498Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T07:30:59.499Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}

Is the node "stuck"? Has it stopped syncing?

Yes

Notes

We encountered an issue today on our testnet bridge node. After restarting the archive RPC, our bridge node got stuck and could not recover on its own. However, a manual service restart resolved the issue.

@kj89 kj89 added the bug Something isn't working label Feb 5, 2025
@github-actions github-actions bot added the external Issues created by non node team members label Feb 5, 2025
@cristaloleg (Contributor)

Can you share the version of the archival consensus node?

@YakupAltay

We have also encountered this issue on Ubuntu 24.04 with celestia-app v3.3.0-mocha and celestia-node v0.21.3-mocha.

Actions Taken and Observations
1) Restarting only the bridge node (leaving the full node running)

  • Initially, the bridge node began syncing headers.
  • However, after approximately 12 hours, it stopped syncing and became unresponsive.

2) Restarting both the bridge and full node

  • The behavior remained the same: syncing resumed initially but eventually stalled.

To prevent potential downtime, we have migrated the node for now, since the issue is not recoverable without manually restarting the node.
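On the fetcher side, automatic recovery would mean re-dialing the consensus RPC with backoff instead of endlessly re-reading a dead stream (the EOF errors above repeated over a million times per minute, per the journald suppression notice). A self-contained Go sketch of that retry pattern, assuming the endpoint eventually comes back; this is an illustration of the general technique, not celestia-node's real code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errUnavailable = errors.New(
	"rpc error: code = Unavailable desc = error reading from server: EOF")

// flakyDial simulates re-dialing the archive RPC: it fails until the
// endpoint comes back (here, on the 4th attempt).
func flakyDial() func() error {
	attempts := 0
	return func() error {
		attempts++
		if attempts < 4 {
			return errUnavailable
		}
		return nil
	}
}

// reconnect retries with capped exponential backoff rather than hammering
// a dead connection, returning how many attempts were needed.
func reconnect(dial func() error, maxAttempts int) (int, error) {
	backoff := 10 * time.Millisecond
	for i := 1; i <= maxAttempts; i++ {
		if err := dial(); err == nil {
			return i, nil
		}
		time.Sleep(backoff)
		if backoff < 100*time.Millisecond {
			backoff *= 2 // cap the growth to keep retries bounded
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts", maxAttempts)
}

func main() {
	attempts, err := reconnect(flakyDial(), 10)
	fmt.Printf("reconnected after %d attempts, err=%v\n", attempts, err)
}
```

The key difference from the observed behavior is that the dial itself is recreated on each attempt, so a restarted RPC is picked up as soon as it is reachable again.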

kj89 (Author) commented Feb 5, 2025

Can you share the version of the archival consensus node?

Added it to the description.

@vgonkivs vgonkivs self-assigned this Feb 5, 2025
@YakupAltay

Hey @cristaloleg! 👋

I think it is not about the consensus node. I have just upgraded the consensus node to v3.3.1-mocha, and after restarting it, the bridge node stalled again.

I had to manually restart the bridge node; it seems fine right now.

I will update here if something happens.

@redwest88

We can confirm encountering the same issue: after restarting the RPC node, the bridge node became stuck. Restarting the bridge node resolves the problem.

celestia-node version: v0.21.3-mocha
celestia-appd version: v3.3.1-mocha
OS: Ubuntu 22.04.5 LTS
Go version: go1.22.3

Celestia bridge service logs:

2025-02-05T18:49:08.291Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:13.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:18.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:23.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:23.775Z        INFO    bitswap/server/decision decision/engine.go:733  handling wantlist overflow      {"local": "12D3KooWCV98RHzGD6JzD77P4KWN2vbTDdSGFwnDz8uKhzQFFmhC", "from": "12D3KooWNj4EKMgrcQY6wouxHYu2W9zpg2v49Fei32aSQpiVJjqc", "wantlistSize": 4080, "overflowSize": 16}
2025-02-05T18:49:28.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:33.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:38.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:43.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:48.291Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:53.290Z        ERROR   core    core/listener.go:171    listener: resubscribe   {"err": "already subscribed to new blocks"}
2025-02-05T18:49:53.773Z        INFO    bitswap/server/decision decision/engine.go:733  handling wantlist overflow      {"local": "12D3KooWCV98RHzGD6JzD77P4KWN2vbTDdSGFwnDz8uKhzQFFmhC", "from": "12D3KooWNj4EKMgrcQY6wouxHYu2W9zpg2v49Fei32aSQpiVJjqc", "wantlistSize": 4083, "overflowSize": 13}

@mindstyle85

Confirming the same here after the upgrade today, cc @renaynay.

@vgonkivs vgonkivs linked a pull request Feb 6, 2025 that will close this issue

6 participants