Node crash on state sync #4137

Closed
bowenwang1996 opened this issue Mar 18, 2021 · 5 comments
Labels
C-bug Category: This is a bug

Comments

@bowenwang1996
Collaborator

As reported by a validator, their node on testnet crashed while trying to do state sync:

Mar 18 05:33:04 neard[1590]: Mar 18 05:33:04.815  WARN sync: State sync didn't download the state for shard 0 in 60 seconds, sending StateRequest again
Mar 18 05:33:09 neard[1590]: Mar 18 05:33:09.870  INFO stats: State [0: parts]  16/15/40 peers ⬇ 6.1MiB/s ⬆ 72.1kiB/s 0.00 bps 0 gas/s CPU: 58%, Mem: 2.6 GiB
Mar 18 05:33:47 neard[1590]: fatal runtime error: Rust cannot catch foreign exceptions

I wonder whether something crashed in rocksdb @mikhailOK

@mikhailOK
Contributor

Reproduced the crash; it happens in rocksdb background compaction. Investigating.

@mikhailOK
Contributor

It is crashing in CompactionPicker::GetRange: inputs is somehow empty, even though the assert just above it says it should not be. The column is ColBlockHeader. Looking into the rocksdb logic.

@bowenwang1996 bowenwang1996 added the C-bug Category: This is a bug label Mar 19, 2021
@bowenwang1996
Collaborator Author

bowenwang1996 commented Apr 4, 2021

From some other reports we received, it seems that this error is triggered when the node runs out of memory. @mikhailOK

@mikhailOK
Contributor

The hypothesis is that an out-of-memory crash can put rocksdb in an inconsistent state, which then leads to the later crash in compaction. The memory issue is fixed in 1.18.1. The next step would be to implement #3266, because the current way of shutting down might also be a risk. Ending the investigation for now.
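
Editorial sketch, not part of the original comment: the shutdown risk mentioned above can be illustrated with a generic pattern in which a writer thread checks a stop flag and flushes its pending writes before exiting, rather than being killed mid-write. Everything here is hypothetical: `Store`, `run_writer`, and `shutdown_gracefully` are illustrative names, not nearcore or rocksdb APIs, and this is an assumption about what a cleaner shutdown could look like, not the design proposed in #3266.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a storage handle (NOT the nearcore or rocksdb API).
struct Store {
    pending: Vec<u64>, // writes buffered in memory
    flushed: Vec<u64>, // writes made durable
}

impl Store {
    /// Moves every buffered write into the durable set.
    fn flush(&mut self) {
        self.flushed.append(&mut self.pending);
    }
}

/// Buffers writes until `stop` is set, then flushes once before returning.
fn run_writer(stop: Arc<AtomicBool>) -> Store {
    let mut store = Store { pending: Vec::new(), flushed: Vec::new() };
    let mut next = 0u64;
    while !stop.load(Ordering::SeqCst) {
        store.pending.push(next);
        next += 1;
        thread::sleep(Duration::from_millis(1));
    }
    // The stop request was observed: flush before giving up the store.
    store.flush();
    store
}

/// Starts the writer, requests a stop, and waits for it to finish cleanly.
fn shutdown_gracefully() -> Store {
    let stop = Arc::new(AtomicBool::new(false));
    let worker = {
        let stop = Arc::clone(&stop);
        thread::spawn(move || run_writer(stop))
    };
    thread::sleep(Duration::from_millis(20));
    stop.store(true, Ordering::SeqCst); // signal instead of killing the thread
    worker.join().expect("writer thread panicked")
}

fn main() {
    let store = shutdown_gracefully();
    println!(
        "flushed {} writes, {} still pending",
        store.flushed.len(),
        store.pending.len()
    );
}
```

The point of the sketch is only that the process waits for in-flight work to reach a consistent point before exiting; a real node would also need to handle signals and multiple subsystems.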

@mikhailOK mikhailOK removed their assignment Apr 7, 2021
@bowenwang1996
Collaborator Author

No action item here. Closing.
