Consensus: "enterPrevote: ProposalBlock is invalid" - Error: "wrong signature" #4926
Comments
Yes, at least it's supposed to.
This error basically means the validator considered the proposed block invalid, so it prevoted nil instead of the block's hash.
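For context, here is a rough sketch of the consensus step that produces this log line (toy types of my own, not the actual Tendermint code): when a node enters the prevote step it validates the proposed block, and if validation fails it logs this error and prevotes nil.

```go
package main

import (
	"errors"
	"fmt"
)

// Block is a simplified stand-in for Tendermint's block type (hypothetical,
// for illustration only).
type Block struct{ Hash string }

// validateBlock stands in for the real validation, which checks the header,
// the LastCommit signatures, and so on.
func validateBlock(b *Block) error {
	if b == nil || b.Hash == "" {
		return errors.New("wrong signature")
	}
	return nil
}

// enterPrevote mirrors the decision that produces the log line in question:
// an invalid proposal block makes the validator prevote nil.
func enterPrevote(proposal *Block) string {
	if err := validateBlock(proposal); err != nil {
		fmt.Printf("enterPrevote: ProposalBlock is invalid, err=%v module=consensus\n", err)
		return "" // prevote nil
	}
	return proposal.Hash // prevote for the block
}

func main() {
	enterPrevote(&Block{}) // an invalid proposal -> prevote nil
}
```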
It's possible in theory, but it's not desired behavior. We haven't seen this before.
@njmurarka Could you please share more logs? We'll need them to understand precisely what happened.
@melekes I would love to. I have the log file of output from one of my daemons; it is attached here. Thank you for clarifying what that "enterPrevote: ProposalBlock is invalid module=consensus" error means. But it raises the question that worries me: even if someone had an old genesis file, how could they have caused the network to go into this state?

The chain gets started with my three validators, and they have such an enormous stake that it is not possible for anybody else to get even 10% of the power, or even 1%. I did this intentionally for my testnet. So with this in mind, how can a single "rogue" node with a "bad setup" accomplish this? I am pretty confident now that somebody simply had "old data" of some sort, and that caused this. Maybe the old genesis. Maybe it was the old KV store? I have not attempted to replicate this, but I can tell you that it happened twice to me. Really scary if this were to happen on a mainnet.
Do you happen to have the stale node(s) still available, or any of their data?
Also, it would've been great to have logs at DEBUG level.
Unfortunately, @melekes, I do not have these nodes running at this point. In fact, to be honest, I do not even know which nodes were "stale", since I am not yet clear what happened, who was stale, and what stale means here. Is this a node that has a new validator private key? An old one? Or merely a validator that has the old genesis file?

It is actually pretty scary, but I might be wrong. The reason it is scary is that I am pretty confident my 3 validators (who hold 99% of the voting power) did everything correctly, and it was one of the other validators that joined (I have 100+ other validators being run by members of the public) that might somehow be responsible. But if this is in fact true, that means that someone with < 1% voting power had a means to crash the network. I sure hope I am wrong. I think we really need to hunt down and find an explanation for this. It puts me ill at ease knowing this happened twice and that, right now, it is not happening. The best theory so far is a bad validator, yet it really puts the whole point of a decentralized network (with the ensuing security) in a bad light.

I am attaching a log of the console output from "blzd" for one of the nodes. I am unclear how helpful this is, but it is better than nothing. Please scroll down to line 4,666 to see where the issue manifested.
No worries! Looks like the only path here is to try and replicate the issue + reason through the code.
We are running a "Game of Stakes" type competition right now, so I am a bit heads down. But I want to quash/explain this issue, as it is worrisome and should be to others (if it is legitimate and not something dumb on my side). I will try to make it happen intentionally in the next few weeks. I have the ability to launch new testnets pretty quickly. |
Haven't been able to find a reason so far.
Hello @melekes. So, as a bit of fortunate news (well, not that fortunate), this "crash" has occurred again. I reached out to you on Discord but we can continue here too. Here is some output that might look familiar:
It appears to be the exact same problem. When I restart a node that has "crashed" (presumably with the output above), I get the following:
I have kept my nodes running so we can resolve this matter. It is quite worrisome.

I don't want to make assumptions. It is possible this is a result of the code we have in the COSMOS layer. That having been said, it seems like this is a genuine Tendermint issue. I don't see how any sort of code in the COSMOS layer should be able to cause this, even if we intentionally wanted it to be so.

It is worrisome because we want to run a Mainnet soon. But having a crash like this makes that an unwise choice till we positively REPRODUCE this issue, and then resolve it. I obviously cannot sign off on a mainnet with an issue like this that is outstanding.

Thoughts? Please advise what you need from me. Thanks.
Thanks for the reports on this. Can you please provide output from the `/consensus_state` and `/dump_consensus_state` RPC endpoints? Would it be possible for us to get remote access to the RPC endpoints of running nodes so we can dig in a bit ourselves?
Also, to clarify - do these livelocks only happen after a genesis upgrade? And are you changing the chain-id after the genesis upgrade? Technically, once a chain-id has been used, it shouldn't be used again in a genesis restart, because validator signatures from the previous chain could be considered misbehaviour (e.g. a signature for the old block 5 could conflict with a signature for the new block 5 from the same validator and could be seen as evidence, causing that validator to be slashed!). I'm not sure that's relevant in this case to the infinite "invalid proposal block", but it's something to keep in mind in general.
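To illustrate the evidence concern with a toy example (these are not Tendermint's actual evidence types; the names are made up for this sketch): two votes signed by the same validator for the same height and round but different block IDs are indistinguishable from equivocation.

```go
package main

import "fmt"

// Vote is a minimal stand-in for a signed precommit (hypothetical type,
// for illustration only).
type Vote struct {
	ValidatorAddr string
	Height        int64
	Round         int32
	BlockID       string
}

// looksLikeEquivocation is true when the same validator signed two different
// block IDs at the same height and round. Reusing a chain-id can produce
// exactly this: an old-chain signature for height H conflicts with the
// new-chain signature for height H and looks like slashable misbehaviour.
func looksLikeEquivocation(a, b Vote) bool {
	return a.ValidatorAddr == b.ValidatorAddr &&
		a.Height == b.Height &&
		a.Round == b.Round &&
		a.BlockID != b.BlockID
}

func main() {
	oldChainVote := Vote{"VAL1", 5, 0, "BLOCKID_FROM_OLD_CHAIN"}
	newChainVote := Vote{"VAL1", 5, 0, "BLOCKID_FROM_NEW_CHAIN"}
	fmt.Println(looksLikeEquivocation(oldChainVote, newChainVote)) // true
}
```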
Real quick replies to keep the momentum... we did not change or "upgrade" the genesis. I am not 100% sure what a "genesis upgrade" actually is, although I am 70% confident of my guess. We did nothing to the status quo. The network was running for 3+ weeks and then boom -- this occurred! No change to the chain-id. No upgrade. No changes other than validators getting jailed, getting unbonded, and new validators trying to join. All natural, "organic" actions that are part of any Tendermint/COSMOS chain.

Can you clarify what you mean when you say that a chain-id should not be re-used? This is potentially a critical guideline. Indeed, we have reset our network a few times and have re-used the same chain-id.

That having been said, it scares the **** out of me that a network can crash like this. It contradicts the whole "high availability" value proposition of a blockchain to have a "bug" like this. Even if a single validator has "bad DB data" (validator signatures from the previous chain), it makes no sense that this alone could cause the entire network to crash this way, does it? I mean really... if 99.9% of the voting power is OK with things and some validator shows up with bogus/antiquated data, how can that alone cause the whole network to crash like this? I sincerely hope I am wrong about this guess. And even if I am not, it would be great to resolve this. As it stands, our testnet is on its knees... stalled. I don't know how to recover it. Some nodes are running but no new blocks are forthcoming.

I will get on providing answers to your request for output from those two RPC endpoints immediately. That having been said... I and some members of the community are very alarmed this has happened a few times... just to be clear, Bluzelle runs the three validators that collectively hold 97%+ of the voting power. So it is disturbing to think that some arbitrary validator out there with "bad data" could cause this. It is a pretty good DoS attack vector, at a minimum.

Thanks so very much, btw. Appreciate the quick feedback. I think that if this is truly an issue in Tendermint (I am 60% confident it is), it behooves us to resolve it quickly.
Consensus State (link and file attached): http://dev-backup.testnet.public.bluzelle.com:26657/consensus_state

Dump Consensus State (link and file attached): http://dev-backup.testnet.public.bluzelle.com:26657/dump_consensus_state

Can you please confirm you have access at http://dev-backup.testnet.public.bluzelle.com:26657 to the endpoint you are asking for? What else can I provide to assist? Thanks.
Thanks, this is super helpful. Looks like we've identified the problem and managed to replicate this issue. Will publish a fix ASAP.
Thanks. That's great! Please do share here what the problem was, how you replicated it, and how it was resolved. I'd love to see the discussion on this, if it was public. Will we be able to "save" our existing testnet with this fix?
Yes, we will share details on all of that once the fix is released. And we will try to provide a script that should allow you to save your existing testnet. Thanks for your patience!
Closes #4926

The dump consensus state had this:

```
"last_commit": {
  "votes": [
    "Vote{0:04CBBF43CA3E 385085/00/2(Precommit) 1B73DA9FC4C8 42C97B86D89D @ 2020-05-27T06:46:51.042392895Z}",
    "Vote{1:055799E028FA 385085/00/2(Precommit) 652B08AD61EA 0D507D7FA3AB @ 2020-06-28T04:57:29.20793209Z}",
    "Vote{2:056024CFA910 385085/00/2(Precommit) 652B08AD61EA C8E95532A4C3 @ 2020-06-28T04:57:29.452696998Z}",
    "Vote{3:0741C95814DA 385085/00/2(Precommit) 652B08AD61EA 36D567615F7C @ 2020-06-28T04:57:29.279788593Z}",
```

Note there's a precommit in there from the first val from May (2020-05-27), while the rest are from today (2020-06-28). It suggests there's a validator from an old instance of the network at this height (they're using the same chain-id!).

Obviously a single bad validator shouldn't be an issue. But the Commit refactor work introduced a bug. When we propose a block, we get the block.LastCommit by calling MakeCommit on the set of precommits we saw for the last height. This set may include precommits for a different block, and hence the block.LastCommit we propose may include precommits that aren't actually for the last block (but of course +2/3 will be). Before v0.33, we just skipped over these precommits during verification. But in v0.33, we expect all signatures for a blockID to be for the same block ID! Thus we end up proposing a block that we can't verify.
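For readers following along, a simplified sketch of the verification change described above (toy types of my own, not the actual Tendermint implementation, counting signatures rather than voting power for brevity):

```go
package main

import "fmt"

// CommitSig is a toy stand-in for a commit signature (hypothetical type):
// BlockID records which block the precommit was actually for, Valid whether
// the signature itself checks out.
type CommitSig struct {
	BlockID string
	Valid   bool
}

// verifyPreV033 mirrors the old behaviour: precommits for a different block
// ID are simply skipped, and only +2/3 of the set needs to match the
// commit's block.
func verifyPreV033(commitBlockID string, sigs []CommitSig) bool {
	matching := 0
	for _, s := range sigs {
		if s.BlockID != commitBlockID {
			continue // skipped -- a stray precommit is harmless here
		}
		if !s.Valid {
			return false
		}
		matching++
	}
	return matching*3 > len(sigs)*2
}

// verifyV033 mirrors the stricter v0.33 expectation: every included
// signature must be for the commit's block ID, so a single stray precommit
// makes the whole LastCommit unverifiable ("wrong signature").
func verifyV033(commitBlockID string, sigs []CommitSig) bool {
	for _, s := range sigs {
		if s.BlockID != commitBlockID || !s.Valid {
			return false
		}
	}
	return true
}

func main() {
	// Three precommits for the real last block plus one from a validator
	// still running an old instance of the network (different block ID).
	sigs := []CommitSig{
		{"652B08AD61EA", true},
		{"652B08AD61EA", true},
		{"652B08AD61EA", true},
		{"1B73DA9FC4C8", true}, // the stale validator's precommit
	}
	fmt.Println(verifyPreV033("652B08AD61EA", sigs)) // true
	fmt.Println(verifyV033("652B08AD61EA", sigs))    // false
}
```

The sketch only captures the shape of the change: one stale precommit in the assembled LastCommit is harmless under the old check and fatal under the strict one.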
Clarify how to get the `/dump_consensus_state` data. Eg. #4926 indicated: "Not sure how to do this."
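For reference, a minimal way to pull that data, assuming the node exposes the default Tendermint RPC port 26657 (the host here is a placeholder; a plain HTTP GET to the endpoint works just as well):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumes the default Tendermint RPC listen address; substitute the
	// node's actual host and port as needed.
	resp, err := http.Get("http://localhost:26657/dump_consensus_state")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	// The endpoint returns a JSON document describing the node's current
	// consensus state, including the round state and last_commit votes.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		os.Exit(1)
	}
	fmt.Println(string(body))
}
```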
Tendermint version:
Tendermint Core Semantic Version: 0.33.3
P2P Protocol Version: 7
Block Protocol Version: 10
ABCI app:
Cosmos SDK Version: v0.38.3
Big Dipper Explorer URL:
Instructions to set up a similar node (I'd suggest just setting up a sentry):
Access to genesis file for chain:
Sample command to get node info:
Discord channel invite (in case you want to live chat with me... I am Neeraj one of the admins):
Environment:
Using COSMOS SDK v0.38.3. Otherwise, not sure what else to say here.
We are running a testnet chain with our CRUD database as one of the application modules, in COSMOS.
We currently (as of filing this issue) have 5 "sentries" and 3 validators. To be clear, the sentries have no voting power and are the only external peers that the validators talk to (the validators also talk to each other). Furthermore, the validators are IP-firewalled so that they can only talk to the sentries and the other validators. The sentries themselves keep the validator node IDs private.
Sentry hostnames:
I am not listing the validator hostnames, since they are inaccessible (due to the firewall) anyway.
The validators listen only on 26656, and only to validators and sentries. The sentries listen on 26656 and 26657, and each also runs the Cosmos REST server, listening on 1317.
We have opened our testnet to the public. Members of the public have set up sentries and validators of their own, and are expected to use our five sentries as their P2P peers in config.toml.
What happened:
For weeks, things on our testnet had been running fine. I had dozens of members of the public running validators on it, just so these people could learn the process of setting up a validator, etc.
I needed to increase the max # of allowed validators (to something much higher than the default value of 100) in the "app_state/staking/params/max_validators" value in genesis.json. I think that this particular value is a COSMOS thing, but I wanted to mention it for context. We are not using COSMOS governance yet, so we decided to do a hard reset (ie: generate a new genesis.json and start the chain all over).
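For concreteness, a minimal sketch of that genesis edit, assuming a genesis.json laid out the way Cosmos SDK v0.38 produces it (the only path taken from the report above is app_state/staking/params/max_validators; the file path and the target value of 300 are just examples):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Path is an example; point this at the genesis file to be edited.
	raw, err := os.ReadFile("genesis.json")
	if err != nil {
		panic(err)
	}

	var genesis map[string]interface{}
	if err := json.Unmarshal(raw, &genesis); err != nil {
		panic(err)
	}

	// Walk app_state -> staking -> params and raise max_validators.
	// These type assertions will panic if the layout differs from this guess.
	appState := genesis["app_state"].(map[string]interface{})
	staking := appState["staking"].(map[string]interface{})
	params := staking["params"].(map[string]interface{})
	params["max_validators"] = 300 // example: a cap well above the default of 100

	out, err := json.MarshalIndent(genesis, "", "  ")
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("genesis.json", out, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("max_validators updated")
}
```

An edit like this only makes sense as part of the hard reset described here, i.e. before the new chain is started from the regenerated genesis.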
First, here is what I did on my OWN 5 sentries and 3 validators:
Next, here is what I asked the people in the community to do with their validator and/or sentries:
The community slowly started up their daemons.
At some point (within an hour or so, about 2300 blocks in), I started to get the error below. I was getting this on all my sentries and validators. Basically, the chain had completely crashed. I tried to restart my validators and sentries, but this was unrecoverable.
I had no choice but to "reset" the whole chain again. I stopped all my validators and sentries and this time ONLY ran "unsafe-reset-all" on all my daemons. Of course, I also had to do some COSMOS setup again (staking, etc.), but I started everything again and asked the community to again do the same steps listed above with yet another new genesis.json, etc.
Within an hour, the whole network went down again. Effectively the same error (different block, signature HASH this time):
What you expected to happen:
I expect "clean" output, as so:
(The fact that invalidTxs is non-zero is the subject of another investigation)
Have you tried the latest version:
Not sure. I think so, although, looking at the Tendermint GitHub, I see there are two minor versions available that are newer than what we have.
How to reproduce it:
I more or less explained how it came about above in the "what happened" section.
Looking at #2720, I see a similar, though not identical, error message. In that issue, it was suggested that perhaps all the nodes did not start from the same "genesis" state, and that some node(s) may have a stale "home folder" (.blzd, I presume?).
Does "unsafe-reset-all" actually clear out all state including the COSMOS KV stores, app state, etc? I assume this command is sufficient to accomplish a clean slate?
Is it possible that, in "resetting" my chain as I did above, some members of the public forgot to run the "blzd unsafe-reset-all" command, so that when they started their nodes, data from the previous chain was left over, and this somehow brought the whole network down? If so, it is a bit scary that a single node (or even a bunch of them) could do this. It seems like an excellent DoS attack vector, if so.
Logs:
Listed above.
Config:
No specific changes made to Tendermint.
node command runtime flags:
This is all running from within our daemon that was built with the COSMOS SDK.
`/dump_consensus_state` output for consensus bugs:
Not sure how to do this.
Anything else we need to know:
Most details given above.
I did some searching ahead of time to see if I could resolve this myself. I saw some issues related to it but they are already closed.