Implement basic equivocations detection loop #2367
Conversation
- store a Vec<FinalityProof>
- transform prune `buf_limit` to Option
Looks good so far, left some comments
async fn best_finalized_header_number(
    &self,
) -> Result<BlockNumberOf<P::TargetChain>, Self::Error> {
nit: it's a bit inconsistent that this method doesn't allow interrogating the target chain at a particular BlockNumber, while the other methods do.
It returns the best finalized target chain block, which is a target chain property - it doesn't access the runtime code or storage at all. The other methods work with the target chain runtime code/storage/state that was active at some target block. So this method returns an anchor that is later used as at in the other method calls.
Yes, exactly.
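To make that split concrete, here is a minimal sketch of the anchor pattern described above (hypothetical trait and method names, not the crate's actual API):

```rust
/// Hypothetical target-chain client illustrating the "anchor" pattern: the best
/// finalized block number is a chain-level property, and the state-reading
/// methods are anchored at that block via an `at` parameter.
trait TargetChainClientSketch {
    type BlockNumber;
    type Error;

    /// Chain-level query: no runtime code or storage access involved.
    async fn best_finalized_header_number(&self) -> Result<Self::BlockNumber, Self::Error>;

    /// State query, evaluated against the runtime state that was active
    /// at the given (already finalized) target block.
    async fn best_synced_source_header_hash(
        &self,
        at: Self::BlockNumber,
    ) -> Result<Option<[u8; 32]>, Self::Error>;
}
```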
impl<P: FinalityPipeline, SC: SourceClientBase<P>> Default for FinalityProofsStream<P, SC> {
    fn default() -> Self {
        Self::new()
    }
}
nit: FinalityProofsStream seems to only contain an Option<...>, I think you can just derive Default.
You're right. Done.
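For reference, a sketch of the derived version (field layout and generics are simplified here, so treat it as illustrative only):

```rust
/// Illustrative only: a wrapper around an `Option` can usually derive `Default`
/// instead of hand-writing the impl. Note that the std derive also adds an
/// `S: Default` bound, so a manual impl may still be preferable if the stream
/// type itself doesn't implement `Default`.
#[derive(Default)]
struct FinalityProofsStreamSketch<S> {
    /// `None` until the finality proofs subscription has been established.
    stream: Option<S>,
}
```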
if self.from_block_num <= self.until_block_num {
    let mut context = match self.build_context(self.from_block_num).await {
        Some(context) => context,
        None => return,
    };

    self.check_block(self.from_block_num, &mut context).await;

    self.from_block_num = self.from_block_num.saturating_add(1.into());
}
This could be a loop while self.from_block_num <= self.until_block_num {} and do full catch-up checks in a single tick. It doesn't look like that much work to have latency concerns...

Also, I would actually suggest skipping the check of the full (sub)chain and just checking the latest (source highest/best finalized as seen by the target) - only if that has a different hash than the block on the source at the same height do you know it forked somewhere since the last tick, and you can iterate to find the fork. WDYT?
> This could be a loop while self.from_block_num <= self.until_block_num {}

👍 Also @serban300 please note that from_block_num is initialized to zero, so it'll start from genesis - I think it should be initialized to the best target block at the beginning of the loop?

> Also, I would actually suggest skipping the check of the full (sub)chain and just checking the latest (source highest/best finalized as seen by the target) - only if that has a different hash than the block on the source at the same height do you know it forked somewhere since the last tick, and you can iterate to find the fork. WDYT?

I think an attacker(s) may be able to submit { submit_finality_proof(forked_header#100), submit_message(malicious_message), submit_finality_proof(good_header#101) } in a single block. So if you only look at the latest, you may skip this forked_header submission. So imo we should look at all headers here.
> I think an attacker(s) may be able to submit { submit_finality_proof(forked_header#100), submit_message(malicious_message), submit_finality_proof(good_header#101) } in a single block. So if you only look at the latest, you may skip this forked_header submission. So imo we should look at all headers here.

But how would good_header#101 have the correct hash if it's been built on top of bad_header#100? Once a relayer introduces any forked header, all of the subsequent child chain will be a fork, no?
We don't check the connection between headers anywhere. So it could be built on top of another "normal" header. E.g. in the following schema:

99 ---> 100 ---> 101
   \--> 100'

relayer1 may track the bad fork and submit the 100' first, and relayer2 may then submit the 101.
> We don't check the connection between headers anywhere.

Oh, ok, I thought (without checking the code) we did. (I now realize we don't check the chain continuity because we don't want to import every header and we're happy with every Nth header as long as it has a GRANDPA justification.)

Valid scenario then, let's keep the equivocation check for every submitted header 👍
> This could be a loop while self.from_block_num <= self.until_block_num {} and do full catch-up checks in a single tick.

Sounds good. Done.

> Also @serban300 please note that from_block_num is initialized to zero, so it'll start from genesis - I think it should be initialized to the best target block at the beginning of the loop?

Yes, you're right. That was the intention, but somehow I forgot to implement it.

> Also, I would actually suggest skipping the check of the full (sub)chain and just checking the latest (source highest/best finalized as seen by the target) - only if that has a different hash than the block on the source at the same height do you know it forked somewhere since the last tick, and you can iterate to find the fork. WDYT?

> I think an attacker(s) may be able to submit { submit_finality_proof(forked_header#100), submit_message(malicious_message), submit_finality_proof(good_header#101) } in a single block. So if you only look at the latest, you may skip this forked_header submission. So imo we should look at all headers here.

That's true. Also, I think it's good to check for equivocations even if the synced hash == the source hash. There might be cases where we could still find equivocations. And it shouldn't add a big overhead.
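Putting the points of this thread together, here is a hedged, self-contained sketch (all names and types are simplified stand-ins, not the PR's actual code): full catch-up in one tick, starting from the best target block rather than genesis, and checking every submitted header.

```rust
/// Simplified stand-in for the detection state discussed above.
struct EquivocationsDetectorSketch {
    /// Initialized to the best target block when the loop starts (not zero,
    /// so we don't start scanning from genesis).
    from_block_num: u64,
    until_block_num: u64,
}

impl EquivocationsDetectorSketch {
    /// Placeholder: the real code reads the target-chain state here and may fail.
    fn build_context(&self, _block: u64) -> Option<()> {
        Some(())
    }

    /// Placeholder: check all finality proofs submitted at this target block,
    /// not just the latest one, so a forked header squeezed between two good
    /// submissions in the same block is not missed.
    fn check_block(&self, block: u64, _context: &mut ()) {
        println!("checking headers submitted at target block {block}");
    }

    /// One tick of the loop: full catch-up instead of one block per tick.
    fn tick(&mut self) {
        while self.from_block_num <= self.until_block_num {
            let mut context = match self.build_context(self.from_block_num) {
                Some(context) => context,
                // Could not build the context; retry from the same block next tick.
                None => return,
            };
            self.check_block(self.from_block_num, &mut context);
            self.from_block_num = self.from_block_num.saturating_add(1);
        }
    }
}
```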
Looks good - I've left some comments. Most are for future PRs and ideas, so maybe we should just log them somewhere (some may be ignored, I suppose :))
        continue
    },
    false => {
        // Irrecoverable error. End the loop.
I know - you'll be polishing the error handling in future PRs; I just wanted to leave a comment here, because it may be a bit confusing (and I already see that a non-connection error is called "irrecoverable" here). In relays we have two kinds of errors. Connection errors are errors that are likely network errors or jsonrpsee internal errors. If you see a connection error, you need to reconnect the client - no other solution is possible. Other errors (non-connection errors) are assumed to be recoverable - they are caused e.g. by us submitting an obsolete transaction or by the server rejecting our RPC because it is overloaded.

The relay_utils::relay_loop function is supposed to handle the connection errors. So when you're starting a loop using relay_utils::relay_loop, your "run_loop" function is expected to return an error when a connection error is met. Then the RelayLoop itself reconnects to the failed client and restarts run_loop. Other (non-connection) errors are supposed to be handled by the run_loop function itself (e.g. by retrying with some exponential backoff or simply sleeping for some time). Right now the error handling here is implemented the opposite way - you yield to RelayLoop on non-connection errors and do the reconnect yourself.

This isn't a big issue - let's see the future error-handling PR, I just wanted to share that ^^^. Maybe it'll allow you to remove some code here in favor of already existing code.
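A rough sketch of that split (the error type and helper below are illustrative, not the relay-utils API): connection errors bubble up so the outer loop can reconnect, while everything else is retried locally with a backoff.

```rust
use std::time::Duration;

/// Illustrative error type: only the connection/non-connection split matters here.
enum LoopError {
    Connection(String),
    Other(String),
}

impl LoopError {
    fn is_connection_error(&self) -> bool {
        matches!(self, LoopError::Connection(_))
    }
}

/// Handle one failed step inside the inner loop.
async fn handle_step_error(error: LoopError) -> Result<(), LoopError> {
    if error.is_connection_error() {
        // Yield to the outer relay loop, which reconnects the client
        // and restarts the inner loop.
        return Err(error)
    }
    // Non-connection errors are treated as recoverable: back off and let
    // the inner loop continue from where it left off.
    tokio::time::sleep(Duration::from_secs(5)).await;
    Ok(())
}
```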
Thanks for the explanations. Personally, I thought that any error other than a connection error is irrecoverable. For example:
- we can't generate the key ownership proof, because we're too many blocks ahead;
- let's say that the chains were updated and the relayer has old data structures, so it can't encode/decode them;
- etc.

I think we should take this into consideration, because otherwise we risk getting stuck in a retry loop. But it's true that other errors could be recoverable. So I don't know what would be best here. It seems like error handling will be a complex problem for a future PR.

For the moment I'm changing the strategy to consider every error recoverable: just sleep a bit and skip the item that led to the error when possible.
> we risk getting stuck in a retry loop

That's true. Normally (in other relay loops) we'll eventually break out of this loop when we read the updated state and realize that something has changed. But until then we'll keep retrying. Which is fine (imo), unless we spam the RPC server with failing RPC requests without some backoff.

In your examples:
- "we can't generate the key ownership proof, because we're too many blocks ahead": then the loop should just log an error and go further (imo);
- "let's say that the chains were updated and the relayer has old data structures, so it can't encode/decode them etc": that's definitely a maintenance flaw - the best we could do here is to have a backoff mechanism.

Like, your current loop implementation would just exit the process if you encounter any non-connection error. Is that good? E.g. if we can't generate the ownership proof, then we could just keep working with the next justifications.

But as I said - in the end it is up to you how to handle issues :)
Poll::Ready(tx_status) => {
    match tx_status {
        TrackedTransactionStatus::Lost => {
            log::error!(target: "bridge", "Equivocation report tx was lost");
Possible improvements for future PRs:
- let's also log some report-related info here + when submitting. At least - which block/round/validator has caused the misbehavior (a possible shape is sketched below);
- apart from having the justifications db (still talking about future updates), we may also write all submitted reports there, to be able to resubmit them manually if the automatic submit has failed.
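A possible shape for the extra logging in the first bullet (purely illustrative; the helper name, parameters and types are made up here):

```rust
/// Hypothetical helper: log enough context about a lost report transaction
/// (round, source block, offender) to make manual follow-up possible.
fn log_lost_equivocation_report(round: u64, source_block: &str, offender: &str) {
    log::error!(
        target: "bridge",
        "Equivocation report tx was lost (round: {round}, block: {source_block}, offender: {offender})",
    );
}
```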
- Totally agree. I was also planning this for a future PR.
- Could you expand on this please? I'm not sure what the justifications db is.
- I mean that in the future I think we should keep (store) all seen justifications in some database (something I've been talking about here). This would allow the ED to outlive connection issues with the target node (we don't simply lose justifications on reconnect/restart) + if we have failed to submit a justification (don't know the reason - just imagine it has happened - e.g. we have failed to generate the key ownership proof), we could then just get some alert (based on ED logs) and submit it manually using the justifications db. Please note that justifications are not stored anywhere, so if we've missed one, it is just lost. And even if we see that something bad has happened, we have no means to submit an equivocation proof - it requires a justification (actually a vote).

But that's stuff for the future.
    at: P::Hash,
    equivocation: P::EquivocationProof,
) -> Result<(), SC::Error> {
    let pending_report = source_client.report_equivocation(at, equivocation).await?;
A side note (maybe worth exploring + thinking about later, maybe not): if report_equivocation transactions are signed and we submit such transactions from multiple ED loops, we'll lose funds proportional to the number of such loops (i.e. tx_cost * count(loops)). Of course, that is not that important - because we're talking about validator misbehavior here - but maybe we need to check e.g. whether the validator has already been slashed (i.e. the report has already been submitted). Or maybe there's already such a check in the report_equivocation call?
It looks like report_equivocation() doesn't refund in case of duplicate equivocation reports. And it makes sense to be like this, because otherwise it could be a DOS vector.

Currently I don't think there is any way to check whether a report was already submitted - there doesn't seem to be any runtime API for getting this information. But it's a good point. Maybe we could add a runtime API method for this.
afaict valid reports are free even if they are duplicates: https://github.com/paritytech/substrate/blob/e806cefa1ebf0edfe493dafa184f7e50d8772a68/frame/grandpa/src/lib.rs#L212 (as long as they're valid)

The offences pallet will record all offence reports within a session and, at the end of it, apply slashing once to all offending validators, with the slash amount growing exponentially with the number of offenders in the same session.

Later edit: actually, it seems duplicate reports are treated as "errors" -> invalid report -> not free (https://github.com/paritytech/substrate/blob/db6ebf564bcdfdfaf5fd026bb321de6f7d7e6fc0/frame/offences/src/lib.rs#L121C24-L121C52)
Maybe we could read it from here. In any case - it isn't a big deal, just some possible improvement :)
> afaict valid reports are free even if duplicates: https://github.com/paritytech/substrate/blob/e806cefa1ebf0edfe493dafa184f7e50d8772a68/frame/grandpa/src/lib.rs#L212 (as long as they're valid)

If so, then anyone could just spam the network for free with report_equivocation transactions. Either that is true (and then the issue is in report_equivocation), or it internally checks for duplicates and returns an error here.
@svyatonik @acatangiu thanks for the review. I addressed the comments and also created #2377 for tracking the things that need to be done in future PRs. PTAL when you have time.
let mut context =
    match self.build_equivocation_reporting_context(current_block_number).await {
        Some(context) => context,
        None => return,
I don't think we want to give up completely if try_read_from_target() in self.build_equivocation_reporting_context() returns Err() or Ok(None); we could just do another tick and try again from where we left off.

Suggested change: replace None => return, with None => break
👍 Maybe even continue instead of break?
Done. Sorry, leftover from before, when I was treating all non-connection errors as unrecoverable.
With continue this becomes a busy-wait loop, continuously retrying try_read_from_target() until Ok(Some(_)); with break we'd at least get some rate limiting from tick...

I guess both are fine if we expect try_read_from_target() to return Ok(Some(_)) eventually...
True. Sorry, I changed it to continue in the end and merged the PR in the meantime. But I will fix it when implementing better error management - that's the next step for the equivocation loop. Will keep this in mind.
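For illustration, a sketch of the trade-off discussed in this thread (names, signatures and the tick interval are stand-ins): break defers to the delay between ticks for rate limiting, while continue retries the same read immediately.

```rust
use std::time::Duration;

/// Stand-in for the real target-chain read, which may transiently fail.
async fn try_read_from_target(_block: u64) -> Result<Option<()>, ()> {
    Ok(Some(()))
}

/// One "tick": walk the pending blocks, reading the reporting context for each.
async fn tick(mut block: u64, until: u64) {
    while block <= until {
        let _context = match try_read_from_target(block).await {
            Ok(Some(context)) => context,
            // `break`: give up for this tick; the sleep between ticks (in `run`
            // below) rate-limits the retries. A `continue` here would retry the
            // same block immediately and, if the read keeps failing, the loop
            // degenerates into a busy-wait.
            Ok(None) | Err(()) => break,
        };
        // ... check `_context` and report any detected equivocations ...
        block = block.saturating_add(1);
    }
}

/// Outer loop: the delay between ticks is what provides the rate limiting.
async fn run() {
    loop {
        tick(0, 10).await;
        tokio::time::sleep(Duration::from_secs(6)).await;
    }
}
```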
LGTM, modulo @acatangiu suggestion
* FinalityProofsBuf adjustments: store a Vec<FinalityProof>; transform prune `buf_limit` to Option
* FinalityProof: add target_header_hash()
* Target client: implement best_synced_header_hash()
* Implement first version of the equivocations detection loop
* Address code review comments
* Leftover
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/polkadot-kusama-bridge/2971/5
Related to #2400
First version of the equivocation detection loop.
Known limitations that will be addressed in future PRs: