
Implement basic equivocations detection loop #2367

Merged (6 commits, Aug 23, 2023)

Conversation

serban300 (Collaborator) commented:

Related to #2400

First version of the equivocation detection loop.

Known limitations that will be addressed in future PRs:

  • the loop is implemented but not used anywhere yet; it probably needs to be started together with the finality loop
  • no unit tests
  • limited error management: in case of an error, we log it, try to reconnect and skip the item; maybe we should retry instead
  • no integration test

@serban300 serban300 self-assigned this Aug 22, 2023
@serban300 serban300 changed the title Implement first version of the equivocations detection loop Implement basic equivocations detection loop Aug 22, 2023
@acatangiu (Collaborator) left a comment:
Looks good so far, left some comments

Comment on lines +71 to +73
async fn best_finalized_header_number(
    &self,
) -> Result<BlockNumberOf<P::TargetChain>, Self::Error> {
Collaborator:

nit: it's a bit inconsistent that this method doesn't allow interrogating the target chain at a particular BlockNumber, while the other methods do.

Contributor:

It returns the best finalized target chain block, which is a property of the target chain itself - it doesn't access the runtime code or storage at all. The other methods work with the target chain runtime code/storage/state that was active at some target block. So this method returns an anchor that is later used as the `at` parameter in the other method calls.
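
For illustration, a minimal sketch of the anchor pattern being described - the trait name, method signatures and associated types below are assumptions, not the crate's actual client trait (only best_finalized_header_number and best_synced_header_hash appear in this PR):

// Illustrative only: a simplified stand-in for the target client trait,
// showing the "anchor" pattern described above.
#[async_trait::async_trait]
trait TargetClientSketch {
    type BlockNumber;
    type Hash;
    type Error;

    // Chain-level property: no `at` parameter; the result is the anchor.
    async fn best_finalized_header_number(&self) -> Result<Self::BlockNumber, Self::Error>;

    // State queries take the anchor as `at` and read the state active at that block.
    async fn best_synced_header_hash(
        &self,
        at: Self::BlockNumber,
    ) -> Result<Option<Self::Hash>, Self::Error>;
}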

Collaborator (author):

Yes, exactly.

Comment on lines 72 to 76
impl<P: FinalityPipeline, SC: SourceClientBase<P>> Default for FinalityProofsStream<P, SC> {
    fn default() -> Self {
        Self::new()
    }
}
Collaborator:

nit: FinalityProofsStream seems to only contain an Option<...>; I think you can just derive Default.
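
A quick sketch of the idea on a simplified, non-generic stand-in (the real struct is generic; note that std's #[derive(Default)] would also add Default bounds on the type parameters, which may or may not matter here):

// Sketch only: since Option<T> is Default for any T, the manual impl above
// collapses to a derive. The field type below is a placeholder.
#[derive(Default)]
struct FinalityProofsStreamSketch {
    stream: Option<Vec<u8>>,
}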

Collaborator (author):

You're right. Done.

relays/equivocation/src/reporter.rs (resolved)
relays/equivocation/src/equivocation_loop.rs (resolved)
Comment on lines 266 to 275
if self.from_block_num <= self.until_block_num {
    let mut context = match self.build_context(self.from_block_num).await {
        Some(context) => context,
        None => return,
    };

    self.check_block(self.from_block_num, &mut context).await;

    self.from_block_num = self.from_block_num.saturating_add(1.into());
}
Collaborator:

This could be a loop while self.from_block_num <= self.until_block_num {} and do full catch-up checks in a single tick.

Doesn't look like enough work to raise latency concerns.

Also I would actually suggest to skip checking the full (sub)chain and just check the latest (source highest/best finalized as seen by target) - if that has a different hash than the block on the source at the same height, then you know it forked somewhere since the last tick and you can iterate to find the fork. WDYT?
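
A minimal sketch of the suggested change, rewriting the snippet quoted above in place (same fragment, not standalone code; field and method names are taken from the snippet):

// Catch up over the whole range in a single tick instead of one block per tick.
while self.from_block_num <= self.until_block_num {
    let mut context = match self.build_context(self.from_block_num).await {
        Some(context) => context,
        None => return,
    };

    self.check_block(self.from_block_num, &mut context).await;

    self.from_block_num = self.from_block_num.saturating_add(1.into());
}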

Contributor:

while self.from_block_num <= self.until_block_num {}

👍 Also @serban300 please note that from_block_num is initialized to zero, so it'll start from genesis - I think it should be initialized to the best target block at the beginning of the loop?

Also I would actually suggest to skip checking the full (sub)chain and just check the latest (source highest/best finalized as seen by target) - if that has a different hash than the block on the source at the same height, then you know it forked somewhere since the last tick and you can iterate to find the fork. WDYT?

I think an attacker may be able to submit { submit_finality_proof(forked_header#100), submit_message(malicious_message), submit_finality_proof(good_header#101) } in a single block. So if you just look at the latest header, you may skip this forked_header submission. So imo we should look at all headers here

Collaborator:

I think an attacker may be able to submit { submit_finality_proof(forked_header#100), submit_message(malicious_message), submit_finality_proof(good_header#101) } in a single block. So if you just look at the latest header, you may skip this forked_header submission. So imo we should look at all headers here

but how would good_header#101 have the correct hash if it's built on top of bad_header#100? Once a relayer introduces any forked header, the whole subsequent child chain will be a fork, no?

Contributor:

We don't check the connection between headers anywhere, so it could be built on top of another "normal" header. E.g. in the following schema:

99 ---> 100 ---> 101
     \--> 100' 

relayer1 may track the bad fork and submit 100' first, and relayer2 may then submit 101.

@acatangiu (Collaborator), Aug 23, 2023:

We don't check the connection between headers anywhere.

oh, ok, I thought (without checking the code) we did. (I now realize we don't check chain continuity because we don't want to import every header and we're happy with every Nth header as long as it has a GRANDPA justification.)

valid scenario then, let's keep the equivocation check for every submitted header 👍

Collaborator (author):

This could be a loop while self.from_block_num <= self.until_block_num {} and do full catch-up checks in a single tick.

Sounds good. Done.

Also @serban300 please note that from_block_num is initialized to zero, so it'll start from genesis - I think it should be initialized to the best target block at the beginning of the loop?

Yes, you're right. That was the intention, but somehow I forgot to implement it.

Also I would actually suggest to skip checking the full (sub)chain and just check the latest (source highest/best finalized as seen by target) - if that has a different hash than the block on the source at the same height, then you know it forked somewhere since the last tick and you can iterate to find the fork. WDYT?

I think an attacker may be able to submit { submit_finality_proof(forked_header#100), submit_message(malicious_message), submit_finality_proof(good_header#101) } in a single block. So if you just look at the latest header, you may skip this forked_header submission. So imo we should look at all headers here

That's true. Also I think it's good to check for equivocations even if the synced hash == source hash; there might be cases where we could still find equivocations, and it shouldn't add a big overhead.

@svyatonik (Contributor) left a comment:

Looks good - I've left some comments. Most are for future PRs and ideas, so maybe we should just log them somewhere (some may be ignored, I suppose :))

relays/equivocation/src/equivocation_loop.rs (outdated, resolved)
        continue
    },
    false => {
        // Irrecoverable error. End the loop.
Contributor:

I know - you'll be polishing the error handling in future PRs, I just wanted to leave a comment here because it may be a bit confusing (and I already see that a non-connection error is called "irrecoverable" here). In relays we have two kinds of errors. Connection errors are likely network errors or jsonrpsee internal errors; if you see a connection error, you need to reconnect the client - no other solution is possible. Other (non-connection) errors are assumed to be recoverable - they are caused e.g. by us submitting an obsolete transaction, or by the server rejecting our RPC because it is overloaded.

The relay_utils::relay_loop function is supposed to handle the connection errors. So when you start a loop using relay_utils::relay_loop, your "run_loop" function is expected to return an error when a connection error is met; the RelayLoop itself then reconnects to the failed client and restarts run_loop. Other (non-connection) errors are supposed to be handled by the run_loop function itself (e.g. by retrying with some exponential backoff or simply sleeping for some time). Right now the error handling here is implemented the opposite way - you yield to RelayLoop on non-connection errors and do the reconnect yourself.

This isn't a big issue - let's see the future error-handling PR, I just wanted to share that ^^^. Maybe it'll allow you to remove some code here in favor of already existing code.
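
For context, a hedged sketch of the described split - relay_utils::relay_loop is real, but every other name below (the error type, do_one_tick, the backoff) is illustrative and not the crate's actual API:

use std::time::Duration;

// Illustrative error type: the real relays classify connection errors differently.
#[derive(Debug)]
enum LoopError {
    Connection(String),
    Other(String),
}

impl LoopError {
    fn is_connection_error(&self) -> bool {
        matches!(self, LoopError::Connection(_))
    }
}

// Placeholder for one iteration of the actual loop work.
async fn do_one_tick() -> Result<(), LoopError> {
    Ok(())
}

// A run_loop-style function following the description above: connection errors
// are returned so the outer relay loop can reconnect the client and restart us;
// non-connection errors are handled locally with a simple backoff.
async fn run_loop_sketch() -> Result<(), LoopError> {
    loop {
        match do_one_tick().await {
            Ok(()) => {},
            Err(error) if error.is_connection_error() => return Err(error),
            Err(error) => {
                eprintln!("recoverable error: {error:?}, retrying after backoff");
                async_std::task::sleep(Duration::from_secs(5)).await;
            },
        }
    }
}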

Collaborator (author):

Thanks for the explanations. Personally I thought that any error other than a connection error is irrecoverable. For example:

  • we can't generate the key ownership proof, because we're too many blocks ahead;
  • let's say that the chains were updated and the relayer has old data structures, so it can't encode/decode them; etc.

I think we should take this into consideration, because otherwise we risk getting stuck in a retry loop. But it's true that other errors could be recoverable, so I don't know what would be best here. It seems like error handling will be a complex problem for a future PR.

For the moment I'm changing the strategy to consider every error recoverable: just sleep a bit and skip the item that led to the error when possible.

Contributor:

we risk getting stuck in a retry loop

That's true. Normally (in other relay loops) we eventually break out of this loop when we read the updated state and realize that something has changed. But until then we'll keep retrying, which is fine (imo), unless we spam the RPC server with failing RPC requests without some backoff.

In your examples:

  • we can't generate the key ownership proof, because we're too many blocks ahead: then the loop should just log an error and go further (imo);
  • let's say that the chains were updated and the relayer has old data structures, so it can't encode/decode them: that's definitely a maintenance flaw - the best we could do here is to have a backoff mechanism. Your current loop implementation would just exit the process if you encountered any non-connection error - is that good? E.g. if we can't generate an ownership proof, we could just keep working with the next justifications.

But as I said - in the end it is up to you how to handle issues :)

relays/equivocation/src/reporter.rs (resolved)
Poll::Ready(tx_status) => {
    match tx_status {
        TrackedTransactionStatus::Lost => {
            log::error!(target: "bridge", "Equivocation report tx was lost");
Contributor:

Possible improvement for the future PRs:

  1. let's also log some report-related info here and when submitting - at least which block/round/validator caused the misbehavior;
  2. apart from having a justifications db (still talking about future updates), we may also write all submitted reports there, to be able to resubmit them manually if the automatic submission has failed (see the sketch after this list)
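
A purely hypothetical sketch of what such a store could look like - none of these types exist in the repo, they only illustrate the "write all submitted reports to a db for manual resubmission" idea:

use std::collections::BTreeMap;

// Hypothetical record of one equivocation report, kept so a failed automatic
// submission can be retried manually later.
#[derive(Clone)]
struct StoredReport {
    round: u64,             // GRANDPA round in which the equivocation was seen
    encoded_proof: Vec<u8>, // encoded equivocation proof (encoding is an assumption)
    submitted: bool,        // whether the automatic submission succeeded
}

#[derive(Default)]
struct ReportsDb {
    // (round, block number) -> report; an in-memory map stands in for a real db.
    reports: BTreeMap<(u64, u32), StoredReport>,
}

impl ReportsDb {
    fn record(&mut self, block: u32, report: StoredReport) {
        self.reports.insert((report.round, block), report);
    }

    // Reports that still need manual resubmission.
    fn pending(&self) -> impl Iterator<Item = &StoredReport> {
        self.reports.values().filter(|report| !report.submitted)
    }
}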

Collaborator (author):

  1. Totally agree. I was also planning this for a future PR.
  2. Could you expand on this please? I'm not sure what the justifications db is.

Contributor:

  1. I mean that in the future I think we should keep (store) all seen justifications in some database (something I've been talking about here). This would allow the ED loop to outlive connection issues with the target node (we wouldn't simply lose justifications on reconnect/restart), and if we failed to submit a justification (don't know the reason - just imagine it has happened, e.g. we failed to generate a key ownership proof), we could then just get an alert (based on ED logs) and submit it manually using the justifications db. Please note that justifications are not stored anywhere, so if we've missed one, it is just lost. And even if we see that something bad has happened, we have no means to submit an equivocation proof - it requires a justification (actually a vote).

But that's a stuff for the future.

    at: P::Hash,
    equivocation: P::EquivocationProof,
) -> Result<(), SC::Error> {
    let pending_report = source_client.report_equivocation(at, equivocation).await?;
Contributor:

A side note (maybe worth exploring and thinking about later, maybe not): if report_equivocation transactions are signed and we submit such transactions from multiple ED loops, we'll lose funds proportional to the number of such loops (i.e. tx_cost * count(loops)). Of course, that is not that important - because we're talking about validator misbehavior here - but maybe we need to check e.g. whether the validator has already been slashed (i.e. the report has already been submitted). Or maybe there's already such a check in the report_equivocation call?

Collaborator (author):

It looks like report_equivocation() doesn't refund in case of duplicate equivocation reports. And it makes sense for it to be like this, because otherwise it could be a DOS vector.

And currently I don't think there is any way to check if a report was already submitted - there doesn't seem to be any runtime API for getting this information. But it's a good point. Maybe we could add a runtime API method for this.
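
A hypothetical sketch of what such a runtime API could look like - this API does not exist anywhere, and the key type is an assumption (GRANDPA authorities use ed25519 keys):

// Hypothetical runtime API for asking whether an equivocation offence has
// already been reported; not part of Substrate or this repo.
sp_api::decl_runtime_apis! {
    pub trait GrandpaEquivocationStatusApi {
        /// Returns true if an equivocation by `offender` in the given GRANDPA
        /// `set_id` and `round` has already been reported on-chain.
        fn is_equivocation_reported(
            set_id: u64,
            round: u64,
            offender: sp_core::ed25519::Public,
        ) -> bool;
    }
}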

@acatangiu (Collaborator), Aug 23, 2023:

afaict valid reports are free even if duplicates: https://github.com/paritytech/substrate/blob/e806cefa1ebf0edfe493dafa184f7e50d8772a68/frame/grandpa/src/lib.rs#L212 (as long as they're valid)

the offences pallet will record all offence reports within a session and, at the end of it, apply slashing once to all offending validators, with the slash amount growing exponentially with the number of offenders in the same session

L.E.: actually, it seems duplicate reports are treated as "errors" -> invalid report -> not free
(https://github.com/paritytech/substrate/blob/db6ebf564bcdfdfaf5fd026bb321de6f7d7e6fc0/frame/offences/src/lib.rs#L121C24-L121C52)

Contributor:

Maybe we could read it from here. In any case - it isn't a big deal, just some possible improvement :)

Contributor:

afaict valid reports are free even if duplicates: https://github.com/paritytech/substrate/blob/e806cefa1ebf0edfe493dafa184f7e50d8772a68/frame/grandpa/src/lib.rs#L212 (as long as they're valid)

If so, then anyone could just spam the network for free with report_equivocation transactions. Either that is true (and then the issue is in report_equivocation), or it internally checks for duplicates and returns an error here.


@serban300 (Collaborator, author):
@svyatonik @acatangiu thanks for the review. I addressed the comments and also created #2377 for tracking the things that need to be done in future PRs. PTAL when you have time.


let mut context =
    match self.build_equivocation_reporting_context(current_block_number).await {
        Some(context) => context,
        None => return,
Collaborator:

I don't think we want to give up completely if try_read_from_target() in self.build_equivocation_reporting_context() returns Err(_) or Ok(None); we could just do another tick and try again from where we left off.

Suggested change:
-    None => return,
+    None => break

Contributor:

👍 Maybe even continue instead of break?

Collaborator (author):

Done. Sorry, that was a leftover from before, when I was treating all non-connection errors as unrecoverable.

Collaborator:

With continue this becomes a busy-wait loop, continuously retrying try_read_from_target() until it returns Ok(Some(_)); with break we'd at least get some rate limiting from the tick...

I guess both are fine if we expect try_read_from_target() to return Ok(Some(_)) eventually...

@serban300 (Collaborator, author), Aug 23, 2023:

True. Sorry, I changed it to continue in the end and merged the PR in the meantime. I will fix it when implementing better error management - that's the next step for the equivocation loop. Will keep this in mind.

@svyatonik (Contributor) left a comment:

LGTM, modulo @acatangiu's suggestion

@serban300 serban300 merged commit b995ac0 into paritytech:master Aug 23, 2023
serban300 added a commit to serban300/parity-bridges-common that referenced this pull request Aug 23, 2023
* FinalityProofsBuf adjustments

- store a Vec<FinalityProof>
- transform prune `buf_limit` to Option

* FinalityProof: add target_header_hash()

* Target client: implement best_synced_header_hash()

* Implement first version of the equivocations detection loop

* Address code review comments

* Leftover
serban300 added a commit that referenced this pull request Aug 23, 2023
* Implement basic equivocations detection loop (#2367)

* FinalityProofsBuf adjustments

- store a Vec<FinalityProof>
- transform prune `buf_limit` to Option

* FinalityProof: add target_header_hash()

* Target client: implement best_synced_header_hash()

* Implement first version of the equivocations detection loop

* Address code review comments

* Leftover

* polkadot-staging adjustments
@serban300 serban300 deleted the master-equivocation-3 branch September 5, 2023 11:06
@Polkadot-Forum

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-kusama-bridge/2971/5

serban300 added a commit to serban300/parity-bridges-common that referenced this pull request Mar 27, 2024
serban300 added a commit to serban300/parity-bridges-common that referenced this pull request Apr 8, 2024
bkontur pushed a commit that referenced this pull request May 7, 2024