
Retry.WithBackoff for all agents. #1377

Merged: dwasse merged 10 commits into master from feat/retries_for_all_agents on Oct 11, 2023
Conversation

CryptoMaxPlanck (Contributor) commented Sep 14, 2023

Description
This PR adds the retry.WithBackoff functionality to the agents' RPC calls.

Summary by CodeRabbit

  • New Feature: Implemented a retry mechanism across various functions in the Executor, Guard, and Notary modules. This enhancement increases the resilience of contract calls, improving overall system reliability.
  • Refactor: Updated the NewExecutor, NewGuard, and NewNotary functions to initialize the retry configuration based on the MaxRetrySeconds setting.
  • Chore: Added debug print statements in the handleStatusUpdated and updateAgentStatus functions for better visibility during execution.
  • New Feature: Introduced a new configuration field MaxRetrySeconds to customize the retry behavior.
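
For orientation, the core pattern this PR applies everywhere is small: wrap the fallible RPC or contract call in a closure and hand it to retry.WithBackoff together with a configurator list. Below is a minimal, self-contained sketch; the retry API usage mirrors the hunks reviewed later in this conversation, and doContractCall is a hypothetical placeholder.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/synapsecns/sanguine/core/retry"
)

func main() {
    // The agents default MaxRetrySeconds to 30 when unset (see the constructor hunks below).
    maxRetrySeconds := uint32(30)
    retryConfig := []retry.WithBackoffConfigurator{
        retry.WithMaxAttemptTime(time.Second * time.Duration(maxRetrySeconds)),
    }

    // doContractCall stands in for real calls such as IsValidState or GetAgentRoot.
    doContractCall := func(ctx context.Context) error {
        return nil // a real implementation would surface the RPC error here
    }

    if err := retry.WithBackoff(context.Background(), doContractCall, retryConfig...); err != nil {
        fmt.Printf("call failed after retries: %v\n", err)
    }
}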

coderabbitai bot (Contributor) commented Sep 14, 2023

Walkthrough

This update introduces a retry mechanism to enhance the resilience of contract and RPC calls across various components. It also includes minor changes to error handling, logging, and debugging. The retry logic is implemented using the retry.WithBackoff function from the github.com/synapsecns/sanguine/core/retry package.

Changes

File(s) and summary:

  • agents/.../executor/executor.go, agents/.../executor/executor_utils.go: Introduced a retry mechanism for contract calls in the Executor component. Added a new field retryConfig to the Executor struct and modified several methods to use the retry.WithBackoff function.
  • agents/.../guard/guard.go: Implemented a retry mechanism for contract calls in the Guard component. Set a default value for cfg.MaxRetrySeconds and used the retry logic in the loadSummitLatestStates and loadOriginLatestStates functions.
  • agents/.../notary/notary.go: Incorporated a retry mechanism for contract calls in the Notary component. Added a new field retryConfig to the Notary struct and modified several functions to use the retry logic.
  • agents/config/executor/config.go: Added a new field MaxRetrySeconds of type uint32 to the Config struct.
  • agents/agents/guard/fraud.go: Added fmt.Printf statements for debugging purposes.
  • agents/README.md: Removed an empty line.
  • session.vim: Included a series of Vim commands that manipulate buffers, windows, and tabs.

🐇💻

In the land of code where the shadows lie,
We retry, we persist, we never say die.
When the network's slow, or the server's down,
We don't despair, we don't frown.
With backoff and retry, we stand tall,
For resilience is the mightiest of all! 🎉


@github-actions bot added the go (Pull requests that update Go code) and size/s labels on Sep 14, 2023
coderabbitai bot left a comment
Review Status

Actionable comments generated: 2

Commits: files that changed from the base of the PR, between commits 96da36b and 512d9fa.
Files selected for processing (4)
  • agents/agents/executor/executor.go (9 hunks)
  • agents/agents/executor/executor_utils.go (3 hunks)
  • agents/agents/guard/fraud.go (4 hunks)
  • agents/config/executor/config.go (2 hunks)
Files skipped from review due to trivial changes (1)
  • agents/agents/guard/fraud.go
Additional comments (Suppressed): 12
agents/config/executor/config.go (2)
  • 3-15: The import order has been changed. Ensure that this change is consistent with the project's import ordering conventions.

  • 45-50: A new field MaxRetrySeconds has been added to the Config struct. This field represents the maximum number of seconds to retry an RPC call (not a transaction). Make sure that all instances where the Config struct is used have been updated to include this new field, and that it is properly initialized where necessary.

agents/agents/executor/executor.go (9)
  • 11-16: The import of github.com/jpillora/backoff has been removed and replaced with github.com/synapsecns/sanguine/core/retry. Ensure that all instances of the old backoff mechanism have been replaced with the new retry mechanism.

  • 23-29: The import of github.com/synapsecns/sanguine/core/retry has been added. This is consistent with the PR summary which mentions the introduction of a new retry mechanism.

  • 87-94: A new field retryConfig of type []retry.WithBackoffConfigurator has been added to the Executor struct. This is consistent with the PR summary which mentions adding a new field for the retry configuration.

  • 150-159: The retryConfig is being initialized based on the MaxRetrySeconds value from the executor.Config. This is consistent with the PR summary which mentions initializing the retry configuration in the NewExecutor function.

  • 245-251: The retryConfig is being passed to the Executor struct during its initialization. This is consistent with the changes made in lines 150-159 where the retryConfig was initialized.

  • 347-362: The retry.WithBackoff function is being used to retry the contract call if it fails. This is consistent with the PR summary which mentions using the retry mechanism for certain contract calls.

  • 440-470: The retry.WithBackoff function is being used to retry the contract call if it fails. This is consistent with the PR summary which mentions using the retry mechanism for certain contract calls.

  • 557-581: The retry.WithBackoff function is being used to retry the chain call if it fails. This is consistent with the PR summary which mentions using the retry mechanism for certain contract calls.

  • 637-662: The retry.WithBackoff function is being used to retry the contract call if it fails. This is consistent with the PR summary which mentions using the retry mechanism for certain contract calls.

agents/agents/executor/executor_utils.go (1)
  • 4-16: The import of the github.com/synapsecns/sanguine/core/retry package is a good replacement for the previously used github.com/jpillora/backoff package. The new retry mechanism seems to be more flexible and configurable, which should improve error handling in network-related operations.
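
For reference, the MaxRetrySeconds field reviewed above in agents/config/executor/config.go has roughly this shape. This is a sketch only: the doc comment paraphrases the review, and the yaml tag name is an assumption.

type Config struct {
    // ...existing fields elided...

    // MaxRetrySeconds is the maximum number of seconds to retry an RPC call (not a transaction).
    MaxRetrySeconds uint32 `yaml:"max_retry_seconds"`
}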

Comment on lines 159 to 177
        return fmt.Errorf("could not encode state: %w", err)
    }
    // Verify that the state is valid w.r.t. Origin.
-   valid, err := e.chainExecutors[state.Origin()].boundOrigin.IsValidState(
-       ctx,
-       statePayload,
-   )
+   var valid bool
+   contractCall := func(ctx context.Context) error {
+       valid, err = e.chainExecutors[state.Origin()].boundOrigin.IsValidState(
+           ctx,
+           statePayload,
+       )
+       if err != nil {
+           return fmt.Errorf("could not check validity of state: %w", err)
+       }
+
+       return nil
+   }
+   err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
    if err != nil {
        return fmt.Errorf("could not check validity of state: %w", err)
    }
coderabbitai bot:

The use of the retry.WithBackoff function here is a good improvement over the previous implementation. It provides a cleaner way to handle retries on the contract call. However, ensure that the context passed to the retry.WithBackoff function has a timeout set to prevent indefinite retries in case of persistent errors.

- err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
+ ctxWithTimeout, cancel := context.WithTimeout(ctx, time.Duration(e.MaxRetrySeconds)*time.Second)
+ defer cancel()
+ err = retry.WithBackoff(ctxWithTimeout, contractCall, e.retryConfig...)

Comment on lines 208 to +224
    }

    // If the attestation is on a remote chain, we need to fetch the timestamp via an RPC call.
-   b := &backoff.Backoff{
-       Factor: 2,
-       Jitter: true,
-       Min:    30 * time.Millisecond,
-       Max:    3 * time.Second,
-   }
-
-   timeout := time.Duration(0)
-
    var logHeader *ethTypes.Header
    var err error
+   contractCall := func(ctx context.Context) error {
+       logHeader, err = e.chainExecutors[chainID].rpcClient.HeaderByNumber(ctx, big.NewInt(int64(logBlockNumber)))
+       if err != nil {
+           return fmt.Errorf("could not get log header: %w", err)
+       }
+
+       return nil
+   }
-
-retryLoop:
-   for {
-       select {
-       case <-ctx.Done():
-           return fmt.Errorf("context canceled: %w", ctx.Err())
-       case <-time.After(timeout):
-           if b.Attempt() >= rpcRetry {
-               return fmt.Errorf("could not get log header: %w", err)
-           }
-
-           logHeader, err = e.chainExecutors[chainID].rpcClient.HeaderByNumber(ctx, big.NewInt(int64(logBlockNumber)))
-           if err != nil {
-               timeout = b.Duration()
-
-               continue
-           }
-
-           return nil
-       }
-   }
+   err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
+   if err != nil {
+       return fmt.Errorf("could not get log header: %w", err)
+   }
coderabbitai bot:

Again, the use of retry.WithBackoff is a good improvement. But similar to the previous comment, ensure that the context passed to the retry.WithBackoff function has a timeout set to prevent indefinite retries in case of persistent errors.

- err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
+ ctxWithTimeout, cancel := context.WithTimeout(ctx, time.Duration(e.MaxRetrySeconds)*time.Second)
+ defer cancel()
+ err = retry.WithBackoff(ctxWithTimeout, contractCall, e.retryConfig...)

coderabbitai bot left a comment

Review Status

Actionable comments generated: 4

Commits: files that changed from the base of the PR, between commits 512d9fa and 8b70e8f.
Files ignored due to filter (1)
  • agents/go.mod
Files selected for processing (1)
  • agents/agents/executor/executor.go (9 hunks)
Additional comments (Suppressed): 5
agents/agents/executor/executor.go (5)
  • 11-16: The import statement for the backoff package has been removed and no new issues are introduced.

  • 23-29: The retry package is imported to replace the previous backoff mechanism. Ensure that all dependencies are correctly managed and updated.

  • 87-94: A new field retryConfig of type []retry.WithBackoffConfigurator is added to the Executor struct. This change seems fine as long as it's properly initialized before use.

  • 150-163: The MaxRetrySeconds configuration value is checked and set to a default value if not provided. Then, it's used to create a retryConfig. This is a good practice to ensure that the retry mechanism works even when the configuration is not explicitly provided.

  • 249-254: The retryConfig is passed to the Executor struct during initialization. This is a correct way to pass configurations to structs.

Comment on lines 351 to 366
        return false, nil
    }

-   proof, err := e.chainExecutors[message.OriginDomain()].merkleTree.MerkleProof(*nonce-1, (*state).Nonce())
+   var proof [][]byte
+   contractCall := func(ctx context.Context) error {
+       proof, err = e.chainExecutors[message.OriginDomain()].merkleTree.MerkleProof(*nonce-1, (*state).Nonce())
+       if err != nil {
+           return fmt.Errorf("could not get merkle proof: %w", err)
+       }
+
+       return nil
+   }
+   err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
    if err != nil {
        return false, fmt.Errorf("could not get merkle proof: %w", err)
    }
coderabbitai bot:

The retry.WithBackoff function is used to retry the contractCall function in case of failure. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context ctx passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

Comment on lines 444 to 474

// verifyMessageMerkleProof verifies a message against the merkle tree at the state of the given nonce.
func (e Executor) verifyMessageMerkleProof(message types.Message) (bool, error) {
-   root, err := e.chainExecutors[message.OriginDomain()].merkleTree.Root(message.Nonce())
+   var root []byte
+   contractCall := func(ctx context.Context) error {
+       var err error
+       root, err = e.chainExecutors[message.OriginDomain()].merkleTree.Root(message.Nonce())
+       if err != nil {
+           return fmt.Errorf("could not get root: %w", err)
+       }
+
+       return nil
+   }
+   err := retry.WithBackoff(context.Background(), contractCall, e.retryConfig...)
    if err != nil {
        return false, fmt.Errorf("could not get root: %w", err)
    }

-   proof, err := e.chainExecutors[message.OriginDomain()].merkleTree.MerkleProof(message.Nonce()-1, message.Nonce())
+   var proof [][]byte
+   contractCall = func(ctx context.Context) error {
+       proof, err = e.chainExecutors[message.OriginDomain()].merkleTree.MerkleProof(message.Nonce()-1, message.Nonce())
+       if err != nil {
+           return fmt.Errorf("could not get merkle proof: %w", err)
+       }
+
+       return nil
+   }
+   err = retry.WithBackoff(context.Background(), contractCall, e.retryConfig...)
    if err != nil {
        return false, fmt.Errorf("could not get merkle proof: %w", err)
    }
coderabbitai bot:

The retry.WithBackoff function is used twice to retry fetching the root and merkle proof. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

Comment on lines 561 to 585
        return nil, nil
    }

-   b := &backoff.Backoff{
-       Factor: 2,
-       Jitter: true,
-       Min:    30 * time.Millisecond,
-       Max:    3 * time.Second,
-   }
-
-   timeout := time.Duration(0)
-
-retryLoop:
-   for {
-       select {
-       case <-ctx.Done():
-           return nil, fmt.Errorf("context canceled: %w", ctx.Err())
-       case <-time.After(timeout):
-           if b.Attempt() >= rpcRetry {
-               return nil, fmt.Errorf("could not get latest header: %w", err)
-           }
-
-           latestHeader, err := e.chainExecutors[destinationDomain].rpcClient.HeaderByNumber(ctx, nil)
-           if err != nil {
-               timeout = b.Duration()
-
-               continue
-           }
-
-           currentTime = latestHeader.Time
-
-           break retryLoop
-       }
-   }
+   var currentTime uint64
+   chainCall := func(ctx context.Context) error {
+       var err error
+       latestHeader, err := e.chainExecutors[chainID].rpcClient.HeaderByNumber(ctx, nil)
+       if err != nil {
+           return fmt.Errorf("could not get latest header: %w", err)
+       }
+
+       if latestHeader == nil {
+           return fmt.Errorf("latest header is nil")
+       }
+
+       currentTime = latestHeader.Time
+
+       return nil
+   }
+   err = retry.WithBackoff(ctx, chainCall, e.retryConfig...)
+   if err != nil {
+       return nil, fmt.Errorf("could not get latest header: %w", err)
+   }

    if *messageMinimumTime > currentTime {
coderabbitai bot commented Sep 14, 2023:

The retry.WithBackoff function is used to retry fetching the latest header. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context ctx passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

CryptoMaxPlanck (Author):

The e.retryConfig... does that. If the user does not pass in a set time, we automatically set it to 30 seconds.

coderabbitai bot:

@CryptoMaxPlanck That's a good approach. Having a default timeout value ensures that the retry mechanism doesn't run indefinitely in case of permanent failures, which could lead to resource exhaustion. It also provides flexibility for users to configure the timeout based on their specific needs. Just make sure to document this behavior clearly so users are aware of the default timeout and how they can override it.
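
For readers following the thread, the default being discussed is set in the agent constructors (shown verbatim in the guard and notary hunks later in this conversation), and operators override it by setting MaxRetrySeconds in the agent config. A later commit raises the guard's default to 60 seconds, as the final guard hunk shows.

if cfg.MaxRetrySeconds == 0 {
    cfg.MaxRetrySeconds = 30 // retries give up after 30 seconds unless configured otherwise
}

guard.retryConfig = []retry.WithBackoffConfigurator{
    retry.WithMaxAttemptTime(time.Second * time.Duration(cfg.MaxRetrySeconds)),
}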

Comment on lines 641 to 666
        metrics.EndSpanWithErr(span, err)
    }()

-   b := &backoff.Backoff{
-       Factor: 2,
-       Jitter: true,
-       Min:    30 * time.Millisecond,
-       Max:    3 * time.Second,
-   }
-
-   timeout := time.Duration(0)
-
-   for {
-       select {
-       case <-ctx.Done():
-           return false, fmt.Errorf("context canceled: %w", ctx.Err())
-       case <-time.After(timeout):
-           if b.Attempt() >= rpcRetry {
-               return false, fmt.Errorf("could not get executed status: %w", ctx.Err())
-           }
-
-           executed, err := e.chainExecutors[message.DestinationDomain()].boundDestination.MessageStatus(ctx, message)
-           if err != nil {
-               timeout = b.Duration()
-               span.AddEvent("could not get executed status",
-                   trace.WithAttributes(attribute.String("error", err.Error())),
-                   trace.WithAttributes(attribute.String("timeout", timeout.String())),
-               )
-
-               continue
-           }
-
-           if execTypes.MessageStatusType(executed) == execTypes.Success {
-               span.AddEvent("message executed")
-               return true, nil
-           }
-
-           span.AddEvent("message not executed")
-           return false, nil
-       }
-   }
+   var executed uint8
+   contractCall := func(ctx context.Context) error {
+       var err error
+       executed, err = e.chainExecutors[message.DestinationDomain()].boundDestination.MessageStatus(ctx, message)
+       if err != nil {
+           return fmt.Errorf("could not get executed status: %w", err)
+       }
+
+       return nil
+   }
+   err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
+   if err != nil {
+       return false, fmt.Errorf("could not get executed status: %w", err)
+   }
+
+   if execTypes.MessageStatusType(executed) == execTypes.Success {
+       span.AddEvent("message executed")
+       return true, nil
+   }
+
+   span.AddEvent("message not executed")
+   return false, nil
}
coderabbitai bot:

The retry.WithBackoff function is used to retry fetching the executed status. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context ctx passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

codecov bot commented Sep 15, 2023

Codecov Report

Attention: 82 lines in your changes are missing coverage. Please review.

Comparing base (264e1e9, 50.89321% coverage) to head (ba455e0, 49.98967% coverage).
Report is 49 commits behind head on master.

Additional details and impacted files
@@                 Coverage Diff                 @@
##              master       #1377         +/-   ##
===================================================
- Coverage   50.89321%   49.98967%   -0.90355%     
===================================================
  Files            353         343         -10     
  Lines          24966       24193        -773     
  Branches         277         277                 
===================================================
- Hits           12706       12094        -612     
+ Misses         10989       10912         -77     
+ Partials        1271        1187         -84     
Flag Coverage Δ
agents 45.14404% <56.38298%> (-2.05052%) ⬇️
cctp-relayer ?
promexporter ?

Flags with carried forward coverage won't be shown.

Files Coverage Δ
agents/agents/guard/guard.go 73.92638% <100.00000%> (+1.59933%) ⬆️
agents/config/executor/config.go 44.44444% <ø> (ø)
agents/agents/executor/executor_utils.go 48.42767% <40.00000%> (+5.48288%) ⬆️
agents/agents/notary/notary.go 57.61317% <61.25000%> (+0.09526%) ⬆️
agents/agents/executor/executor.go 56.57492% <43.75000%> (-12.76253%) ⬇️

... and 12 files with indirect coverage changes


coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Commits: files that changed from the base of the PR, between commits 8b70e8f and 9668a8f.
Files selected for processing (1)
  • agents/README.md (1 hunks)
Files skipped from review due to trivial changes (1)
  • agents/README.md

coderabbitai bot left a comment

Review Status

Actionable comments generated: 4

Commits: files that changed from the base of the PR, between commits 9668a8f and a696124.
Files selected for processing (2)
  • agents/agents/executor/executor.go (10 hunks)
  • agents/agents/executor/executor_utils.go (3 hunks)
Additional comments (Suppressed): 6
agents/agents/executor/executor_utils.go (3)
  • 4-16: The github.com/jpillora/backoff package has been replaced with the github.com/synapsecns/sanguine/core/retry package. Ensure that this new package provides the same or better functionality and is compatible with the rest of the codebase.

  • 153-178: The function processSnapshot now uses a retry mechanism with backoff for the contract call to check the validity of the state. This change enhances the reliability of the code by allowing it to handle temporary issues such as network instability or rate limiting from the Ethereum node.

  • 209-225: The function processAttestation now uses a retry mechanism with backoff for the RPC call to fetch the log header. This change enhances the reliability of the code by allowing it to handle temporary issues such as network instability or rate limiting from the Ethereum node.

agents/agents/executor/executor.go (3)
  • 23-29: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [16-27]

The import of the github.com/synapsecns/sanguine/core/retry package is new and replaces the previous github.com/jpillora/backoff package. Ensure that the new retry package provides all the necessary functionality and is compatible with the rest of the codebase.

  • 91-94: A new field retryConfig of type []retry.WithBackoffConfigurator has been added to the Executor struct. This field is used to store the configuration for retrying RPC calls with backoff.

  • 152-158: The MaxRetrySeconds configuration value is checked and if it's zero, it's set to a default value of 30 seconds. Then, this value is used to create a retryConfig which is an array of retry.WithBackoffConfigurator. The retryConfig is used later in the code to configure the retry mechanism for RPC calls. This is a good practice as it allows for flexible configuration of the retry mechanism.

Comment on lines +354 to 363
    contractCall := func(ctx context.Context) error {
        proof, err = e.chainExecutors[message.OriginDomain()].merkleTree.MerkleProof(*nonce-1, (*state).Nonce())
        if err != nil {
            return fmt.Errorf("could not get merkle proof: %w", err)
        }

        return nil
    }
    err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
    if err != nil {
coderabbitai bot:

The retry.WithBackoff function is used to retry the contract call in case of temporary failures. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context ctx passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

Comment on lines +446 to 473
    var root []byte
    contractCall := func(ctx context.Context) error {
        var err error
        root, err = e.chainExecutors[message.OriginDomain()].merkleTree.Root(message.Nonce())
        if err != nil {
            return fmt.Errorf("could not get root: %w", err)
        }

        return nil
    }
    err := retry.WithBackoff(context.Background(), contractCall, e.retryConfig...)
    if err != nil {
        return false, fmt.Errorf("could not get root: %w", err)
    }

-   proof, err := e.chainExecutors[message.OriginDomain()].merkleTree.MerkleProof(message.Nonce()-1, message.Nonce())
+   var proof [][]byte
+   contractCall = func(ctx context.Context) error {
+       proof, err = e.chainExecutors[message.OriginDomain()].merkleTree.MerkleProof(message.Nonce()-1, message.Nonce())
+       if err != nil {
+           return fmt.Errorf("could not get merkle proof: %w", err)
+       }
+
+       return nil
+   }
+   err = retry.WithBackoff(context.Background(), contractCall, e.retryConfig...)
    if err != nil {
        return false, fmt.Errorf("could not get merkle proof: %w", err)
    }
coderabbitai bot:

The retry.WithBackoff function is used twice here to retry two different contract calls. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context context.Background() passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

Comment on lines +564 to 582
+   chainCall := func(ctx context.Context) error {
+       var err error
+       latestHeader, err := e.chainExecutors[chainID].rpcClient.HeaderByNumber(ctx, nil)
+       if err != nil {
+           return fmt.Errorf("could not get latest header: %w", err)
+       }
+
-retryLoop:
-   for {
-       select {
-       case <-ctx.Done():
-           return nil, fmt.Errorf("context canceled: %w", ctx.Err())
-       case <-time.After(timeout):
-           if b.Attempt() >= rpcRetry {
-               return nil, fmt.Errorf("could not get latest header: %w", err)
-           }
-
-           latestHeader, err := e.chainExecutors[destinationDomain].rpcClient.HeaderByNumber(ctx, nil)
-           if err != nil {
-               timeout = b.Duration()
-
-               continue
-           }
-
-           currentTime = latestHeader.Time
-
-           break retryLoop
-       }
-   }
+       if latestHeader == nil {
+           return fmt.Errorf("latest header is nil")
+       }
+
+       currentTime = latestHeader.Time
+
+       return nil
+   }
+   err = retry.WithBackoff(ctx, chainCall, e.retryConfig...)
+   if err != nil {
+       return nil, fmt.Errorf("could not get latest header: %w", err)
+   }
coderabbitai bot:

The retry.WithBackoff function is used to retry fetching the latest header. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context ctx passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

Comment on lines +644 to +656
    contractCall := func(ctx context.Context) error {
        var err error
        executed, err = e.chainExecutors[message.DestinationDomain()].boundDestination.MessageStatus(ctx, message)
        if err != nil {
            return fmt.Errorf("could not get executed status: %w", err)
        }

        return nil
    }
    err = retry.WithBackoff(ctx, contractCall, e.retryConfig...)
    if err != nil {
        return false, fmt.Errorf("could not get executed status: %w", err)
    }

    if execTypes.MessageStatusType(executed) == execTypes.Success {
        span.AddEvent("message executed")
        return true, nil
    }
coderabbitai bot:

The retry.WithBackoff function is used to retry getting the executed status. This is a good practice to handle temporary network issues or service unavailability. However, make sure that the context ctx passed to retry.WithBackoff has a timeout or deadline to prevent infinite retries in case of permanent failures.

@github-actions bot added the size/m label and removed the size/s label on Sep 19, 2023
coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Commits: files that changed from the base of the PR, between commits a696124 and d56b728.
Files selected for processing (2)
  • agents/agents/guard/guard.go (3 hunks)
  • agents/agents/notary/notary.go (11 hunks)
Additional comments (Suppressed): 9
agents/agents/guard/guard.go (1)
  • 287-298: The new hunk introduces a retry mechanism for the contract call in the loadSummitLatestStates function. This is a good improvement as it makes the function more robust against transient errors. However, the error handling has been changed. In the old hunk, if an error occurred during the contract call, the error was logged and the function continued with the next domain. In the new hunk, if an error occurs, the function immediately returns the error and stops processing the remaining domains. This change could potentially cause the function to fail prematurely if there is a problem with one of the domains. Please verify if this change in behavior is intended.
agents/agents/notary/notary.go (8)
  • 7-28: > Note: This review was outside of the patch, but no patch was found that overlapped with it. Original lines [28-27]

The import of the github.com/synapsecns/sanguine/core/retry package is new. This package provides functionality for retrying operations with backoff.

  • 45-45: A new field retryConfig of type []retry.WithBackoffConfigurator has been added to the Notary struct. This field will be used to configure the retry mechanism.

  • 94-100: The MaxRetrySeconds configuration value is being checked. If it's zero, a default value of 30 seconds is set. Then, this value is used to initialize the retryConfig field in the Notary struct.

+	if cfg.MaxRetrySeconds == 0 {
+		cfg.MaxRetrySeconds = 30
+	}
+
+	notary.retryConfig = []retry.WithBackoffConfigurator{
+		retry.WithMaxAttemptTime(time.Second * time.Duration(cfg.MaxRetrySeconds)),
+	}
  • 178-196: The GetLatestNotaryAttestation contract call has been wrapped in a retry mechanism. If the call fails, it will be retried according to the retryConfig.
+	var latestNotaryAttestation types.NotaryAttestation
+	contractCall := func(ctx context.Context) (err error) {
+		latestNotaryAttestation, err = n.summitDomain.Summit().GetLatestNotaryAttestation(ctx, n.bondedSigner)
+		if err != nil {
+			return fmt.Errorf("could not get latest notary attestation: %w", err)
+		}
+
+		return nil
+	}
+	err := retry.WithBackoff(ctx, contractCall, n.retryConfig...)
  • 208-243: The GetAgentRoot contract calls for both the bonding manager and destination light manager have been wrapped in a retry mechanism. If any of these calls fail, they will be retried according to the retryConfig.
+	var bondingManagerAgentRoot [32]byte
+	contractCall := func(ctx context.Context) (err error) {
+		bondingManagerAgentRoot, err = n.summitDomain.BondingManager().GetAgentRoot(ctx)
+		if err != nil {
+			return fmt.Errorf("could not get agent root: %w", err)
+		}
+
+		return nil
+	}
+	err := retry.WithBackoff(ctx, contractCall, n.retryConfig...)
+
+	var destinationLightManagerAgentRoot [32]byte
+	contractCall = func(ctx context.Context) (err error) {
+		destinationLightManagerAgentRoot, err = n.destinationDomain.LightManager().GetAgentRoot(ctx)
+		if err != nil {
+			return fmt.Errorf("could not get agent root: %w", err)
+		}
+
+		return nil
+	}
+	err = retry.WithBackoff(ctx, contractCall, n.retryConfig...)
  • 249-265: The GetAgentStatus contract call has been wrapped in a retry mechanism. If the call fails, it will be retried according to the retryConfig.
+	var agentStatus types.AgentStatus
+	contractCall = func(ctx context.Context) (err error) {
+		agentStatus, err = n.destinationDomain.LightManager().GetAgentStatus(ctx, n.bondedSigner.Address())
+		if err != nil {
+			return fmt.Errorf("could not get agent status: %w", err)
+		}
+
+		return nil
+	}
+	err = retry.WithBackoff(ctx, contractCall, n.retryConfig...)
  • 292-310: The GetAttestationNonce contract call has been wrapped in a retry mechanism. If the call fails, it will be retried according to the retryConfig.
+	var attNonce uint32
+	contractCall := func(ctx context.Context) (err error) {
+		attNonce, err = n.destinationDomain.Destination().GetAttestationNonce(ctx, n.myLatestNotaryAttestation.Attestation().SnapshotRoot())
+		if err != nil {
+			return fmt.Errorf("could not get attestation nonce: %w", err)
+		}
+
+		return nil
+	}
+	err := retry.WithBackoff(ctx, contractCall, n.retryConfig...)
  • 501-535: The GetProof and GetAgentStatus contract calls on the bonding manager have been wrapped in a retry mechanism. If any of these calls fail, they will be retried according to the retryConfig.
+	var agentProof [][32]byte
+	contractCall := func(ctx context.Context) (err error) {
+		agentProof, err = n.summitDomain.BondingManager().Get



Comment on lines 182 to 188
    if cfg.MaxRetrySeconds == 0 {
        cfg.MaxRetrySeconds = 30
    }

    guard.retryConfig = []retry.WithBackoffConfigurator{
        retry.WithMaxAttemptTime(time.Second * time.Duration(cfg.MaxRetrySeconds)),
    }
coderabbitai bot:

The new hunk introduces a default value for MaxRetrySeconds if it is not set in the configuration. This is a good practice to ensure that the retry mechanism works even when the configuration does not explicitly specify a value for MaxRetrySeconds. However, it would be better to move this logic to where the configuration is loaded or validated, rather than having it in the NewGuard function.

-	if cfg.MaxRetrySeconds == 0 {
-		cfg.MaxRetrySeconds = 30
-	}
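
One possible shape for that suggestion: a hypothetical Validate method on the config type, which this PR does not implement; the constructor would then call it before building retryConfig.

// Hypothetical: hoisting the default out of NewGuard into config validation.
func (c *Config) Validate() error {
    if c.MaxRetrySeconds == 0 {
        c.MaxRetrySeconds = 30 // same default the constructor applies today
    }

    return nil
}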

Comment on lines 314 to 339
    // var latestState types.State
    // var err error
    // contractCall := func(ctx context.Context) error {
    //     latestState, err = domain.Origin().SuggestLatestState(ctx)
    //     if err != nil {
    //         latestState = nil
    //         return fmt.Errorf("failed calling SuggestLatestState for originID %d on the Origin contract: %w", originID, err)
    //     }
    //
    //     return nil
    // }
    // err = retry.WithBackoff(ctx, contractCall, g.retryConfig...)
    // if latestState == nil || latestState.Nonce() == uint32(0) {
    //     logger.Errorf("No latest state found for origin id %d", originID)
    //     span.AddEvent("No latest state found for origin id", trace.WithAttributes(
    //         attribute.Int("originID", int(originID)),
    //     ))
    // }
    //
    // if err == nil {
    //     g.originLatestStates[originID] = latestState
    // }

    latestState, err := domain.Origin().SuggestLatestState(ctx)
    if err != nil {
        latestState = nil
coderabbitai bot:

The commented-out code in this hunk suggests that a similar retry mechanism was planned for the loadOriginLatestStates function but was not implemented. If the retry mechanism is beneficial for the loadSummitLatestStates function, it might also be beneficial here. Consider implementing it.
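
For concreteness, enabling the commented-out block above would look roughly like this: a sketch that mirrors the loadSummitLatestStates pattern, where the nil check before reading the nonce is an added assumption.

var latestState types.State
contractCall := func(ctx context.Context) error {
    var err error
    latestState, err = domain.Origin().SuggestLatestState(ctx)
    if err != nil {
        latestState = nil
        return fmt.Errorf("failed calling SuggestLatestState for originID %d on the Origin contract: %w", originID, err)
    }

    return nil
}
err := retry.WithBackoff(ctx, contractCall, g.retryConfig...)
if err == nil && latestState != nil && latestState.Nonce() > uint32(0) {
    g.originLatestStates[originID] = latestState
}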

Comment on lines 139 to 158

    originID := domain.Config().DomainID

    // var guardLatestState types.State
    // contractCall := func(ctx context.Context) (err error) {
    //     guardLatestState, err = n.summitDomain.Summit().GetLatestState(ctx, originID)
    //     if err != nil {
    //         return fmt.Errorf("could not get latest state: %w", err)
    //     }
    //
    //     return nil
    // }
    // err := retry.WithBackoff(ctx, contractCall, n.retryConfig...)
    // if err == nil && guardLatestState.Nonce() > uint32(0) {
    //     n.summitGuardLatestStates[originID] = guardLatestState
    // }

    guardLatestState, err := n.summitDomain.Summit().GetLatestState(ctx, originID)
    if err != nil {
        guardLatestState = nil
coderabbitai bot:

The code for getting the latest state from the Summit has been commented out and replaced with the old code that doesn't use the retry mechanism. This seems like an oversight or unfinished work.

coderabbitai bot left a comment

Review Status

Actionable comments generated: 6

Commits: files that changed from the base of the PR, between commits d56b728 and 33bb895.
Files selected for processing (3)
  • agents/agents/guard/guard.go (3 hunks)
  • agents/agents/notary/notary.go (11 hunks)
  • session.vim (1 hunks)
Files skipped from review due to trivial changes (1)
  • session.vim
Additional comments (Suppressed): 6
agents/agents/guard/guard.go (1)
  • 314-317: The comment suggests that the SuggestLatestState call should be wrapped with a retry mechanism if the Start behavior changes. This indicates that there are plans to modify the Start function, which could potentially impact the behavior of this code. It would be beneficial to clarify what changes are expected in the Start function and how they might necessitate a retry mechanism for the SuggestLatestState call.
agents/agents/notary/notary.go (5)
  • 7-27: The new hunk introduces the github.com/synapsecns/sanguine/core/retry package for implementing a retry mechanism with backoff functionality. Ensure that this package is correctly installed and imported.

  • 42-47: A new field retryConfig of type []retry.WithBackoffConfigurator is added to the Notary struct. This field will be used to configure the retry mechanism.

  • 91-103: The NewNotary function now initializes the retryConfig field based on the MaxRetrySeconds configuration value. If MaxRetrySeconds is zero, it's set to a default value of 30 seconds. This change introduces a retry mechanism with backoff functionality to the notary agent.

  • 132-146: No significant changes are made in this hunk. The comment on line 143 suggests that a retry loop might be added in the future if the current Start behavior deviates.

  • 588-593: No significant changes are made in this hunk. The Start method remains unchanged.

Comment on lines +182 to 188
    if cfg.MaxRetrySeconds == 0 {
        cfg.MaxRetrySeconds = 60
    }

    guard.retryConfig = []retry.WithBackoffConfigurator{
        retry.WithMaxAttemptTime(time.Second * time.Duration(cfg.MaxRetrySeconds)),
    }
coderabbitai bot:

The new hunk introduces a default value for cfg.MaxRetrySeconds if it's zero. This is a good practice to ensure that the retry mechanism works even if the configuration value is not explicitly set. However, it would be better to move this logic to a configuration validation or initialization function to keep the NewGuard function focused on initializing the Guard struct.

-	if cfg.MaxRetrySeconds == 0 {
-		cfg.MaxRetrySeconds = 60
-	}

Comment on lines +287 to 298
+   contractCall := func(ctx context.Context) error {
+       latestState, err = g.domains[g.summitDomainID].Summit().GetLatestAgentState(ctx, originID, g.bondedSigner)
+       if err != nil {
+           return fmt.Errorf("failed calling GetLatestAgentState for originID %d on the Summit contract: err = %w", originID, err)
+       }
+
+       return nil
+   }
-   if latestState != nil && latestState.Nonce() > uint32(0) {
+   err = retry.WithBackoff(ctx, contractCall, g.retryConfig...)
+   if err == nil && latestState.Nonce() > uint32(0) {
        g.summitLatestStates[originID] = latestState
    }
coderabbitai bot:

The new hunk introduces a retry mechanism with backoff functionality for the GetLatestAgentState contract call. This is a good improvement as it makes the code more resilient to transient errors. However, the error handling could be improved. If the retry mechanism exhausts all attempts and still fails, the error is logged but the function continues execution. It might be better to return the error to the caller so that it can decide how to handle the failure.

-	err = retry.WithBackoff(ctx, contractCall, g.retryConfig...)
-	if err == nil && latestState.Nonce() > uint32(0) {
-		g.summitLatestStates[originID] = latestState
-	}
+	if err := retry.WithBackoff(ctx, contractCall, g.retryConfig...); err != nil {
+		return fmt.Errorf("failed to get latest agent state after retries: %w", err)
+	}
+	if latestState.Nonce() > uint32(0) {
+		g.summitLatestStates[originID] = latestState
+	}

Comment on lines 163 to 184
    ))
    defer span.End()

-   latestNotaryAttestation, err := n.summitDomain.Summit().GetLatestNotaryAttestation(ctx, n.bondedSigner)
+   var latestNotaryAttestation types.NotaryAttestation
+   contractCall := func(ctx context.Context) (err error) {
+       latestNotaryAttestation, err = n.summitDomain.Summit().GetLatestNotaryAttestation(ctx, n.bondedSigner)
+       if err != nil {
+           return fmt.Errorf("could not get latest notary attestation: %w", err)
+       }
+
+       return nil
+   }
+   err := retry.WithBackoff(ctx, contractCall, n.retryConfig...)
    if err != nil {
        span.AddEvent("GetLatestNotaryAttestation failed", trace.WithAttributes(
            attribute.String("err", err.Error()),
        ))
    }

    if latestNotaryAttestation != nil {
        if n.myLatestNotaryAttestation == nil ||
            latestNotaryAttestation.Attestation().SnapshotRoot() != n.myLatestNotaryAttestation.Attestation().SnapshotRoot() {
coderabbitai bot:

The loadNotaryLatestAttestation method now uses the retry mechanism when calling the GetLatestNotaryAttestation contract method. If the call fails, it will be retried according to the retryConfig settings.

Comment on lines 193 to 231
        attribute.Int(metrics.ChainID, int(n.destinationDomain.Config().DomainID)),
    ))
    defer span.End()

-   bondingManagerAgentRoot, err := n.summitDomain.BondingManager().GetAgentRoot(ctx)
+   var bondingManagerAgentRoot [32]byte
+   contractCall := func(ctx context.Context) (err error) {
+       bondingManagerAgentRoot, err = n.summitDomain.BondingManager().GetAgentRoot(ctx)
+       if err != nil {
+           return fmt.Errorf("could not get agent root: %w", err)
+       }
+
+       return nil
+   }
+   err := retry.WithBackoff(ctx, contractCall, n.retryConfig...)
    if err != nil {
        span.AddEvent("GetAgentRoot failed on bonding manager", trace.WithAttributes(
            attribute.String("err", err.Error()),
        ))

        return false, false
    }

-   destinationLightManagerAgentRoot, err := n.destinationDomain.LightManager().GetAgentRoot(ctx)
+   var destinationLightManagerAgentRoot [32]byte
+   contractCall = func(ctx context.Context) (err error) {
+       destinationLightManagerAgentRoot, err = n.destinationDomain.LightManager().GetAgentRoot(ctx)
+       if err != nil {
+           return fmt.Errorf("could not get agent root: %w", err)
+       }
+
+       return nil
+   }
+   err = retry.WithBackoff(ctx, contractCall, n.retryConfig...)
    if err != nil {
        span.AddEvent("GetAgentRoot failed on destination light manager", trace.WithAttributes(
            attribute.String("err", err.Error()),
        ))

        return false, false
    }
coderabbitai bot:

The shouldNotaryRegisteredOnDestination method now uses the retry mechanism when calling the GetAgentRoot and GetAgentStatus contract methods. If any of these calls fail, they will be retried according to the retryConfig settings.

Comment on lines 234 to 257
        return false, false
    }

-   agentStatus, err := n.destinationDomain.LightManager().GetAgentStatus(ctx, n.bondedSigner.Address())
+   var agentStatus types.AgentStatus
+   contractCall = func(ctx context.Context) (err error) {
+       agentStatus, err = n.destinationDomain.LightManager().GetAgentStatus(ctx, n.bondedSigner.Address())
+       if err != nil {
+           return fmt.Errorf("could not get agent status: %w", err)
+       }
+
+       return nil
+   }
+   err = retry.WithBackoff(ctx, contractCall, n.retryConfig...)
    if err != nil {
        span.AddEvent("GetAgentStatus failed", trace.WithAttributes(
            attribute.String("err", err.Error()),
        ))

        return false, false
    }

    if agentStatus.Flag() == types.AgentFlagUnknown {
        // Here we want to add the Notary and proceed with sending to destination
        return true, true
coderabbitai bot:

The checkDidSubmitNotaryLatestAttestation method now uses the retry mechanism when calling the GetAttestationNonce contract method. If the call fails, it will be retried according to the retryConfig settings.

Comment on lines 7 to 28
"time"

"github.com/ethereum/go-ethereum/accounts/abi/bind"
"github.com/ethereum/go-ethereum/common"
ethTypes "github.com/ethereum/go-ethereum/core/types"
"github.com/synapsecns/sanguine/agents/agents/notary/db"
"github.com/synapsecns/sanguine/core/metrics"
signerConfig "github.com/synapsecns/sanguine/ethergo/signer/config"
"github.com/synapsecns/sanguine/ethergo/submitter"
omnirpcClient "github.com/synapsecns/sanguine/services/omnirpc/client"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/trace"

"github.com/ethereum/go-ethereum/common"
"github.com/synapsecns/sanguine/agents/config"
"github.com/synapsecns/sanguine/agents/contracts/summit"
"github.com/synapsecns/sanguine/agents/domains"
"github.com/synapsecns/sanguine/agents/domains/evm"
"github.com/synapsecns/sanguine/agents/types"
"github.com/synapsecns/sanguine/core/metrics"
"github.com/synapsecns/sanguine/core/retry"
signerConfig "github.com/synapsecns/sanguine/ethergo/signer/config"
"github.com/synapsecns/sanguine/ethergo/signer/signer"
"github.com/synapsecns/sanguine/ethergo/submitter"
omnirpcClient "github.com/synapsecns/sanguine/services/omnirpc/client"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/trace"
"golang.org/x/sync/errgroup"
)

coderabbitai bot:

Note: This review was outside of the patch, but no patch was found that overlapped with it. Original lines [313-298]

The registerNotaryOnDestination method now uses the retry mechanism when calling the GetProof, GetAgentStatus, and SubmitTransaction contract methods. If any of these calls fail, they will be retried according to the retryConfig settings.

github-actions bot commented Oct 4, 2023

This PR is stale because it has been open 14 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Oct 4, 2023
github-actions bot closed this on Oct 9, 2023
dwasse reopened this on Oct 10, 2023
github-actions bot removed the Stale label on Oct 10, 2023
dwasse merged commit 2ee5234 into master on Oct 11, 2023
38 checks passed
dwasse deleted the feat/retries_for_all_agents branch on October 11, 2023 03:57
Labels: go (Pull requests that update Go code), size/m