archive node retries #7707
Can you confirm that
match res with
| Error _ ->
    Conn.rollback () >>| ignore
| Error _ as e ->
Should there be logs for the commit and rollback?
yup, added it
There was a type error, fixed it
let rec go retry_count =
  match%bind f () with
  | Error e ->
      if retry_count <= 0 then return (Error e)
maybe accumulate errors?
In addition to logging it as warnings (the else case)? The error in this line is logged at the call site
probably good enough
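For reference, the "accumulate errors" idea could look roughly like the sketch below. It is not the PR's code: it is synchronous and uses only the standard library, whereas the PR runs in a deferred monad, and the name `retry_collect` and its parameters are hypothetical.

```ocaml
(* Hypothetical sketch: retry [f] up to [retries] extra times and, if every
   attempt fails, return all errors in the order they occurred. *)
let retry_collect ~(retries : int) (f : unit -> ('a, 'e) result) :
    ('a, 'e list) result =
  let rec go retry_count errors =
    match f () with
    | Ok v -> Ok v
    | Error e ->
        (* Remember this failure before deciding whether to try again. *)
        let errors = e :: errors in
        if retry_count <= 0 then Error (List.rev errors)
        else go (retry_count - 1) errors
  in
  go retries []
```

With `~retries:2`, a function that always fails is called three times and the caller receives all three errors, so the call site can log them together instead of only the last one.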
else (
  [%log warn] "Error in %s : $error. Retrying..." error_str
    ~metadata:[("error", `String (Caqti_error.show e))] ;
  go (retry_count - 1) )
do you want to immediately retry without any backoff timeout? Maybe it's worth waiting a few milliseconds? Though I'm not sure what else that will affect. Up to you...
Don't see a problem adding a backoff timeout of a few milliseconds. It should help if there are multiple commit attempts
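A rough sketch of what a retry loop with a backoff delay might look like is below. It is an illustration under assumptions, not the code in this PR: it is synchronous and standard-library only, the caller passes in the `sleep` function, and the names `retry_with_backoff`, `base_delay`, and the doubling policy are all hypothetical.

```ocaml
(* Hypothetical sketch: retry [f] up to [retries] extra times, waiting
   [base_delay] seconds before the first retry and doubling the wait
   before each later one. [sleep] is supplied by the caller. *)
let retry_with_backoff ~(sleep : float -> unit) ~(retries : int)
    ~(base_delay : float) (f : unit -> ('a, 'e) result) : ('a, 'e) result =
  let rec go retry_count delay =
    match f () with
    | Ok _ as ok -> ok
    | Error e ->
        if retry_count <= 0 then Error e
        else (
          (* Back off before the next attempt instead of retrying at once. *)
          sleep delay ;
          go (retry_count - 1) (delay *. 2.0) )
  in
  go retries base_delay

(* Usage: a write that fails twice and then succeeds. The no-op [sleep]
   only counts how many waits happened. *)
let () =
  let attempts = ref 0 and waits = ref 0 in
  let flaky_write () =
    incr attempts ;
    if !attempts < 3 then Error "commit failed" else Ok !attempts
  in
  match
    retry_with_backoff ~sleep:(fun _ -> incr waits) ~retries:5
      ~base_delay:0.005 flaky_write
  with
  | Ok n -> Printf.printf "succeeded on attempt %d after %d waits\n" n !waits
  | Error e -> Printf.printf "gave up: %s\n" e
```

In the archive node itself the wait would presumably be a deferred timeout rather than a blocking sleep, and starting from a few milliseconds matches the suggestion above.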
Agreed and something like below added to archive node's db-bootstrap jobs should work for this:
|
Alternatively, we can update the postgres settings set in terraform with extended configs:
|
!approved-for-mainnet
Added retry logic for database writes to support concurrent database transactions (described here). This also requires setting the transaction isolation level to 'serializable', which can be done in two ways:
1. In the Postgres configuration: default_transaction_isolation = 'serializable'
2. At the database level: ALTER DATABASE <DATABASE NAME> SET DEFAULT_TRANSACTION_ISOLATION TO SERIALIZABLE ;
Either one must be done as a separate step before the archive node starts writing (and, for option 1, before bringing up the database instance). I think setting it at the database level makes it explicit and is therefore the better option, but I'm not quite sure where in the codebase this configuration needs to live.