fix: stalled connection in CI #1758
Conversation
# Conflicts:
#   package-lock.json
#   packages/core/package.json
#   packages/sign-client/package.json
Overall looking great! Grasped what you did here and it was definitely a needed addition. I left a few comments on cleanup ideas.
  await publish;
} catch (err) {
  this.logger.debug(`Publishing Payload stalled`);
  this.relayer.events.emit(RELAYER_EVENTS.connection_stalled);
do you need to inspect the error here to tell for sure it's the `connection_stalled` one?
This event is emitted when the publish request fails to resolve within the expiry (currently 10s). In normal circumstances, the expiry should be more than enough for a request to complete. If it doesn't, it is safe to assume the connection stalled. Also, the error here comes from the expiry rejecting the promise, not an actual network error.
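For context, a minimal sketch of the mechanism being described, racing a publish against an expiry timer. `publishPayload`, `PUBLISH_EXPIRY_MS`, and the event name are assumptions for illustration, not the actual sign-client internals:

```typescript
import { EventEmitter } from "events";

// Hypothetical names; the real relayer wires this up differently.
const RELAYER_EVENTS = { connection_stalled: "relayer_connection_stalled" };
const PUBLISH_EXPIRY_MS = 10_000; // "currently 10s" per the comment above

async function publishWithExpiry(
  publishPayload: () => Promise<void>,
  events: EventEmitter,
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const expiry = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("publish expired")), PUBLISH_EXPIRY_MS);
  });
  try {
    // If the relay does not resolve the publish within the expiry, the race rejects.
    await Promise.race([publishPayload(), expiry]);
  } catch (err) {
    // The rejection comes from the expiry, not from a network error,
    // so it is treated as a stalled connection.
    events.emit(RELAYER_EVENTS.connection_stalled);
  } finally {
    if (timer) clearTimeout(timer);
  }
}
```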
@@ -12,7 +12,7 @@ import { describe, it, expect, afterEach } from "vitest";

 const environment = process.env.ENVIRONMENT || "dev";

-const timeout = environment === "prod" ? 610_000 : 70_000;
+const timeout = environment === "prod" ? 610_000 : 30_000;
feel free to link this against the server config https://github.com/WalletConnect/rs-relay/blob/master/terraform/vars/staging.tfvars#L9-L11
Description
Running `sign-client` tests in GHA has been problematic for a while due to socket connection issues such as:
- `socket hang up` <- inability to capture the error, causing the CI run to fail
- stalled connections <- a connection that is seemingly `active` in the SDK but has been closed on the relay with `104 connection reset by peer`

Stalled connections/104 errors happen on the client machine and are raised by the OS when it doesn't recognize an incoming packet. This can be caused by any firewall, router, etc., and since we have no control over GHA runners, a solution is needed to handle such scenarios.

Worth noting that the issues are so far only reproducible during CI runs and not during normal SDK usage or local runs.
Steps to increase test stability
1. Reducing overhead
Added an /integration folder where full e2e test coverage is available without unnecessary duplicate tests and/or redundant pairings. We have 5+ `ping` tests that are testing the same thing. Moved the validation tests out of the integration pipeline as they don't test the relay, just create overhead connections.
2. Better catching of connection errors
Updated the `@walletconnect/jsonrpc-ws-connection` package to handle connection errors, and implemented reconnection on `socket hang up` so we can continue processing instead of failing the run.
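A rough, self-contained sketch of that reconnect-on-error idea, using the `ws` package directly; the retry policy and helper name are illustrative, not the actual `@walletconnect/jsonrpc-ws-connection` implementation:

```typescript
import WebSocket from "ws";

// Illustrative reconnect loop; real code would also restore subscriptions.
function connectWithRetry(url: string, retriesLeft = 3): void {
  const socket = new WebSocket(url);

  socket.on("open", () => {
    // connected: resume processing instead of failing the run
  });

  socket.on("error", (err: Error) => {
    // "socket hang up" / ECONNRESET lands here; swallow it and retry
    // instead of letting it crash the test run.
    socket.removeAllListeners();
    if (retriesLeft > 0) {
      setTimeout(() => connectWithRetry(url, retriesLeft - 1), 1_000);
    }
  });
}
```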
3. Proper disposal of connections after a test is done
Proper closing of connections and removing event listeners/heartbeats after a test is done
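A minimal vitest-style sketch of this kind of cleanup, using a raw WebSocket and an interval as stand-ins for the SDK's transport and heartbeat; the real teardown goes through the sign-client's own APIs:

```typescript
import { afterEach } from "vitest";
import WebSocket from "ws";

let socket: WebSocket | undefined;
let heartbeat: ReturnType<typeof setInterval> | undefined;

afterEach(() => {
  // stop the heartbeat so no timer keeps the worker alive after the test
  if (heartbeat) clearInterval(heartbeat);
  if (socket) {
    // drop listeners before closing so error/close handlers don't re-trigger logic
    socket.removeAllListeners();
    socket.close();
  }
  socket = undefined;
  heartbeat = undefined;
});
```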
4. Handling stalled connections
Added an expiry timeout on subscriptions/payloads requests where, if the request expires, a reconnection is attempted. From the test runs, I'm seeing that a reconnection might be attempted several times before the connection is restored and active again.
The mechanism for stalled requests works by utilizing the queue, where a request is removed from the queue only after successful delivery.
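As a sketch of that queue idea (names here are illustrative, not the actual relayer internals): requests stay queued until delivery is confirmed, so anything in flight when the connection stalls can simply be re-sent after reconnecting.

```typescript
interface QueuedRequest {
  id: number;
  payload: string;
}

// Illustrative publish queue: entries are only removed once delivery is
// acknowledged, so a reconnect can flush whatever is still queued.
class PublishQueue {
  private queue = new Map<number, QueuedRequest>();

  add(request: QueuedRequest): void {
    this.queue.set(request.id, request);
  }

  // called only after the relay has confirmed delivery
  onDelivered(id: number): void {
    this.queue.delete(id);
  }

  // called after a reconnect: everything still queued was never delivered
  async resend(send: (payload: string) => Promise<void>): Promise<void> {
    for (const request of this.queue.values()) {
      await send(request.payload);
    }
  }
}
```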
How Has This Been Tested?
npm run test
tons of CI runs
canary has been running this code for ~2 weeks
Due Diligence