Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: stalled connection in CI #1758

Merged
merged 139 commits into from
Dec 20, 2022
Merged

fix: stalled connection in CI #1758

merged 139 commits into from
Dec 20, 2022

Conversation

ganchoradkov
Copy link
Member

@ganchoradkov ganchoradkov commented Dec 14, 2022

Description

Running sign-client tests in GHA has been problematic for a while due to socket connections issues such as

  • socket hang up <- inability to capture the error causing the CI run to fail
  • stalled connections <- A connection that is seemingly active in the SDK but on the relay, it has been closed with 104 connection reset by peer

Stalled connections/104 errors are happening on the client machine and are raised by the OS when it doesn't recognize an incoming packet. This can happen by any firewall/router etc and since we have no control over GHA runners a solutions is needed to handle such scenarios

Worth noting that the issues are so far only reproducible during CI runs and not during normal SDK usage or local runs

Steps to increase tests stability

1. Reducing overhead

Added /integration folder where full e2e test coverage is available without unnecessary duplicate tests and/or redundant pairings. We have 5+ ping tests that are testing the same thing. Moved the validation tests out of the integration pipeline as they don't test the relay, just create overhead connections

2. Better catching connection errors

Updated the @walletconnect/jsonrpc-ws-connection package to handle connection errors, also implemented reconnection on socket hang up so we can continue processing instead of failing the run

3. Proper disposal of connections after a test is done

Proper closing of connections and removing event listeners/heartbeats after a test is done

4. Handling stalled connections

Added expiry timeout on subscrptions/payloads requests where if the request expires a reconnection is attempted. From the test runs, I'm seeing that a reconnection might attempted several times before it is restored and active again.
The mechanism for stalled requests works by utilizing the queue where the request is removed from the queue only after successful delivery

How Has This Been Tested?

npm run test
tons of CI runs
canary has been running this code for ~2 weeks

Due Dilligence

  • Breaking change
  • Requires a documentation update
  • Requires a e2e/integration test update

@arein arein added the accepted label Dec 14, 2022
@ganchoradkov ganchoradkov requested a review from bkrem December 14, 2022 08:18
@ganchoradkov ganchoradkov self-assigned this Dec 14, 2022
@ganchoradkov ganchoradkov marked this pull request as ready for review December 14, 2022 08:18
@ganchoradkov ganchoradkov requested a review from arein December 14, 2022 08:19
Copy link
Contributor

@xzilja xzilja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking great! Grasped what you did here and it was definitely a needed addition. I left few comments on cleanup ideas.

packages/core/src/controllers/relayer.ts Outdated Show resolved Hide resolved
packages/core/src/controllers/relayer.ts Show resolved Hide resolved
packages/core/src/controllers/subscriber.ts Show resolved Hide resolved
packages/core/src/controllers/subscriber.ts Show resolved Hide resolved
packages/core/src/controllers/subscriber.ts Show resolved Hide resolved
packages/sign-client/test/sdk/client.spec.ts Outdated Show resolved Hide resolved
await publish;
} catch (err) {
this.logger.debug(`Publishing Payload stalled`);
this.relayer.events.emit(RELAYER_EVENTS.connection_stalled);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need to inspect the error here to tell for sure it's the connection_stalled one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This event is emitted when the publish request fails to resolve within the expiry (currently 10s). In normal circumstances, the expiry should be more than enough for a request to complete. If it doesn't, it is safe to assume the connection stalled. Also, the error here comes from the expiry rejecting the promise, not an actual network error

@@ -12,7 +12,7 @@ import { describe, it, expect, afterEach } from "vitest";

const environment = process.env.ENVIRONMENT || "dev";

const timeout = environment === "prod" ? 610_000 : 70_000;
const timeout = environment === "prod" ? 610_000 : 30_000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ganchoradkov ganchoradkov requested a review from xzilja December 16, 2022 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants