Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Debug CLI / test-solana job #573

Closed
wants to merge 11 commits into from
Closed

Conversation

nvsriram
Copy link
Collaborator

@nvsriram nvsriram commented Dec 5, 2024

CLI / test-solana job has been failing sporadically (race condition probably) in recent PRs so this PR attempts to fix it

Context:
In the Solana CLI test, after upgrading to 2.0.0 from 1.0.0, the upgrade command calls nttFromManager which in turn calls getProtocol to fetch the program and version (via getVersion). The default getVersion output on error/ fetching a program not deployed is 3.0.0 (as that is the current version on main). This causes a diff between the actual deployed version and fetched version that ultimately results in the failure.

Fix:

  1. ❌ Revert getVersion default to be 2.0.0. This is not ideal as it does not reflect the current version on main.
  2. ✅ Add a sleep/spinloop after we deploy the program and before we fetch the program to ensure we fetch the version directly from the deployed program rather than depending on the default value returned.

@nvsriram nvsriram marked this pull request as ready for review December 6, 2024 20:02
@nvsriram nvsriram requested a review from kcsongor December 6, 2024 20:02
@@ -472,6 +472,9 @@ yargs(hideBin(process.argv))
argv["binary"]
);

// we sleep here to ensure the program has enough time to be deployed
// and we can fetch the version from the program itself
await new Promise((resolve) => setTimeout(resolve, 2000));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a less arbitrary way to detect this? Can it be known that the program will always take less than 2 seconds to be deployed?

Copy link
Collaborator Author

@nvsriram nvsriram Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that the program will always take less than 2 seconds to be deployed or in this case for the RPC to catch up to the deployed program (I could probably update the comment).
This sleep happens after deploy proc has exited. So this sleep just acts as a buffer for the RPC to query the correct deployed program - similar to the sleep before initialize.

The less arbitrary option I considered was a spinloop inside nttFromManager if an optional expectedVersion parameter is passed in:

async function nttFromManager<N extends Network, C extends Chain>(
    ch: ChainContext<N, C>,
    nativeManagerAddress: string,
    expectedVersion?: string
): Promise<{ ntt: Ntt<N, C>; addresses: Partial<Ntt.Contracts> }> {
    let onlyManager: Ntt<N, C>;
    onlyManager = await ch.getProtocol("Ntt", {
        ...
    });
    while (expectedVersion && getVersion(ch.chain, onlyManager) !== expectedVersion) {
       await new Promise((resolve) => setTimeout(resolve, 2000));
       onlyManager = await ch.getProtocol("Ntt", {
         ...
       });
    }
    ...
}

And it could be called in upgrade:

const { ntt: upgraded } = await nttFromManager(ch, chainConfig.manager, toVersion ?? undefined);

Concerns with this approach:

  1. This CI job fail is only due to the edge case where the CLI release has not caught up to the latest NTT manager version so the complexity added with a spinloop and debugging might be overkill?
  2. Calling getProtocol in a loop might not be the most optimal way to fetch versions
  3. Not sure if the strict equality is the best way to compare versions (in case of semver etc.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my proposal in #574 fixes the underlying problem without the arbitrary wait.

@evan-gray
Copy link
Contributor

closing in favor of #574

@evan-gray evan-gray closed this Dec 12, 2024
@nvsriram nvsriram deleted the ci/fix-cli-test-solana branch December 12, 2024 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants