CI/E2E test improvements #2168

mvds00 · 2024-07-25T01:38:38Z

Bug

CI is broken because of issues in E2E tests. Issues are:

epoch calculation is off in various ways
time.sleep() is used in an asyncio context
wait_for_epoch() was an unused duplicate of wait_for_interval()

This is more of a style / bug waiting to happen thing:

various constants are hardcoded repeatedly ("magic constants")

Description of the Change

epoch calculation is now exact (derived mathematically and verified by comparing epoch block calculations with actual epoch blocks)
wait_for_epoch() now waits for the epoch without the need to specify the tempo
tempo is taken from chain instead of hardcoded value
netuid==1 is now a variable and not a magic constant in several places

The epoch calculation is detailed in the related commit message, with reference to subtensor code.

Alternate Designs

No alternate designs were considered. Looking forward however, the following is considered:

using a different netuid should be possible, for example when creating multiple netuids or when testing whether the tempo logic (which takes netuid as a parameter) is correct.
using a different tempo should be possible, on the one hand to speed up the test, on the other hand to validate that everything still works with a tempo other than 360. Code is locally available to do this via an extrinsic, but I have to find the root key first. Using a different tempo (for example 12 or 36), by hardcoding it in subtensor, resulted in some tests failing, which might hint at possible issues.

Possible Drawbacks

N/A

Verification Process

The E2E CI code was run locally; the errors that were seen before are now gone, and no new errors appeared.

Release Notes

N/A

gus-opentensor · 2024-07-25T14:00:48Z

@mvds00 thank you for the submission - we're taking a look.

mvds00 · 2024-07-25T14:04:22Z

@mvds00 thank you for the submission - we're taking a look.

Seems like only the swap hotkey fails, but that's simply missing in subtensor and I was requested to leave it as-is.

- Log commands as executed, so that CI errors can be pinpointed to their originating command.
- Dropped wait_epoch() as it is not used.
- Improved wait_interval(), explanation below:

For tempo T, the epoch interval is T+1, and the offset depends on the netuid. See subtensor/src/block_step.rs, blocks_until_next_epoch(), with the comment stating:

https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L33
"Networks run their epoch when (block_number + netuid + 1 ) % (tempo + 1) = 0"

This comment from the subtensor code is not correct; the algorithm actually tests:

    (block_number + netuid + 1 ) % (tempo + 1) == tempo

This is because what is tested is whether blocks_until_next_epoch() == 0. Defining:

    A = (block_number + netuid + 1) % (tempo + 1)

we can say that, looking at https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L47:

    blocks_until_next_epoch() = tempo - A

And so it is easy to see that we need A == tempo to run the epoch.

Then, to find the last epoch, calculating mod M = tempo+1:

    (block_number + netuid + 1) % M = tempo % M
    (block_number + netuid + 1) % M = (-1) % M
    (block_number + netuid + 2) % M = 0

So the last epoch is at:

    last_epoch = block_number - 1 - (block_number + netuid + 1) % interval

It is easily seen that this is in the range:

    [block_number-interval, block_number-1]

And so the current block, if it were an epoch block, is not seen as the last epoch. The next epoch is then:

    last_epoch + interval

Which is in the range:

    [block_number, block_number+interval-1]

And so if the current block is an epoch block, wait_for_interval() will return immediately.

It is suspected that CI tests fail because wait_epoch() waits for the wrong block. And if this is not an issue at the present time, it may become an issue later.

Note that difficulty adjustments follow a different schedule.

In the reworked function, blocks passing while waiting are reported after passing 10 blocks, not when exactly hitting block N*10. This ensures that logging is seen, even if the N*10 block passes during time.sleep(1).

TODO: check if time.sleep() should be replaced by asyncio.sleep(), as time.sleep() halts the entire process.
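The derivation in this commit message can be checked with a few lines of Python. This is a minimal sketch, not the code from tests/e2e_tests/utils.py: the function names is_epoch_block, last_epoch_block, next_epoch_block and check_ranges are made up for illustration, and the epoch condition is the one stated above (A == tempo), not one read from a running chain.

```python
def is_epoch_block(block: int, netuid: int, tempo: int) -> bool:
    """Epoch condition as derived above: A == tempo."""
    return (block + netuid + 1) % (tempo + 1) == tempo


def last_epoch_block(block: int, netuid: int, tempo: int) -> int:
    """Most recent epoch block strictly before `block` (may be negative near genesis)."""
    interval = tempo + 1
    return block - 1 - (block + netuid + 1) % interval


def next_epoch_block(block: int, netuid: int, tempo: int) -> int:
    """First epoch block at or after `block`."""
    return last_epoch_block(block, netuid, tempo) + tempo + 1


def check_ranges(netuid: int, tempo: int, blocks: int = 2000) -> None:
    """Brute-force check of the ranges claimed in the commit message."""
    interval = tempo + 1
    for b in range(blocks):
        last = last_epoch_block(b, netuid, tempo)
        nxt = next_epoch_block(b, netuid, tempo)
        assert b - interval <= last <= b - 1
        assert b <= nxt <= b + interval - 1
        assert is_epoch_block(nxt, netuid, tempo)
        # No epoch block is skipped between the current block and the result.
        assert not any(is_epoch_block(x, netuid, tempo) for x in range(b, nxt))


if __name__ == "__main__":
    check_ranges(netuid=1, tempo=360)
    print(next_epoch_block(360, netuid=1, tempo=360))  # 719
```

With netuid 1 and tempo 360, block 358 satisfies the condition, so a call made at block 358 gets 358 back (the "returns immediately" case), while a call made at block 360 gets 719.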

…tuid

- Replace wait_for_interval(360) calls with wait_epoch(), reducing the number of magic constants and preparing for optionally changing the tempo used in tests.
- Automatically determine tempo from on-chain data.
- Make wait functions async, eliminating time.sleep(), which halts the entire process and so blocks other coroutines.
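The difference between time.sleep() and asyncio.sleep() in a wait loop can be sketched as follows. This is an illustration only: FakeChain is a hypothetical stand-in for a chain client (it only mimics a get_current_block() method like the one used in the diff), and wait_until_block is not the function from the PR.

```python
import asyncio


class FakeChain:
    """Hypothetical stand-in for a chain client; advances one block per poll."""

    def __init__(self, start_block: int) -> None:
        self.block = start_block

    def get_current_block(self) -> int:
        self.block += 1
        return self.block


async def wait_until_block(chain, target_block: int, poll_seconds: float = 1.0) -> int:
    """Poll until the chain reaches target_block, yielding to the event loop between polls."""
    current = chain.get_current_block()
    while current < target_block:
        # asyncio.sleep() suspends only this coroutine; time.sleep() here
        # would block the whole event loop and every other coroutine on it.
        await asyncio.sleep(poll_seconds)
        current = chain.get_current_block()
    return current


async def demo():
    chain = FakeChain(start_block=100)
    progress = []

    async def other_work():
        for i in range(3):
            progress.append(i)
            await asyncio.sleep(0)

    # Both coroutines make progress because the wait loop yields control.
    reached, _ = await asyncio.gather(
        wait_until_block(chain, 105, poll_seconds=0.001),
        other_work(),
    )
    return reached, progress


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (105, [0, 1, 2])
```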

roman-opentensor · 2024-07-27T07:20:53Z

@mvds00 thank you for the submission - we're taking a look.

Seems like only the swap hotkey fails, but that's simply missing in subtensor and I was requested to leave it as-is.

Yes, the "swap hotkey" test will still fail due to the subtensor. We are waiting for these changes.
Let's wait for the remaining tests to complete and I would like to approve this PR. Also let's merge this PR to #2155.

thewhaleking · 2024-07-27T18:23:43Z

Also let's merge this PR to #2155.

Why would we do that? This is unrelated. I specifically asked him to create a separate PR because it is unrelated.

thewhaleking · 2024-07-27T18:32:04Z

tests/e2e_tests/utils.py

    current_block = subtensor.get_current_block()
-    next_tempo_block_start = (current_block - (current_block % interval)) + interval
+    last_epoch = current_block - 1 - (current_block + netuid + 1) % interval


Hey man. What's the reason for the netuid here? @ibraheem-opentensor and I were wondering.

It's the algorithm as used in subtensor!

I'm not making this up ;-)
There is extensive explanation in one of the commit messages, quoting a part from there:

For tempo T, the epoch interval is T+1, and the offset depends on the
netuid. See subtensor/src/block_step.rs, blocks_until_next_epoch(), with
the comment stating:

https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L33
"Networks run their epoch when (block_number + netuid + 1 ) % (tempo + 1) = 0"

This comment from the subtensor code is not correct

It only accidentally worked before, and only for the first few epochs before the differences start to matter
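To make the "first few epochs" remark concrete, here is a small comparison of the old target block with the epoch blocks given by the corrected condition. It is a sketch under assumptions: the old code is taken to have been called with interval 360 (as in the wait_for_interval(360) calls mentioned in the commit messages), netuid is 1 and tempo is 360; the helper names are made up for this example.

```python
def old_next_block(current_block: int, interval: int = 360) -> int:
    """Old calculation from the removed line of the diff."""
    return (current_block - (current_block % interval)) + interval


def epoch_blocks(netuid: int, tempo: int, count: int) -> list:
    """First `count` blocks satisfying (block + netuid + 1) % (tempo + 1) == tempo."""
    interval = tempo + 1
    first = (tempo - netuid - 1) % interval
    return [first + k * interval for k in range(count)]


def old_targets(count: int, interval: int = 360) -> list:
    """Successive blocks the old code would wait for, starting from block 0."""
    targets, block = [], 0
    for _ in range(count):
        block = old_next_block(block, interval)
        targets.append(block)
    return targets


drift = [old - true for old, true in zip(old_targets(5), epoch_blocks(1, 360, 5))]
print(drift)  # [2, 1, 0, -1, -2]
```

A positive value means the old wait ended shortly after the epoch had run, which is harmless. From the fourth epoch on the value is negative, meaning the old wait would end before the epoch block was reached.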

Cool. Assumed you had a good reason for it. Just didn't seem immediately apparent. Thanks.

roman-opentensor · 2024-07-27T19:33:05Z

Also let's merge this PR to #2155.

Why would we do that? This is unrelated. I specifically asked him to create a separate PR because it is unrelated.

Yeah, not to 2155, but before. I meant this PR has to be merged first.
@mvds00

gus-opentensor requested a review from a team July 25, 2024 14:02

μ added 6 commits July 27, 2024 07:03

e2e_tests/multistep/test_axon.py: replace magic constant 1 by netuid

f38473b

e2e_tests/multistep/test_dendrite.py: replace magic constant 1 by netuid

ad46a5e

e2e_tests/multistep/test_emissions.py: replace magic constant 1 by ne…

e352389

…tuid

e2e_tests/multistep/test_incentive.py: replace magic constant 1 by ne…

4db0205

…tuid

heeres force-pushed the feature/mvds00/fix-for-broken-ci-relating-to-epoch-calculation-and-asyncio branch from de9ec22 to b40165e July 27, 2024 07:03

thewhaleking approved these changes Jul 27, 2024

thewhaleking requested a review from a team July 27, 2024 18:27

thewhaleking reviewed Jul 27, 2024

thewhaleking merged commit 7895c02 into opentensor:staging Jul 28, 2024
31 of 52 checks passed

ibraheem-opentensor mentioned this pull request Aug 23, 2024

Release 7.4.0 #2255

Merged
