-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI/E2E test improvements #2168
CI/E2E test improvements #2168
Conversation
@mvds00 thank you for the submission - we're taking a look. |
Seems like only the swap hotkey fails, but that's simply missing in subtensor and I was requested to leave it as-is. |
- Log commands as executed, so that CI errors can be pinpointed to their originating command. - Dropped wait_epoch() as it is not used. - Improved wait_interval(), explanation below: For tempo T, the epoch interval is T+1, and the offset depends on the netuid. See subtensor/src/block_step.rs, blocks_until_next_epoch(), with the comment stating: https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L33 "Networks run their epoch when (block_number + netuid + 1 ) % (tempo + 1) = 0" This comment from the subtensor code is not correct, the algorithm actually tests: (block_number + netuid + 1 ) % (tempo + 1) == tempo This is because what is tested, is whether blocks_until_next_epoch() == 0 and defining: A = (block_number + netuid + 1)%(tempo + 1) we can say that, looking at https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L47: blocks_until_next_epoch() = tempo - A And so it is easy to see that we need A == tempo to run the epoch. Then, to find the last epoch, calculating mod M = tempo+1: (block_number + netuid + 1)%M = tempo%M (block_number + netuid + 1)%M = (-1)%M (block_number + netuid + 1)%M + 1 = 0 So the last epoch is at: last_epoch = block_number - 1 - (block_number + netuid + 1) % interval It is easily seen that this is in the range: [block_number-interval, block_number-1] And so the current block, if it were epoch, is not seen as the last epoch. The next epoch is then: last_epoch + interval Which is in the range: [block_number, block_number+interval-1] And so if the current block is epoch, wait_for_interval() will return immediately. It is suspected that CI tests fail because the wait_epoch() waits for the wrong block. And if this is not an issue at the present time, it may become an issue later. Note that difficulty adjustments follow a different schedule. In the reworked function, blocks passing while waiting are reported after passing 10 blocks, not when exactly hitting block N*10. This ensures that logging is seen, even if the N*10 block passes during time.sleep(1). TODO: check if time.sleep() should be replaced by asyncio.sleep(), as time.sleep() halts the entire process.
- Replace wait_for_interval(360) calls with wait_epoch(), reducing the number of magic constants and preparing for optionally changing the tempo used in tests. - Automatically determine tempo from on-chain data. - Make wait functions async, eliminating time.sleep() which basically halts everything, which is not beneficial to other coroutines.
de9ec22
to
b40165e
Compare
Yes, the "swap hotkey" test will still fail due to the subtensor. We are waiting for these changes. |
Why would we do that? This is unrelated. I specifically asked him to create a separate PR because it is unrelated. |
current_block = subtensor.get_current_block() | ||
next_tempo_block_start = (current_block - (current_block % interval)) + interval | ||
last_epoch = current_block - 1 - (current_block + netuid + 1) % interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey man. What's the reason for the netuid here? @ibraheem-opentensor and I were wondering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the algorithm as used in subtensor!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not making this up ;-)
There is extensive explanation in one of the commit messages, quoting a part from there:
For tempo T, the epoch interval is T+1, and the offset depends on the
netuid. See subtensor/src/block_step.rs, blocks_until_next_epoch(), with
the comment stating:
https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L33
"Networks run their epoch when (block_number + netuid + 1 ) % (tempo + 1) = 0"
This comment from the subtensor code is not correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only accidentally worked before, and only for the first few epochs before the differences start to matter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Assumed you had a good reason for it. Just didn't seem immediately apparent. Thanks.
Bug
CI is broken because of issues in E2E tests. Issues are:
This is more of a style / bug waiting to happen thing:
Description of the Change
The epoch calculation is detailed in the related commit message, with reference to subtensor code.
Alternate Designs
No alternate designs were considered. Looking forward however, the following is considered:
Possible Drawbacks
N/A
Verification Process
The E2E CI code was run locally and showed that the errors that were seen before, are now gone. No new errors appeared.
Release Notes
N/A