Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI/E2E test improvements #2168

Conversation

mvds00
Copy link

@mvds00 mvds00 commented Jul 25, 2024

Bug

CI is broken because of issues in E2E tests. Issues are:

  • epoch calculation is off in various ways
  • time.sleep() is used in an asyncio context
  • wait_for_epoch() was an unused duplicate of wait_for_interval()

This is more of a style / bug waiting to happen thing:

  • various constants are hardcoded repeatedly ("magic constants")

Description of the Change

  • epoch calculation is now exact (derived mathematically and verified by comparing epoch block calculations with actual epoch blocks)
  • wait_for_epoch() now waits for the epoch without the need to specify the tempo
  • tempo is taken from chain instead of hardcoded value
  • netuid==1 is now a variable and not a magic constant at several places

The epoch calculation is detailed in the related commit message, with reference to subtensor code.

Alternate Designs

No alternate designs were considered. Looking forward however, the following is considered:

  • using a different netuid should be possible, for example when creating multiple netuids or when testing whether the tempo logic (which takes netuid as a parameter) is correct.
  • using a different tempo should be possible, one the one hand to speed up the test, on the other hand to validate all still works with a different tempo than 360. Code is locally available to do this via an extrinsic, but I have to find the root key first. Using a different tempo (for example 12 or 36), by hardcoding it in subtensor, resulted in some tests failing, which might hint at possible issues.

Possible Drawbacks

N/A

Verification Process

The E2E CI code was run locally and showed that the errors that were seen before, are now gone. No new errors appeared.

Release Notes

N/A

@gus-opentensor
Copy link
Collaborator

@mvds00 thank you for the submission - we're taking a look.

@gus-opentensor gus-opentensor requested a review from a team July 25, 2024 14:02
@mvds00
Copy link
Author

mvds00 commented Jul 25, 2024

@mvds00 thank you for the submission - we're taking a look.

Seems like only the swap hotkey fails, but that's simply missing in subtensor and I was requested to leave it as-is.

μ added 6 commits July 27, 2024 07:03
- Log commands as executed, so that CI errors can be pinpointed to their
  originating command.
- Dropped wait_epoch() as it is not used.
- Improved wait_interval(), explanation below:

For tempo T, the epoch interval is T+1, and the offset depends on the
netuid. See subtensor/src/block_step.rs, blocks_until_next_epoch(), with
the comment stating:

https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L33
"Networks run their epoch when (block_number + netuid + 1 ) % (tempo + 1) = 0"

This comment from the subtensor code is not correct, the algorithm
actually tests:

(block_number + netuid + 1 ) % (tempo + 1) == tempo

This is because what is tested, is whether

    blocks_until_next_epoch() == 0

and defining:

    A = (block_number + netuid + 1)%(tempo + 1)

we can say that, looking at https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L47:

    blocks_until_next_epoch() = tempo - A

And so it is easy to see that we need A == tempo to run the epoch.

Then, to find the last epoch, calculating mod M = tempo+1:

    (block_number + netuid + 1)%M = tempo%M
    (block_number + netuid + 1)%M = (-1)%M
    (block_number + netuid + 1)%M + 1 = 0

So the last epoch is at:

    last_epoch = block_number - 1 - (block_number + netuid + 1) % interval

It is easily seen that this is in the range:

    [block_number-interval, block_number-1]

And so the current block, if it were epoch, is not seen as the last
epoch.

The next epoch is then:

    last_epoch + interval

Which is in the range:

    [block_number, block_number+interval-1]

And so if the current block is epoch, wait_for_interval() will return
immediately.

It is suspected that CI tests fail because the wait_epoch() waits for
the wrong block. And if this is not an issue at the present time, it may
become an issue later.

Note that difficulty adjustments follow a different schedule.

In the reworked function, blocks passing while waiting are reported
after passing 10 blocks, not when exactly hitting block N*10. This
ensures that logging is seen, even if the N*10 block passes during
time.sleep(1).

TODO: check if time.sleep() should be replaced by asyncio.sleep(),
as time.sleep() halts the entire process.
- Replace wait_for_interval(360) calls with wait_epoch(), reducing the
  number of magic constants and preparing for optionally changing the
  tempo used in tests.
- Automatically determine tempo from on-chain data.
- Make wait functions async, eliminating time.sleep() which basically
  halts everything, which is not beneficial to other coroutines.
@heeres heeres force-pushed the feature/mvds00/fix-for-broken-ci-relating-to-epoch-calculation-and-asyncio branch from de9ec22 to b40165e Compare July 27, 2024 07:03
@roman-opentensor
Copy link
Contributor

@mvds00 thank you for the submission - we're taking a look.

Seems like only the swap hotkey fails, but that's simply missing in subtensor and I was requested to leave it as-is.

Yes, the "swap hotkey" test will still fail due to the subtensor. We are waiting for these changes.
Let's wait for the remaining tests to complete and I would like to approve this PR. Also let's merge this PR to #2155.

@thewhaleking
Copy link
Contributor

Also let's merge this PR to #2155.

Why would we do that? This is unrelated. I specifically asked him to create a separate PR because it is unrelated.

@thewhaleking thewhaleking requested a review from a team July 27, 2024 18:27
current_block = subtensor.get_current_block()
next_tempo_block_start = (current_block - (current_block % interval)) + interval
last_epoch = current_block - 1 - (current_block + netuid + 1) % interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey man. What's the reason for the netuid here? @ibraheem-opentensor and I were wondering.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the algorithm as used in subtensor!

Copy link
Author

@mvds00 mvds00 Jul 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not making this up ;-)
There is extensive explanation in one of the commit messages, quoting a part from there:

For tempo T, the epoch interval is T+1, and the offset depends on the
netuid. See subtensor/src/block_step.rs, blocks_until_next_epoch(), with
the comment stating:

https://github.com/opentensor/subtensor/blob/1332d077ea73bc7bf40f551c7f1adea3370df2bd/pallets/subtensor/src/block_step.rs#L33
"Networks run their epoch when (block_number + netuid + 1 ) % (tempo + 1) = 0"

This comment from the subtensor code is not correct

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only accidentally worked before, and only for the first few epochs before the differences start to matter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Assumed you had a good reason for it. Just didn't seem immediately apparent. Thanks.

@roman-opentensor
Copy link
Contributor

Also let's merge this PR to #2155.

Why would we do that? This is unrelated. I specifically asked him to create a separate PR because it is unrelated.

Yeah, not to 2155, but before. I meant this PR has to be merged first.
@mvds00

@thewhaleking thewhaleking merged commit 7895c02 into opentensor:staging Jul 28, 2024
31 of 52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants