Move QPY tests to GitHub Actions and increase inter-symengine tests #13273

Merged
merged 1 commit into from
Oct 29, 2024

Conversation

jakelishman
Member

Summary

This commit has two major goals:

  • fix the caching of the QPY files for both the main and stable/* branches

  • increase the number of compatibility tests between the different symengine versions that might be involved in the generation and loading of the QPY files.

Achieving both of these goals also means that it is sensible to move the job to GitHub Actions at the same time, since the extra tests would otherwise put more pressure on the Azure machine concurrency we use.

Caching

The previous QPY tests attempted to cache the generated files for each historical version of Qiskit, but this was unreliable. The cache never seemed to hit on backport branches, which was a huge slowdown in the critical path to getting releases out. The cache restore keys were also a bit lax, meaning that we might accidentally have invalidated files in the cache by changing what we wanted to test, but the restore keys wouldn't have changed.
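To illustrate how lax restore keys can hand back stale files, here is a minimal sketch of prefix-based fallback matching. It is a toy model, not the behaviour of any particular cache service: the `restore` helper, the key names, and the use of insertion order as a stand-in for recency are all assumptions made for illustration.

```python
from typing import Optional


def restore(cache: dict, key: str, restore_keys: list) -> Optional[str]:
    """Return the value for an exact key match, else the newest prefix match."""
    if key in cache:
        return cache[key]
    for prefix in restore_keys:
        # Insertion order stands in for "most recently created" in this toy model.
        for stored_key in reversed(list(cache)):
            if stored_key.startswith(prefix):
                return cache[stored_key]
    return None


# Hypothetical keys: the hash suffix changed because the test set changed.
cache = {"qpy-v1-aaaa": "files generated for the old test set"}

# A lax restore key still matches the old entry, so stale files come back.
lax = restore(cache, "qpy-v1-bbbb", restore_keys=["qpy-v1-"])

# Without a fallback prefix, the changed key is a clean miss.
strict = restore(cache, "qpy-v1-bbbb", restore_keys=[])

print(lax, strict)
```

In this model the strict lookup forces a regeneration whenever the key changes, which is the safer trade-off when stale files could hide a test change.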

The cache files would fail to restore as a side-effect of ed79d42 (gh-11526); QPY was moved to be on the tail end of the lint run, rather than in a test run. This meant that it was no longer run as part of the push event when updating main or one of the stable/* branches. In Azure (and GitHub Actions), the "cache" action accesses a scoped cache, not a universal one for the repository [1][2]. Approximately, base branches each have their own scope, and PR events open a new scope that is a child of the target branch, the default branch, and the source branch, if appropriate. A cache task can read from any of its parent scopes, but write events go to the most local scope. This means that we haven't been writing to long-standing caches for some time now. PRs would typically miss the cache on the first attempt, hit their cache for updates, then miss again once entering the merge queue.

The fix for this is to run the QPY job on branch-update events as well. The post-job cache action will then write out to a reachable cache for all following events.
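The scoping rules described above can be sketched as a toy model. This is only an approximation under stated assumptions: the `ScopedCache` class, the `run_job` helper, the scope names, and the cache key are all hypothetical, and the real services have more detailed rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ScopedCache:
    """Toy model of a branch-scoped CI cache: one key/value store per scope."""

    stores: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def restore(self, key: str, readable_scopes: List[str]) -> Optional[str]:
        # A job may read from its own scope and any parent scope, nearest first.
        for scope in readable_scopes:
            if key in self.stores.get(scope, {}):
                return self.stores[scope][key]
        return None

    def save(self, key: str, value: str, local_scope: str) -> None:
        # Writes always land in the most local scope only.
        self.stores.setdefault(local_scope, {})[key] = value


def run_job(cache: ScopedCache, key: str, readable_scopes: List[str]) -> bool:
    """Run one job; return True on a cache hit, and save locally on a miss."""
    hit = cache.restore(key, readable_scopes) is not None
    if not hit:
        cache.save(key, "generated QPY files", readable_scopes[0])
    return hit


KEY = "qpy-abc123"  # hypothetical cache key

# Before the fix: the job only runs on PR and merge-queue events.
before = ScopedCache()
pr_first = run_job(before, KEY, ["pr-1", "main"])  # nothing cached anywhere
pr_update = run_job(before, KEY, ["pr-1", "main"])  # found in the PR's own scope
merge_queue = run_job(before, KEY, ["merge-queue-1", "main"])  # new scope, empty parent

# After the fix: a push event on the base branch populates its scope first.
after = ScopedCache()
run_job(after, KEY, ["main"])
new_pr = run_job(after, KEY, ["pr-2", "main"])  # found in the parent scope

print(pr_first, pr_update, merge_queue, new_pr)
```

The "before" sequence reproduces the miss, hit, miss pattern, and the "after" sequence shows a brand-new PR hitting the cache written by the base-branch run.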

Cross-symengine tests

We previously were just running a single test with differing versions of symengine between the loading and generation of the QPY files. This refactors the QPY run_tests.sh script to run a full pairwise matrix of compatibility tests, to increase the coverage.
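As a rough illustration of what a full pairwise matrix means here, the sketch below enumerates every (generate, load) combination for a list of symengine versions. The version strings are hypothetical placeholders, the real list lives in `run_tests.sh`, and including same-version pairs is an assumption.

```python
from itertools import product

# Hypothetical symengine version specifiers, chosen only for illustration.
SYMENGINE_VERSIONS = ["0.11", "0.13"]


def compatibility_matrix(versions):
    """Return every (generate_with, load_with) pair, including same-version pairs."""
    return list(product(versions, repeat=2))


pairs = compatibility_matrix(SYMENGINE_VERSIONS)
for generate_with, load_with in pairs:
    print(f"generate with symengine {generate_with}, load with symengine {load_with}")
```

With two versions this gives four combinations, compared with the single cross-version test that was run before.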

Details and comments

This is a CI change, so the chance I got it right first time is approximately zero. Just need a PR to start testing it. We'll need to update the branch-protection rules if we decide to merge this.

Footnotes

  1. https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache

  2. https://learn.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops#cache-isolation-and-security

@jakelishman jakelishman added type: qa Issues and PRs that relate to testing and code quality stable backport potential The bug might be minimal and/or import enough to be port to stable Changelog: None Do not include in changelog labels Oct 3, 2024
@jakelishman jakelishman requested a review from a team as a code owner October 3, 2024 19:13
@qiskit-bot
Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core
  • @mtreinish
  • @nkanazawa1989

@coveralls
coveralls commented Oct 3, 2024

Pull Request Test Coverage Report for Build 11574466833

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 19 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.02%) to 88.667%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| crates/qasm2/src/expr.rs | 1 | 94.02% |
| crates/qasm2/src/lex.rs | 6 | 91.73% |
| crates/qasm2/src/parse.rs | 12 | 97.15% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11572090796 | -0.02% |
| Covered Lines | 74982 |
| Relevant Lines | 84566 |

💛 - Coveralls

@jakelishman
Member Author

The cache size appears to be 200kB, which is slightly questionable: that's almost exactly the size of a single set of QPY files. That said, the cache action is compressing them, and there are a few places in the QPY files that will include some random bytes between versions because of randomly generated Parameter UUID instances, so maybe it does all add up right.

@jakelishman jakelishman changed the title [WIP] Move QPY tests to GitHub Actions and increase inter-symengine tests Move QPY tests to GitHub Actions and increase inter-symengine tests Oct 4, 2024
@jakelishman
Member Author

I tested the caching for new PRs on my fork, and verified that it correctly restores the cache even for the initial "open a new PR" event, if the PR doesn't modify the QPY-compatibility test directory.

Building a single dev-version wheel at the top of the file means that the QPY backwards-compatibility job now takes only five minutes on a cache hit (e.g. https://github.com/jakelishman/qiskit-terra/actions/runs/11178693927/job/31076786343), despite now building an extra venv and running three more compatibility tests. It also runs in parallel to all other jobs, whereas previously it was often on the critical path. It takes about 15 minutes on a complete cache miss (the same as Azure, give or take), but the cache misses should now be much rarer, which should help a lot for the throughput of backports and releases.

This is ready for review.

Member

@mtreinish mtreinish left a comment

Overall this LGTM. I like the changes to the run_tests.sh file so that we only build the wheel once and reuse it for all the venvs. In the original version I reused the dev venv to avoid building a second venv for symengine testing with the same version, but building the wheel once minimizes the overhead and makes everything more explicit, which is nice.

I just had one inline question about the hashing key and when the hashing is evaluated.

Comment on lines +32 to +35
# The hashing in this key can be too eager to invalidate the cache,
# but since we risk the QPY tests failing to update if they're not in
# sync, it's better safe than sorry.
key: qpy-${{ hashFiles('test/qpy_compat/**') }}
Member

Will this hash only on the files in the checkout from git, or will it include all the qpy files we generate during the run? I can't remember if the hashing is only performed at the start of the job or the end of the job.

Member Author

If I got this right (I think I did), the hash is calculated only once on the initial git checkout (so doesn't include the QPY files), but when pushing back to the cache, it will include the generated files, so subsequent lookups will retrieve them. It's important that the QPY files aren't part of the hash key because they're not fully deterministic - they include randomly generated UUID payloads in some of the parameters.
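The timing argument can be sketched with a stand-in hash function. This is not GitHub's actual `hashFiles` algorithm: `directory_key` is a hypothetical helper, and the file names are placeholders. It only shows why a key computed at checkout stays stable while a key that included generated files with random UUID payloads would not.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path


def directory_key(root: Path, prefix: str = "qpy-") -> str:
    """Hash every file under root (relative path and contents) into one key."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return prefix + digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "run_tests.sh").write_text("#!/bin/sh\n")  # stands in for tracked files

    # Computed straight after checkout: only tracked files exist, so it repeats.
    key_at_checkout = directory_key(root)
    key_recomputed = directory_key(root)

    # The test run then writes a file containing a random UUID payload.
    (root / "generated.qpy").write_bytes(uuid.uuid4().bytes)
    key_with_generated = directory_key(root)

    # Regenerating the file changes its bytes, and therefore the hash.
    (root / "generated.qpy").write_bytes(uuid.uuid4().bytes)
    key_regenerated = directory_key(root)

print(key_at_checkout == key_recomputed, key_with_generated == key_regenerated)
```

If the key were evaluated after generation, every run would produce a different key and the cache would never hit, which is why the evaluation point matters.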

Member

Ok, yeah that's why I was asking, especially because the docs aren't clear: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#hashfiles. We can keep an eye on it post-merge and adjust if it's a problem. The only alternative I can think of is manually listing out the files in the tree that we want tied to the cache.

Member Author

Back before I opened this PR I checked it in a bunch of configurations on my fork, and the caching all seemed to be working the way I expected/hoped.

@jakelishman
Member Author

Yeah, the new form will almost certainly duplicate one of the venvs we use (depends on the array of symengine specifiers we choose), but once we've got the Qiskit wheel pre-built, it ends up taking very little time to build, and it was easier to write and edit the loops that way.

@jakelishman
Member Author

Rebased over main so the new Python 3.13 tests will run properly.

Member

@mtreinish mtreinish left a comment

LGTM, thanks for doing this; it should improve CI throughput quite a bit.

@mtreinish mtreinish added this pull request to the merge queue Oct 29, 2024
Merged via the queue into Qiskit:main with commit af8be25 Oct 29, 2024
17 checks passed
@jakelishman jakelishman deleted the qpy-cache branch October 29, 2024 22:15
mergify bot pushed a commit that referenced this pull request Oct 29, 2024
…13273)

(cherry picked from commit af8be25)
github-merge-queue bot pushed a commit that referenced this pull request Oct 30, 2024
…13273) (#13380)

(cherry picked from commit af8be25)

Co-authored-by: Jake Lishman <[email protected]>