Test CI changes #15201

Closed
wants to merge 6 commits into from

Conversation

@vyasr (Contributor) commented Mar 1, 2024

Description

This PR makes a number of changes in the hopes of alleviating the heavy load that we've been frequently observing in our CI queues of late. The changes are based on the following considerations and on observations from PR #15194 (the most recent PR with passing CI), referred to as the "other PR" below:

  • Builds are cheaper than tests because they don't require GPU nodes, but they're still not completely free.
  • Every PR still needs to build the full matrix to produce usable artifacts (we could reconsider this, but that was a more drastic step than I felt was necessary in this PR).
  • Both wheels and devcontainers are much faster to build than conda packages due to the substantial overhead of conda-build. devcontainer builds are slower than wheel builds, but they provide more complete coverage (devcontainers build C++ tests as well as all Python packages).
    • For PRs that don't change C++ code, devcontainer builds should effectively be fully cached and complete very fast (about 10 mins on the first run of this PR). For PRs that break builds, devcontainers will show that just as quickly as other builds. The main downside is in cases where builds are not cached but also succeed, in which case devcontainer builds can take significantly longer (~50 mins on the other PR) and will prevent other build jobs from starting.
  • Wheel tests are much faster than conda tests because of various conda overheads.
    • In the other PR the fastest full wheel test run on x86 took ~15 minutes, while the fastest conda test run took ~30.
  • There is currently significant overlap between wheel and conda tests, but we rarely see failures in one without seeing them in the other unless they are due to something like a conda packaging/solver issue.
  • Doc changes are infrequent, but doc builds are expensive because they require GPU nodes.
  • cudf_kafka and custreamz are essentially static.

Based on those observations, and in conjunction with rapidsai/shared-workflows#184 (used in this PR), this PR makes the following changes:

  • devcontainer builds now require style checks.
  • devcontainer builds now gate all other builds.
  • Wheel tests now gate conda Python tests.
  • Reduce CI runs (shared-workflows#184) also reduces the number of conda tests in PRs (nightlies are unchanged).
  • The conda tests of dask_cudf, cudf_kafka, and custreamz remain independent of the conda tests of cudf, but they are now gated by the dask_cudf wheel tests.
  • doc builds now require tests to pass.
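Expressed as a GitHub Actions dependency graph, the proposed ordering could look roughly like the sketch below. This is illustrative only: the job names, runner labels, and steps are hypothetical placeholders, not the actual cudf pr.yaml.

```yaml
# Hypothetical sketch of the proposed gating via `needs:`.
# Job names and runner labels are illustrative, not the real cudf workflow.
jobs:
  checks:                       # style checks run first
    runs-on: ubuntu-latest
    steps:
      - run: echo "run style checks"
  devcontainer:
    needs: checks               # devcontainer builds require style checks
    runs-on: ubuntu-latest
    steps:
      - run: echo "build devcontainers (C++ tests + all Python packages)"
  conda-cpp-build:
    needs: devcontainer         # devcontainer builds gate all other builds
    runs-on: ubuntu-latest
    steps:
      - run: echo "conda-build libcudf"
  wheel-build-cudf:
    needs: devcontainer
    runs-on: ubuntu-latest
    steps:
      - run: echo "build cudf wheels"
  wheel-tests-cudf:
    needs: wheel-build-cudf
    runs-on: linux-gpu          # hypothetical GPU runner label
    steps:
      - run: echo "run wheel tests"
  conda-python-tests:
    needs: wheel-tests-cudf     # wheel tests gate conda Python tests
    runs-on: linux-gpu
    steps:
      - run: echo "run conda Python tests"
  docs-build:
    needs: conda-python-tests   # doc builds require tests to pass
    runs-on: linux-gpu
    steps:
      - run: echo "build docs"
```

The key point of the sketch is the `needs:` chain: a failure early in the graph (style checks, devcontainer build) prevents the GPU-node jobs further down from ever starting.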

The net result: running the full CI workflow successfully for a single PR is now slower, but CI failures are still surfaced almost as fast in almost all cases (exceptions include doc builds, Python tests that only fail in conda, and build issues that only arise in conda-build or wheel builds), and overall CI resource usage by workflows that fail at any stage is substantially reduced.

I'm happy to discuss all of these changes, and I think implementing even a subset of them will make a meaningful impact on our CI turnaround time. Once we're happy with this and churn on other RAPIDS-wide CI changes reduces, we can port these changes to other repos as well.

Contributes to rapidsai/build-planning#5.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vyasr added the "improvement" (Improvement / enhancement to an existing function) and "non-breaking" (Non-breaking change) labels on Mar 1, 2024
@vyasr vyasr self-assigned this Mar 1, 2024
@jakirkham (Member) commented:

In the other PR the fastest full wheel test run on x86 took ~15 while the fastest conda test run took ~30).

Think we should add units here

Could we get a better sense of what we are measuring here? What are the steps in the Conda process and how long they take?

For example, if we are creating a lot of environments for different tests, that will take longer. We could create one environment once and do all the testing there. This comes with its own tradeoffs in terms of how contained/minimal the testing environment is, but it is a consideration.

Anyway, all this to say: I think we need more clarity on what we are measuring and what question we are trying to answer before proposing a solution.

@vyasr (Contributor, Author) commented Mar 1, 2024

Think we should add units here

Could you clarify what you mean by units? Is that just referring to what you ask below about breaking the runtimes up, or something else?

Could we get a better sense of what we are measuring here? What are the steps in the Conda process and how long they take?

Sure that's a good question. Let's look at this job from the other PR:

  • 23:41:17 - Make the environment (~2 min)
  • 23:43:20 - Download artifacts (~2 min)
  • 23:45:26 - Some trivial steps, then installing the downloaded artifacts (~1.5 min)
  • 23:46:52 - Start running tests (~14 min)
  • 00:00:50 - Start running benchmarks (~5.5 min)
  • 00:06:22 - Start running the benchmarks with pandas (~5 min)
  • 00:11:10 - End (~30 min total)

Unfortunately there isn't equivalent data in the current wheel jobs for comparison. However, it's immediately evident that it's not an apples-to-apples comparison:

  1. The wheel jobs aren't running either of the benchmark steps, which add about 11 mins here.
  2. The actual test run takes about 14 minutes.
  3. I would have liked to use a cuml PR for comparison after Bradley merged "Install test dependencies at the same time as cuml packages" (cuml#5781), since that would presumably reduce or even remove the extra 1.5 minutes spent installing the downloaded artifacts. Unfortunately, cuml's wheel and conda tests use different levels of parallelization (something else we can probably improve there), again making the comparison unfair.

If we assume that combining the installation of cudf/libcudf into the initial solve drops that time to 0, then the runs take about 18 minutes. That brings us near parity with the wheel jobs, which suggests that it doesn't matter too much which tests go first (conda-build overhead is much clearer in build jobs, since wheel and conda builds otherwise do essentially the same amount of work there). However, if even that 1-2 minute advantage for wheels is consistent, it gives us a reason to run wheels first (assuming that we're OK with the sequencing).

For example if we are creating a lot of environments for different tests, that will take longer. We could create one environment once and do all the testing there. This comes with its own tradeoffs in terms of how contained/minimal the testing environment is, but this is a consideration

We only create one environment, but the one improvement we could make is the one we've already done in cuml where we install the built artifacts during the initial solve instead of after the fact.
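As a sketch, the difference between the two patterns inside a test job might look like the steps below. The file names and the local channel path are hypothetical placeholders, not the actual cudf CI scripts.

```yaml
# Two-step pattern: create the env, then install the built artifacts
# afterwards, which triggers a second conda solve.
- name: Create test environment
  run: conda env create -n test -f env.yaml
- name: Install downloaded artifacts
  run: conda install -n test -c ./downloaded-artifacts libcudf cudf

# Single-solve pattern (the cuml#5781 approach): include the local
# artifact channel up front so the test dependencies and the freshly
# built packages are resolved together in one solve.
- name: Create test environment with artifacts
  run: |
    conda create -n test -c ./downloaded-artifacts -c rapidsai \
      -c conda-forge libcudf cudf --file test-requirements.txt
```

Beyond any time savings, the single-solve pattern also guarantees the solver sees all constraints at once, avoiding a post-hoc install that could force downgrades.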

Anyway, all this to say: I think we need more clarity on what we are measuring and what question we are trying to answer before proposing a solution.

Big picture what we want to solve is to reduce the amount of CI blockage. This PR tries to address that in two ways:

  • Reducing the number of redundant jobs (that's rapidsai/shared-workflows#184). Whether we do that by reducing conda or wheel jobs should be determined by which is faster; the numbers above suggest a slight edge for wheels.
  • Serializing more jobs to reduce the number of concurrent failures.

@bdice (Contributor) commented Mar 1, 2024

Weighing in on each proposed change:

  1. ✅ devcontainer builds now require style checks.

This is fine. It was probably an oversight originally.

  2. ❌ devcontainer builds now gate all other builds.
  3. ❌ Wheel tests now gate conda Python tests.

I want to strongly optimize for the straight-path (no queue) time. Introducing new gating between jobs will serialize tasks that we should execute in parallel if resources allow. The most important reason for this is that fixing CI-blocking issues will be dramatically slower if we introduce more gates between jobs.

  4. ⚠️ Reduce CI runs (shared-workflows#184) also reduces the number of conda tests in PRs (nightlies are unchanged).

This reduced matrix must cover one job with CUDA 11 and one job with CUDA 12. Conda packaging across CUDA 11 and CUDA 12 is not at all the same, and the coverage should alert us of general conda packaging issues.

Proposal here: https://github.com/rapidsai/shared-workflows/pull/184/files#r1509255538

  5. ❌ The conda tests of dask_cudf, cudf_kafka, and custreamz remain independent of the conda tests of cudf, but they are now gated by the dask_cudf wheel tests.
  6. ❌ doc builds now require tests to pass.

As above, optimize for the straight-line path. Do not serialize jobs that can run in parallel.

@bdice (Contributor) commented Mar 1, 2024

Generally, I think rapidsai/shared-workflows#184 will be enough to have a huge impact on our CI turnaround time (and reduce queueing) without needing to introduce more gates between jobs. I am a fan of rapidsai/shared-workflows#184 (with the modification I proposed to cover CUDA 11/12) and would support moving forward on that first, before we try to reorder or gate jobs.

@bdice (Contributor) commented Mar 1, 2024

A potential improvement to reduce GPU time that I would support is fail-fast: true in more of our test jobs. Currently we only do that for wheels. (More accurately, it defaults to true and we just don't define fail-fast: false for our wheel test jobs.)

I think a lot of our test failures are correlated, i.e. a failure on one conda test job is likely to have a failure on another conda test job, too.

If a job fails-fast due to a "sibling" job having a flaky network connection (or whatever unexpected issues), it's generally no harder (and doesn't consume much more GPU time) to rerun the whole set of failed jobs.
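For reference, a minimal sketch of what enabling this in a conda test matrix could look like (the job name, runner label, and matrix values here are illustrative, not the actual cudf configuration):

```yaml
jobs:
  conda-python-tests:
    runs-on: linux-gpu              # hypothetical GPU runner label
    strategy:
      fail-fast: true               # the GitHub Actions default; when one
                                    # matrix entry fails, in-progress
                                    # siblings are cancelled, freeing GPU
                                    # runners sooner
      matrix:
        cuda: ["11.8.0", "12.2.2"]  # illustrative values
        arch: [amd64, arm64]
    steps:
      - run: echo "test CUDA ${{ matrix.cuda }} on ${{ matrix.arch }}"
```

Since `fail-fast` defaults to true, the conda jobs must currently be setting `fail-fast: false` explicitly; removing that override would be enough to get this behavior.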

@bdice (Contributor) commented Mar 1, 2024

Regarding the single-step conda solves for cuml, I don't think it saves a great deal of time. It's just a boost for correctness (knowing you'll get compatible packages and not force the solver into a downgrade that it can't resolve).

I looked at some recent CI logs. The time spent on the second step (installing libcuml/cuml) was typically quite small. The majority of the initial conda setup time is spent in downloading packages, it seems. It has to pull down multiple GB of data, which can take ~5 minutes compared to the conda solve times which are each less than 30 seconds. It's hard to do apples-to-apples comparisons because the test startup time is dominated by network activity.

@vyasr (Contributor, Author) commented Mar 1, 2024

pip jobs will always be cheaper than conda jobs because wheels don't have to set up an entire set of libraries the way that conda does. We could improve this with caching (I plan to test that out next), but even then the conda cache will be much larger than the pip cache, so the same applies, if to a lesser degree. As a result, I don't know if an apples-to-apples comparison is really what we should be optimizing for, because the way conda works always entails some unavoidable setup overhead.

I want to strongly optimize for the straight-path (no queue) time. Introducing new gating between jobs will serialize tasks that we should execute in parallel if resources allow. The most important reason for this is that fixing CI-blocking issues will be dramatically slower if we introduce more gates between jobs.

Optimizing exclusively for the no-queue time doesn't make much sense because in practice we never have no queue. Probably twice a week someone posts a Slack query asking why their job hasn't started for X hours, and the answer is that the overall queue is large (and if a couple of people are asking on Slack, it's certainly happening far more often without anyone commenting). This frequently happens with CI-blocking issues too, and those provide a great counterpoint: the backlog right after a CI-blocking issue is resolved is massive, so we don't get much parallelism in practice when everyone reruns all their PRs. Having workflows fail faster on early jobs would significantly improve that.

There's a balance to be struck here though. I think a good compromise would be to maintain parallelism across all build (read: CPU-only node) jobs while still serializing more of the jobs that run on GPU nodes. WDYT?

would support moving forward on that first, before we try to reorder or gate jobs.

Totally fine merging the shared workflows changes first and then coming back to this PR after a few weeks depending on what we observe.

I think a lot of our test failures are correlated, i.e. a failure on one conda test job is likely to have a failure on another conda test job, too.

I agree. The issue is that our failures are also frequently correlated across jobs within the same workflow: doc builds, conda Python tests and wheel tests all tend to fail together, as do conda "other" python tests and dask-cudf wheel tests. I don't think there's a way to fail-fast between different jobs.

@bdice (Contributor) commented Mar 11, 2024

We can close this testing PR now that rapidsai/shared-workflows#184 is merged.

@bdice closed this Mar 11, 2024