Enhance FBGEMM nightly CI with cuDNN fix (pytorch#1407)
Summary:
This PR makes the FBGEMM nightly CI more usable and maintainable by using GitHub Actions more effectively, and it fixes a cuDNN installation issue. The changes land together so that the fix can be verified end to end by the improved CI.

## 1. Fix the cuDNN installation issue in FBGEMM nightly CI

We used `conda-forge` to install cuDNN, but this unintentionally added a CUDA 10 dependency to the FBGEMM nightly package, breaking installation steps that rely on FBGEMM nightly. To address this issue, the CI script now installs cuDNN separately so that it can coexist with CUDA 11.7, which is used by PyTorch-CUDA nightly.

## 2. Implement label-triggered wheel tests during a PR review

Previously, we needed to apply a hack to trigger wheel-related tests before merging a PR, which was neither comprehensive nor efficient. To address this issue, this PR adds optional wheel-related tests. A PR labeled `"test_wheel_nightly"`, `"test_wheel_prerelease"`, or `"test_wheel_release"` now runs the corresponding wheel-related tests. Because the YML files no longer need to be modified, these tests are easy to run. Please trigger them for any PR that touches the installation logic. Note that binaries are not pushed to PyPI when tests are triggered this way.

## 3. Remove duplicated logic across YML files

The nightly/release/CPU/GPU wheel scripts had most of their logic in common, but that logic was scattered across different files, which significantly lowered maintainability. To address this issue, this PR collects all the wheel-related test logic into a single file (`build_wheel.yml`) and makes the callers (e.g., the scheduled nightly trigger or label-triggered per-PR tests) pass flags to control the flow (e.g., per-PR tests do not push the wheel binary to PyPI). Removing the duplication improves maintainability.

<img width="1269" alt="Screen Shot 2022-10-21 at 9 58 56 AM" src="https://user-images.githubusercontent.com/15073003/197249555-bb00f3b1-9bc9-46f1-8fea-7874460723f6.png">

## 4. Support handy wheel-related tests on a local machine

Previously, the core wheel-related build/test logic was embedded in GitHub workflow files (`*.yml`) and mixed with AWS-specific commands, so testing the nightly logic on a local machine was very tedious and lowered developer efficiency. To address this issue, this PR extracts the core scripts from the GitHub workflow files and makes them standalone so that a developer can try the wheel build locally (i.e., without access to an AWS machine). The scripts use conda to create a virtual software environment, so they should be handy enough in many cases, though this is not as robust as a container-based solution.

```sh
# For example, check prerelease PyTorch (pytorch-test package) locally.
git clone https://github.com/pytorch/FBGEMM.git
cd FBGEMM
git submodule update --init --recursive
bash .github/scripts/build_wheel.bash -p 3.10 -c 11.7 -v -P pytorch-test -o fbgemm_gpu_test
bash .github/scripts/test_wheel.bash -p 3.10 -c 11.7 -v -P pytorch-test -w fbgemm_gpu/dist/fbgemm_gpu_test-2022.10.20-cp310-cp310-manylinux1_x86_64.whl
git clone https://github.com/pytorch/torchrec.git
cd torchrec
git submodule update --init --recursive
bash ../.github/scripts/test_torchrec.bash -o torchrec_nightly -p 3.10 -c 11.7 -v -P pytorch-test -w ../fbgemm_gpu/dist/fbgemm_gpu_test-2022.10.20-cp310-cp310-manylinux1_x86_64.whl
```

## 5. [Temporary] Add TorchRec integration test

TorchRec is one of our most important users, but we did not test TorchRec integration in the nightly CI, so changes in FBGEMM sometimes surprised the TorchRec developers. To address this issue, this PR adds a TorchRec integration test, as a temporary measure, that runs before a binary is pushed to PyPI so that we can make sure FBGEMM works with TorchRec. The downside is that a broken TorchRec can now break FBGEMM nightly; we will investigate a more sustainable solution that makes everyone happy.

## 6. Better interface to manually trigger the CI

It looks nice. I would like to automate release management as much as possible to reduce human errors at the very last minute. In the future, we can add more options if needed.

<img width="844" alt="Screen Shot 2022-10-21 at 5 53 03 AM" src="https://user-images.githubusercontent.com/15073003/197240190-e808df23-795a-451d-a7d9-ec1c43c68e92.png">

Pull Request resolved: pytorch#1407

Differential Revision: D40597700

Pulled By: shintaro-iwasaki

fbshipit-source-id: 150121661d58097164a589d506eeb842ee7ea905