Simplify matrix configuration for CI workflows #1213

Merged: 2 commits into llvm:main from sambhav/build_reconfig on Aug 11, 2022

Member

@sjain-stanford sjain-stanford commented Aug 11, 2022

Addresses #1207.

Provisioned jobs:

# ubuntu - x86_64 - llvm in-tree     - pytorch binary - build+test    # most used dev flow and fastest signal
# ubuntu - x86_64 - llvm out-of-tree - pytorch source - build+test    # most elaborate build
# macos  - arm64  - llvm in-tree     - pytorch source - build only    # cross compile, can't test arm64

Main changes

  • Spawn macos builds from a separate matrix (in the same workflow). It made sense to do this as they are fairly different from ubuntu (cross compile, use a different cmake configuration). This simplifies the matrix configuration and exclusions quite a bit, and makes the workflow a bit more tractable and maintenance friendly.
  • Remove the submodule md5sum step for ccache config. This has been broken for a while.
  • Remove unused matrix options: os, targetarch, python-version, llvmtype.
  • Address ZSTD comment on @powderluv's cross compile PR.
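
For readers unfamiliar with the layout described in the first bullet, here is a minimal sketch of two jobs in one workflow, each with its own matrix. It is an illustration only, not the workflow file from this PR: the job names, matrix keys (`llvm-build`, `torch-binary`), runner labels, and placeholder steps are all assumptions.

```yaml
# Hypothetical sketch: names, matrix keys, and steps are assumptions,
# not the actual torch-mlir workflow.
name: Build and Test
on:
  pull_request:
  push:
    branches: [main]

jobs:
  # ubuntu x86_64: build + test, two configurations
  ubuntu-x86_64:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          # Quote ON/OFF so YAML does not parse them as booleans.
          - llvm-build: in-tree
            torch-binary: "ON"
          - llvm-build: out-of-tree
            torch-binary: "OFF"
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - name: Build and test
        run: echo "placeholder for the ubuntu build and test commands"

  # macos arm64: cross compile, build only; has its own matrix,
  # so no exclusions are needed in the ubuntu matrix
  macos-arm64:
    runs-on: macos-latest
    strategy:
      matrix:
        include:
          - llvm-build: in-tree
            torch-binary: "OFF"
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - name: Build only
        run: echo "placeholder for the macos cross compile commands"
```

Because each job lists its configurations explicitly with `include`, neither matrix needs `exclude` entries to prune combinations that only make sense on the other OS.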

Further improvements (to be addressed in follow-on):

  • ubuntu-x86_64 out-of-tree integration tests fail (error); only run unit tests for now (tests are excluded in current CI too)

Passing workflow:

https://github.com/sjain-stanford/torch-mlir/actions/runs/2840676309

@sjain-stanford sjain-stanford force-pushed the sambhav/build_reconfig branch from a8b8ac5 to b11f37c Compare August 11, 2022 16:46
@powderluv
Collaborator

This is looking great. The arm64 Pytorch src build should work if you remove this one line

pip uninstall torch

It is trying to uninstall the systemwide torch.

@sjain-stanford
Member Author

> The arm64 Pytorch src build should work if you remove this one line

Thanks, patched. Let's wait to see if the arm64 pytorch source workflow goes through with the fix. If there are more errors, I can revert to pytorch binary and land it for now (to avoid cache evictions the longer this is open). I'll wait for an "all green" CI before landing, but if this looks good otherwise, please feel free to ✅ this.

Collaborator

@powderluv powderluv left a comment

LGTM. Feel free to switch arm64 pytorch source in a follow on.

@sjain-stanford sjain-stanford merged commit f00ca91 into llvm:main Aug 11, 2022
@powderluv
Collaborator

nicely done. The silly cache gets generated again when it merges so we got to wait for it again

@sjain-stanford
Member Author

> The silly cache gets generated again when it merges so we got to wait for it again

Ah, I was wondering why it didn't restore from cache after landing, given that the keys didn't change. Good to know this is normal. Maybe it treats GHA runs on PRs differently from runs on push to main. Oh well...

@sjain-stanford
Member Author

... and thank you for the help in reviewing it!

sjain-stanford added a commit that referenced this pull request Aug 12, 2022
My earlier [PR](#1213) had (among other things) decoupled ubuntu and macos builds into separate matrix runs. This is not working well: the limited number of macOS GHA VMs leads to long queue times and a backlog. Two things cause this backlog:

1. macos arm64 builds with pytorch source are getting erratically cancelled due to resource / network constraints. This is addressed in #1215.

> "macos-arm64 (in-tree, OFF) The hosted runner: GitHub Actions 3 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."

2. macos runs don't fail fast when ubuntu runs fail, because they are in separate matrix setups. This PR couples them again.
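
To illustrate the fail-fast point: in GitHub Actions, `strategy.fail-fast` (default `true`) only cancels in-progress jobs that belong to the same matrix, so jobs in a different matrix keep running. A rough sketch of a single coupled matrix follows. It is not the workflow from the follow-up commit: the job name, matrix keys, runner labels, and placeholder step are assumptions.

```yaml
# Hypothetical sketch: names, matrix keys, and steps are assumptions.
jobs:
  build-test:
    runs-on: ${{ matrix.runner }}
    strategy:
      # With one shared matrix, a failing ubuntu job cancels the queued or
      # running macos job as well, freeing up the scarce macOS runners.
      fail-fast: true
      matrix:
        include:
          - os-arch: ubuntu-x86_64
            runner: ubuntu-latest
            llvm-build: in-tree
            torch-binary: "ON"
          - os-arch: ubuntu-x86_64
            runner: ubuntu-latest
            llvm-build: out-of-tree
            torch-binary: "OFF"
          - os-arch: macos-arm64
            runner: macos-latest
            llvm-build: in-tree
            torch-binary: "OFF"
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - name: Build
        run: echo "placeholder for the per-platform build commands"
```

The trade-off is that platform-specific steps then need `if:` conditions on `matrix.os-arch` (or similar), which is the complexity the separate-matrix layout was meant to avoid.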
qedawkins pushed a commit to nod-ai/torch-mlir that referenced this pull request Oct 3, 2022
@sjain-stanford sjain-stanford deleted the sambhav/build_reconfig branch November 10, 2022 19:53
Successfully merging this pull request may close these issues.

clean up llvmtype / buildtype in Github workflows