Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jax-IREE not working on MacOS-13 #4521

Open
kratman opened this issue Oct 17, 2024 · 8 comments · May be fixed by #4585
Open

Jax-IREE not working on MacOS-13 #4521

kratman opened this issue Oct 17, 2024 · 8 comments · May be fixed by #4585
Assignees
Labels
CI/CD Related to continuous integration, continuous deployment (GitHub Actions, workflows, testing, etc.) dependencies Pull requests that update a dependency file

Comments

@kratman
Copy link
Contributor

kratman commented Oct 17, 2024

While removing the deprecated MacOS-12 runners for GitHub actions, I ran into issues with IREE. The comments suggest that it should work with MacOS version 13, but the tests fail. It seems to work fine one MacOS-14 and Linux.

As a temporary fix, I suggested changed the version in PR #4520

            if (not sys.version_info[:2] == (3, 11)) or mac_ver < 14:
                warnings.warn(
                    (
                        "IREE is only supported on MacOS 13 (or higher) and Python"
                        "version 3.11. Setting PYBAMM_IDAKLU_EXPR_IREE=OFF."
                    ),
                    stacklevel=2,
                )
                return "OFF"

A few other things I noticed while looking into the failures:

  • Updating Jax, JaxLib, and IREE requires updating versions in multiple places which is not easy to maintain. This should be centralized somehow
    • jax versions in utils.py and pyproject.toml
    • IREE compiler versions are set in both noxfile.py and pyproject.toml
  • Newer versions of Jax/JaxLib have removed support for Python 3.9, so updating Jax and IREE could be difficult until we remove support for python 3.9
@kratman kratman added CI/CD Related to continuous integration, continuous deployment (GitHub Actions, workflows, testing, etc.) dependencies Pull requests that update a dependency file labels Oct 17, 2024
@agriyakhetarpal agriyakhetarpal added the release blocker Issues that need to be addressed before the creation of a release label Oct 17, 2024
@agriyakhetarpal
Copy link
Member

cc @jsbrittain who could help us with this. I had a macOS device running macOS 13, but I couldn't reproduce the crash when running the relevant tests from the test suite locally (ran them a few times). I updated it to macOS 14 today, so I don't know if we can reliably reproduce this outside of a CI runner.

@jsbrittain
Copy link
Contributor

@agriyakhetarpal just a very quick comment for the time being: as noted above, there are multiple components that need to be upgraded together, but this has to be done carefully. I actually took a look at this a couple of months back in an effort to get rid of the demotion code (IREE only runs at 32-bit precision at the present). Notes from that investigation are here and may prove helpful: jsbrittain#4

@jsbrittain
Copy link
Contributor

Some follow-up exploratory notes. I tested upgrading to the latest Jax/Jaxlib (0.4.34) and corresponding iree-compiler (1036) with relaxed version constraints on everything apart from Windows. I don't have a macOS-13 machine but using gh-runners produced a slightly different set of results to your previous run (see https://github.com/jsbrittain/PyBaMM/actions/runs/11391403398/job/31695386575?pr=5):

ubuntu
3.9 jax incompatible
3.10-3.12 operations cancelled

macos-13
3.9 jax incompatible
3.10 iree incompatible (only 3.11, 3.12 build available)
3.11 worker crashes
3.12 worker crashes

macos-14
3.9 jax incompatible
3.10 iree incompatible (only 3.11, 3.12 build available)
3.11 worker crashes
3.12 worker crashes

windows
3.9 jax incompatible
3.10-3.12 pass

Looking into the available wheels, here are the current offerings:
jax (0.4.34): manylinux,macos[10+,11+],win (all for python 3.10-3.13)
iree-compiler (1036): manylinux(py3.9-3.12), macos(13+)(py3.11-3.12), win(py3.11-3.12)

Re the macos worker crashes (all of which seem to occur on TestIDAKLUSolver), there are a few blogs suggesting that pytest-xdist (parallel testing) is associated with the 'node down: Not properly terminated' error that we are seeing when run with certain libraries, although its not clear exactly what is causing it. Curiously, my run crashed workers on both macos-13 and macos-14, so it seems to impact both, but may be less frequent on macos-14, or simply fragile to particular jax/iree/runner combinations. I have seen local/inline imports mentioned as problematic (as are used in places in pybamm). Some context here: https://github.com/pytest-dev/pytest/issues/ 3216.

On a slightly more positive note the jax/iree combination tested compiles and successfully solves locally (which is much more promising than my experience a few months back: jsbrittain#4). So, this may principally be a test problem.

I had a quick chat to @martinjrobins and - given the dependency issues - it may be an option to simply disable IREE for the time being. It is only enabled by building with specific user-configurable flags at present (PYBAMM_IDAKLU_EXPR_IREE=OFF by default) as it is still slow compared to casadi; in-fact, it is only being enabled in tests to maintain the functionality until it becomes more useable, but is not built into the wheels at all.

@agriyakhetarpal
Copy link
Member

Thanks for the details, @jsbrittain! Yes, that sounds like a good plan, and IREE has not been enabled yet for https://github.com/pybamm-team/pybammsolvers either. Though, another option could be to keep testing the IREE code, and instruct pytest-xdist to run the related tests in "serial mode" for now; that way, we still prevent the failures from creeping in and still have it in CI in case something is modified later – this could help make the task of updating to the JAX 0.4.34 + iree-compiler 1036 combo easier later on, what do you think?

Adding a decorator @pytest.mark.xdist_group(name="iree") will run all tests decorated with it on the same worker.

@kratman
Copy link
Contributor Author

kratman commented Oct 18, 2024

@jsbrittain, @agriyakhetarpal I think it might be better to remove IREE functionality for now instead of just disabling it and bring it back in the future when things improve. At the very least it should just be disabled for MacOS since allowing it for only a single python version makes it difficult to use. Python 3.9 reaches end of life about 1 year from now, so that part of the Jax-IREE complexity will go away when we remove support for 3.9

Reasons for removal:

  • IREE is not pinned to a stable release and the update process is tricky. Realistically we should only keep it if we can pin to an official stable release
  • IREE is only 32 bit, and as you said it is slower than Casadi and IDAKLU. It sounds like there is not much of a use for it until these downsides are resolved
  • We don't put IREE into the wheels, so the separation of the IDAKLU code into https://github.com/pybamm-team/pybammsolvers (i.e. [WIP] Download IDAKLU from pybammsolvers #4487) means that it would be unavailable unless we ship it with the wheels or the users build both from scratch

We can always bring back the IREE interface in the future or dig into if it is possible to build a "bring your own solver" plugin sort of architecture

@kratman
Copy link
Contributor Author

kratman commented Oct 18, 2024

@jsbrittain Of course I don't want to hinder any research that requires IREE functionality. Is there anything critical that depends on it?

@jsbrittain
Copy link
Contributor

@jsbrittain Of course I don't want to hinder any research that requires IREE functionality. Is there anything critical that depends on it?

@kratman None that I am aware of. The MLIR/IREE approach has the potential to offer fast and flexibilty expression evaluation, capable of lowering onto numerous architectures (including GPU), but it has not achieved that potential yet; at present we are working around problems (e.g. precision and compatibility) that may be resolved with more mature/stable releases.

@agriyakhetarpal agriyakhetarpal removed the release blocker Issues that need to be addressed before the creation of a release label Oct 18, 2024
@kratman
Copy link
Contributor Author

kratman commented Oct 18, 2024

Disabling IREE on MacOS was done in #4528, but it is still available on Linux

@kratman kratman self-assigned this Nov 12, 2024
@kratman kratman linked a pull request Nov 13, 2024 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Related to continuous integration, continuous deployment (GitHub Actions, workflows, testing, etc.) dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants