Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run GitHub CI on MacOS #598

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

bmillwood
Copy link
Contributor

@bmillwood bmillwood commented May 16, 2024

Description

Full disclosure, I intended to open this PR against my own fork and accidentally opened it here. Having done so, I after-the-fact said "well, why not see if folks want it anyway?" and here we are.

While looking into something that was failing on Mac MPS hardware, I decided to try to set up Mac CI, since I don't have an MPS device myself.

I think this PR does not solve my problem, because the GitHub Action runners appear not to use MPS. The macos-latest runner fails CI with this kind of message:

E   RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 94.26 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).

Notice in particular the numbers very strongly suggest there should be enough memory for this allocation. I think what's happening is that MPS is not actually available on GitHub Actions runners (this is rumoured e.g. in pytorch/pytorch#111449 (comment) but I don't see documentation about this from GitHub themselves), yet somehow the code tries to use MPS anyway.

This PR uses the macos-13 runner, which for whatever reason works just fine. Apple's docs suggest MPS has been available since macOS 12.3, but bmillwood#3 suggests the runners don't have it. Possibly they're just better at noticing they don't have it, and using a different code path?

Therefore there are two key questions raised by this PR:

  • Is it actually a good proxy for whether the library is usable by real Mac users?
  • How come the macos-13 runner works while macos-latest doesn't, and are we going to be able to upgrade it some day in the future, or are we eventually going to have to drop it as it ages out of existence?

Regardless of whether these tests are good proxies for Mac users, you could reasonably object that this isn't worth the extra time / complexity added to CI runs. If that's your concern, and you'd be mollified by running the macos tests on 3.10 only, let me know and I can make that change.

This PR also currently doesn't run the notebook tests on macos, since the pandoc install step would need updating, and I thought I might as well take temperature on the change as a whole before making any more efforts in that direction.

Type of change

  • New feature (non-breaking change which adds functionality)
    • ...kind of?

Screenshots

I think the PR speaks for itself in this regard.

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
    • none needed, IMO
  • My changes generate no new warnings
    • My changes strictly speaking generate new triggers for existing warnings (some of the shared actions we use need updating). I don't think this counts as a substantively new warning.
  • I have added tests that prove my fix is effective or that my feature works
    • the whole change is tests
  • New and existing unit tests pass locally with my changes
    • I didn't actually check because my changes have no effect on local runs
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@bmillwood bmillwood force-pushed the run-checks-on-macos branch from 7bc1e6d to da6fb8a Compare May 16, 2024 13:48
@bmillwood bmillwood changed the title PR for running checks on MacOS Run GitHub CI on MacOS May 16, 2024
@bmillwood bmillwood force-pushed the run-checks-on-macos branch from da6fb8a to 68bfbdb Compare May 16, 2024 14:03
@bmillwood bmillwood changed the base branch from main to dev May 16, 2024 14:03
@bmillwood bmillwood force-pushed the run-checks-on-macos branch 3 times, most recently from ab31b8e to e818a24 Compare May 16, 2024 14:26
@bmillwood bmillwood marked this pull request as ready for review May 16, 2024 15:05
@bryce13950
Copy link
Collaborator

Thank you for spending the time working on this. I do think there is value in this, however I think it should be changed slightly.
On the notebook tests, I wouldn't bother running those on macOS via CI at all. They are only configured to run on either cuda or cpu devices. I know for sure that a few of those demos do not work on MPS devices, and running them on a macOS runner via CPU is going to take a long time. Finally, most people on Macs are probably going to look at those demos in colab anyways, so all of that combines probably spells out little value for that particular area.
As for the actual value in having a MPS running on the code... I do think there is quite a bit of value in that. I don't think it needs to run when changes are put into dev, and I also don't think it needs to run through all compatibility checks. What I think would be ideal is if it was configured to run on PRs and pushes to main, and on python 3.11. The reason for only doing main is that I don't think it is reasonable to require people submitting PRs to resolve errors on MPS devices. A lot of people simply do not have access to those devices, and trying to debug something based on what the CI says can be very frustrating. We can leave debugging MPS for when new releases are being put up. We also probably only need to run it on the latest supported version of Python, since MPS devices only account for m-series Macs, which I believe were released with python 3.10 as their default installed version. Everyone using them is probably also working on the latest version of macOS, which comes with 3.11 by default.
As for macos-latest not working, those are probably still running on Intel Macs, thus they do not have mps hardware. I would suspect that in the next year those will all begin supporting mps.

@bmillwood
Copy link
Contributor Author

As for the actual value in having a MPS running on the code... I do think there is quite a bit of value in that.

Sure, but my best guess is that GitHub runners don't have / can't use MPS! bmillwood#3 shows that assert torch.backends.mps.is_available() fails on those runners (in addition to failing on ubuntu runners, obviously)

As for macos-latest not working, those are probably still running on Intel Macs, thus they do not have mps hardware.

In fact it may be the other way around -- https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories only explicitly gives M1 as the CPU for macos-latest (= macos-14), and doesn't explicitly say what macos-13 has, and https://github.com/actions/runner-images?tab=readme-ov-file#available-images suggests that macos-13 is non-ARM and macos-14 is ARM, though it's maddeningly non-explicit about any of these things.

@bryce13950
Copy link
Collaborator

I reviewed the results you referenced, and I have a few thoughts here. The tests that you ran were all done on python 3.8, 3.9, and 3.10. After looking around in the code a bit further, I realized that TransformerLens probably only supports arm Macs in python 3.11. When the library selects what device to use, it only selects mps, if the installed torch version is over 2.0. Torch was release in March 2023, which was about half a year after python 3.11 was released. I know that at least the first year of arm Macs had very rocky support for a huge range of things, so the possibility of incompatibilities of python+torch+mps increases greatly with every version you go backwards from 3.11. Perhaps the issues with the runners failing on order versions could be somehow related to this? There are still a lot of incompatibilities of certain features of torch with MPS. It is improving rapidly, but seeing that the runner failed on older versions of python is not really all that surprising.
Going back to my first comment, I think we should revise this to only run on macOS-latest with python 3.11, and start from there. Maybe after doing that, we still find that GitHub is simply not providing mps, but if they aren't providing it today, I would put my money on them supporting it completely at some point in the next year. I don't know if I would say that we can gain many facts by looking at this on anything but python 3.11, or 3.12, which is currently simply not compatible with TransformerLens due to some pending dependency updates.
Another thing to think about is that maybe on macos-13 mps is showing as available, but there's a distinct possibility that torch.backends.mps.is_built() returns false. If those are Intel Macs, then they probably still have the API in the source code, but you can only know if it is capable of being used by checking that second variable. If it is actually capable of being used, both available and is_built should be true.
If you can update this to run python 3.11 on macos-latest, then I would be very curious to look at those results, and help debug this to get it up.

bmillwood added 4 commits May 23, 2024 11:57
This on the theory that it's only MPS specifically that doesn't work,
and MacOS 13 maybe (?) doesn't have MPS
@bmillwood bmillwood force-pushed the run-checks-on-macos branch from e818a24 to 7328ebd Compare May 23, 2024 11:03
@bmillwood
Copy link
Contributor Author

After looking around in the code a bit further, I realized that TransformerLens probably only supports arm Macs in python 3.11. When the library selects what device to use, it only selects mps, if the installed torch version is over 2.0. Torch was release in March 2023, which was about half a year after python 3.11 was released

I don't think Python version affects torch version in the way that you imply. Even the latest stable version of PyTorch (2.3.0) is still compatible with Python versions all the way back to 3.8. The torch version is pinned in poetry.lock so it will be the same regardless of the Python version. (We could upgrade the version listed there, potentially, but that is its own can of worms.)

Nevertheless, I rebased and added 3.11. macos-latest still fails, and macos-13still passes. I also updated the tests I'm running in bmillwood#3, and we now see a couple of interesting things:

  • torch.backends.mps.is_built() is true on both versions, and is_available() is false on macos-13 but true on macos-latest. (To get the test to run at all on macos-latest, I had to do More pytest fixtures #609).
  • I ran system_profiler, a command line tool for showing Mac system information, in a separate job, to avoid PyTorch stuff altogether, and it seems like on macos-13 we see a virtualized graphics device, whereas on macos-latest, we see no graphics device at all. Of course, this is a bit tricky to interpret, but I think it supports the idea that the macos-latest runner doesn't have graphics capability at all.

@bryce13950
Copy link
Collaborator

Yeah I did some digging earlier in the day as well, and as you have deduced it seems like MPS is simply not possible at this time. I am not sure if you have already seen this comment actions/runner-images#9254 (comment), but it seems like it's an issue that has been brought to the attention of GitHub. I opened an issue for it actions/runner-images#9918, so hopefully someone at least looks at it. I don't understand why would have not made it available. Regardless, for our purposes, I think we should just leave this until it is added. There really isn't any reason to run this on Mac OS CPU devices, since the possibilities of having issues there different from Ubuntu are incredibly slim. There is a ton of value if we can get this running on MPS though. Unfortunately, we are just going to have to wait. Maybe with macOS 15 next month, they will allow MPS to be used finally. Thank you for putting the time into experimenting with this, and fingers crossed that we will be able to close this out and merge it soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants