Intel GPU support initialization #1118
Conversation
Thanks for opening this PR. I think it tries to do way too much at the same time and should be split into smaller blocks: we don't need to have support for DeepSpeed + XPUs (is that even a thing?) or Megatron-LM + XPUs (same) or big model inference on XPUs all in the same PR as enabling XPUs for distributed training.
I suggest starting smaller by just pushing support for distributed training, then we can add the other features if there is interest from the community.
Hi @sgugger, thanks for reviewing the initial draft of changes 👍🏻. The PR is still in a debug state; we are working internally to ensure it works functionally. I will make the recommended changes in the meantime, before I request your review.
@sgugger do you suggest removing Megatron-LM support for XPU? I guess that seems OK, considering the PR first tries to enable distributed training.
As I said before, please keep the changes minimal to progressively enable support in small increments that are easy to review. The first integration should focus only on a small training run on XPU. It shouldn't touch things like big model inference, Megatron-LM, etc.
@abhilash1910 we have a number of tests failing:

On single GPU:

On multi-GPU:

And most concerning, when testing the example scripts, it looks like the Accelerator is prioritizing the CPU over the GPU, even when ipex isn't available (which could also explain the prior test failures as well). This was tested on two T4s running the following:

CUDA_VISIBLE_DEVICES="0" pytest -sv tests/test_examples.py -k test_checkpointing_by_epoch

And in this script (located at …), with the assertion added:

def training_function(config, args):
    # For testing only
    if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
        config["num_epochs"] = 2
    # Initialize accelerator
    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
+   assert accelerator.device.type == "cuda", f'Device: {accelerator.device}, type: {accelerator.device.type}'
@muellerzr is the failure only on tests with modelling.py? The prioritisation is strange, maybe due to checking for XPU before CUDA in some places. I will see why this fails. Could you also highlight the test scripts which fail? Thanks
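For illustration only, a minimal sketch of the device-selection ordering being discussed (a hypothetical pick_device helper, not Accelerate's actual internals): the suspected bug is a branch that concludes CPU when XPU is unavailable, without ever trying CUDA.

import torch

def pick_device():
    # Hypothetical: if the XPU branch fell straight through to CPU when
    # ipex is absent, CUDA would never be considered -- the suspected bug.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    # Correct ordering: try CUDA before concluding CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")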
@abhilash1910 the two test files are in the trace I gave.
@muellerzr does this still fail in the slow tests?
@abhilash1910 we're now failing our CPU tests (notice that …)
@abhilash1910 looks like we're fine now actually, let me go run slow tests.
@abhilash1910 still not using CUDA on the example tests. Did you try making my modification to the checkpointing script to make sure it was using CUDA in practice? (If you don't have a GPU to use, you could try Google Colab to test.)
Hi @muellerzr, I did some testing on Colab for checkpointing.py (using the CUDA assertion you gave) with the XPU wheel build (1.19.0-dev0 build), and when I specify args.cpu=None it does pick up the CUDA GPU device. I am sharing a snapshot:
@abhilash1910 this also needs to work without the XPU build, which is where I believe we're facing the breaking issue :) As it still shows …
@muellerzr I had tried the same notebook with the default public accelerate (1.19.0) but saw the same results on Colab (standard public: 1.19.0; XPU wheel: 1.19.0-dev0), as mentioned here: #1118 (comment). Both the public and the XPU builds pick up CUDA properly. I set args.cpu=None so as to use the CUDA GPU device in both cases (since Colab has no XPU). Could you suggest which scripts to check further? Thanks
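For reference, a minimal sketch of the check being run on Colab, using the standard Accelerator API with the assertion suggested earlier in the thread:

from accelerate import Accelerator

# With cpu=None, Accelerate auto-selects the best available device.
accelerator = Accelerator(cpu=None)
assert accelerator.device.type == "cuda", f"Got {accelerator.device} instead"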
@abhilash1910 you can find a full reproducer here, which includes a wget to grab the correct file with the change: https://gist.github.com/muellerzr/72155ad00fd83c20dab9173a5ce8b79b |
Thanks @muellerzr for sharing the test notebook. It was a tricky issue to resolve (it was with the cluster launcher args, hence it was not detected before), and it now passes the slow tests (both checkpointing and modeling utils) on a T4 with CUDA getting used. Let me know.
Great, slow tests are running.
Tests are passing fine here! cc @sgugger for a final look
Thanks again for all the work on this!
@abhilash1910 it seems this has broken something in diffusers; please see this issue: #1420. Any insight into what could be going wrong here? This is breaking some of our other libraries (this is on CPU).
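A minimal sketch of the kind of CPU-only path that issue exercises (a hypothetical repro under the assumption that no GPU/XPU is present, not the actual diffusers code):

from accelerate import Accelerator

# On a CPU-only machine this should yield a CPU device; the linked
# diffusers issue suggests the XPU changes regressed this path.
accelerator = Accelerator(cpu=True)
assert accelerator.device.type == "cpu", f"Unexpected device: {accelerator.device}"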
Intel GPU Support for Accelerate Framework

Initial draft PR:
- Feature addition for the XPU device backend

PR is in progress.
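As a rough sketch of what enabling the XPU backend involves (assuming Intel Extension for PyTorch registers a torch.xpu namespace on import; illustrative only, not the PR's exact code):

import torch

try:
    # Importing ipex registers Intel GPU ("xpu") support with PyTorch.
    import intel_extension_for_pytorch  # noqa: F401
    xpu_available = hasattr(torch, "xpu") and torch.xpu.is_available()
except ImportError:
    xpu_available = False

device = torch.device("xpu" if xpu_available else "cpu")
print(f"Selected device: {device}")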