
Intel GPU support initialization #1118

Merged: 58 commits, May 11, 2023

Conversation

@abhilash1910 (Contributor)

Intel GPU support for the Accelerate framework.
Initial draft PR:

  • Feature addition for the XPU device backend
  • CCL bindings for XPU devices
  • Sanity checks

PR is in progress (see the initialization sketch after this list).
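
For orientation, a minimal sketch of how an XPU backend with oneCCL bindings is typically initialized for distributed training. This is illustrative only, not the PR's actual code, and it assumes the intel_extension_for_pytorch and oneccl_bindings_for_pytorch packages are installed:

import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  (assumption: provides the "xpu" device support)
import oneccl_bindings_for_pytorch  # noqa: F401  (assumption: registers the "ccl" backend)

def init_xpu_distributed() -> torch.device:
    # Sanity check: make sure an XPU device is actually visible.
    if not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        raise RuntimeError("No XPU device available")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    # oneCCL is the collective-communication backend used for XPU devices.
    dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)
    device = torch.device(f"xpu:{rank % torch.xpu.device_count()}")
    torch.xpu.set_device(device)
    return device

Accelerate's own wiring differs in the details; the point is only the backend name ("ccl") and the device type ("xpu") that this PR starts to enable.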

@abhilash1910 marked this pull request as draft February 26, 2023 07:29
@HuggingFaceDocBuilderDev commented Mar 2, 2023

The documentation is not available anymore as the PR was closed or merged.

@muellerzr marked this pull request as ready for review March 9, 2023 14:12
@muellerzr marked this pull request as draft March 9, 2023 14:12
@sgugger (Collaborator) left a comment

Thanks for opening this PR. I think it tries to do way too much at the same time and should be split into smaller blocks: we don't need support for DeepSpeed + XPUs (is that even a thing?), Megatron-LM + XPUs (same), or big model inference on XPUs all in the same PR as enabling XPUs for distributed training.

I suggest starting smaller by just pushing support for distributed training, then we can add the other features if there is interest from the community.

Review comments (outdated, resolved) on:
  • src/accelerate/state.py
  • src/accelerate/test_utils/scripts/test_script.py
  • src/accelerate/test_utils/scripts/test_sync.py
  • src/accelerate/utils/modeling.py
  • src/accelerate/utils/imports.py
  • src/accelerate/utils/megatron_lm.py
  • src/accelerate/utils/memory.py
@abhilash1910 (Contributor, Author) commented Mar 20, 2023

Hi @sgugger, thanks for reviewing the initial draft of changes 👍🏻. The PR is still in a debug state; we are working internally to ensure it works functionally. I will make the recommended changes in the meantime, before I request your review.

Review comments (outdated, resolved) on:
  • src/accelerate/accelerator.py
  • src/accelerate/checkpointing.py
@abhilash1910 (Contributor, Author)

@sgugger do you suggest removing Megatron-LM support for XPU? That seems OK, considering the PR first tries to enable distributed training.

Review comments (outdated, resolved) on:
  • src/accelerate/commands/env.py
  • src/accelerate/utils/launch.py
  • src/accelerate/utils/imports.py
  • src/accelerate/hooks.py
@sgugger (Collaborator) commented Apr 14, 2023

As I said before, please keep the changes minimal and progressively enable support in small increments that are easy to review. The first integration should focus only on a small training run on XPU; it shouldn't touch things like big model inference, Megatron-LM, etc.

@muellerzr (Collaborator) commented May 4, 2023

@abhilash1910 we have a number of tests failing:

On single GPU:

 FAILED tests/test_modeling_utils.py::ModelingUtilsTester::test_get_balanced_memory - AssertionError: {0: 215, 1: 300} != {0: 300, 1: 300}
- {0: 215, 1: 300}
?     ^^^

+ {0: 300, 1: 300}
?     ^^^

On multi-GPU:

FAILED tests/test_big_modeling.py::BigModelingTester::test_dispatch_model_bnb - AssertionError: False is not true

And most concerning, when testing the example scripts, it looks like the Accelerator is prioritizing the CPU over the GPU, even when ipex isn't available (which could also explain the prior test failures). This was tested on two T4s running the following:

CUDA_VISIBLE_DEVICES="0" pytest -sv tests/test_examples.py -k test_checkpointing_by_epoch

And in this script (located at examples/by_feature/checkpointing.py) I changed lines 124-130 to:

def training_function(config, args):
    # For testing only
    if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
        config["num_epochs"] = 2
    # Initialize accelerator
    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
    # Added line: fail fast if the accelerator did not pick up the GPU
    assert accelerator.device.type == "cuda", f"Device: {accelerator.device}, type: {accelerator.device.type}"
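
For context on the test_get_balanced_memory failure above, here is a rough sketch of the kind of check that test performs. This is illustrative only, not the actual test; it assumes accelerate's public get_balanced_memory utility and uses a stand-in model:

import torch.nn as nn
from accelerate.utils import get_balanced_memory

# Stand-in model; the real test uses its own fixture model.
model = nn.Sequential(nn.Linear(200, 200), nn.Linear(200, 200))

# Ask for a balanced budget across two devices, each capped at 300 bytes.
balanced = get_balanced_memory(model, max_memory={0: 300, 1: 300})

# The test asserts the returned dict equals an expected budget (one of the
# two dicts shown in the failure message above); the PR run produced the
# other one, hence the mismatch.
print(balanced)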

@abhilash1910 (Contributor, Author) commented May 4, 2023

@muellerzr is the failure only in tests with modeling.py? The prioritisation is strange; maybe it's due to checking for XPU before CUDA in some places. I will look into why this fails. Could you also highlight the test scripts that fail? Thanks
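
For illustration, a minimal sketch (a hypothetical helper, not Accelerate's actual code) of the ordering issue being discussed here: if the XPU/CPU branch is checked before CUDA, a CUDA machine without ipex can silently fall back to CPU, so CUDA availability needs to be checked explicitly first:

import torch

def xpu_available() -> bool:
    # torch.xpu is only usable when an XPU build / intel_extension_for_pytorch is present.
    return hasattr(torch, "xpu") and torch.xpu.is_available()

def pick_default_device() -> torch.device:
    # Checking CUDA before the XPU/CPU fallback avoids the reported behaviour
    # where a CUDA machine ends up running on "cpu".
    if torch.cuda.is_available():
        return torch.device("cuda")
    if xpu_available():
        return torch.device("xpu")
    return torch.device("cpu")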

@muellerzr (Collaborator) commented May 4, 2023

@abhilash1910 the two test files are in the trace I gave: test_modeling_utils and test_big_modeling. All the example tests are failing silently because what should take 6 minutes to run to completion takes 30+ minutes (on GPU).

Review comment (outdated, resolved) on: src/accelerate/utils/modeling.py
@abhilash1910 (Contributor, Author)

@muellerzr does this still fail in the slow tests?

@muellerzr (Collaborator)

@abhilash1910 we're now failing our CPU tests (notice that test_checkpoint_step has been running for 2+ hours without passing). Can you try running make test_checkpoint_step to see if it passes? (It should take no more than 2-3 minutes to run.) I'll run the slow tests after we get all green again.

@muellerzr (Collaborator)

@abhilash1910 looks like we're fine now actually, let me go run slow tests

@muellerzr (Collaborator) commented May 8, 2023

@abhilash1910 still not using CUDA on the example tests. Did you try making my modification to the checkpointing script to make sure it was using CUDA in practice? (If you don't have a GPU to use, you could try using Google Colab to test)

@abhilash1910 (Contributor, Author) commented May 9, 2023

Hi @muellerzr, I did some testing on Colab for checkpointing.py (using the CUDA assertion you gave) with the XPU wheel build (1.19.0-dev0 build), and when I specify args.cpu=None it does pick up the CUDA GPU device. I am sharing a snapshot:
[screenshot: the assertion passes and the CUDA device is used]
The Colab is here: https://colab.research.google.com/drive/15NZd13igA_S-jresSiKVcb6p2qXatziI?usp=sharing
This is also seen with the default version 1.19 of accelerate. So with cpu=None, it picks up CUDA and the assertion passes. Could you please check if anything is missing? I will re-check again.

@muellerzr (Collaborator)

@abhilash1910 this also needs to work without the XPU build, which is where I believe we're facing the breaking issue :) It still shows cpu for me.

@abhilash1910 (Contributor, Author)

@muellerzr I had tried the same notebook with the default public accelerate (1.19.0) but saw the same results on Colab (standard public: 1.19.0; XPU wheel: 1.19.0-dev0), as mentioned here: #1118 (comment). Both the public build and the XPU build pick up CUDA properly. I set args.cpu=None so as to use the CUDA GPU device in both cases (since Colab has no XPU). Could you suggest which scripts to check further? Thanks

@muellerzr (Collaborator)

@abhilash1910 you can find a full reproducer here, which includes a wget to grab the correct file with the change: https://gist.github.com/muellerzr/72155ad00fd83c20dab9173a5ce8b79b

@abhilash1910 (Contributor, Author) commented May 10, 2023

Thanks @muellerzr for sharing the test notebook. It was a tricky issue to resolve (it was in the cluster launcher args, hence it was not detected before), and the PR now passes the slow tests (both checkpointing and modeling utils) on a T4 with CUDA getting used. Let me know how it looks.
By the way, is there any issue with the CI tests? They seem to pass on Colab.

@muellerzr (Collaborator)

Great, slow tests are running

@muellerzr (Collaborator) left a comment

Tests are passing fine here! cc @sgugger for a final look

@sgugger (Collaborator) left a comment

Thanks again for all the work on this!

@muellerzr (Collaborator) commented May 12, 2023

@abhilash1910 it seems this has broken something in diffusers, please see this issue: #1420

Any insight towards what could be going wrong here? This is breaking some of our other libraries (this is on CPU)
