
Intel GPU support initialization #1118

Merged: 58 commits, May 11, 2023

Conversation

@abhilash1910 (Contributor)

Intel GPU support for the Accelerate framework.
Initial draft PR:

  • Feature addition for the XPU device backend
  • CCL bindings for XPU devices
  • Sanity checks

PR is in progress (see the initialization sketch after this list).
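
For orientation, a minimal sketch of how an XPU backend with oneCCL bindings is typically initialized for distributed training. This is illustrative only, not the PR's actual code, and it assumes the intel_extension_for_pytorch and oneccl_bindings_for_pytorch packages are installed:

import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401  (assumption: provides the "xpu" device support)
import oneccl_bindings_for_pytorch  # noqa: F401  (assumption: registers the "ccl" backend)

def init_xpu_distributed() -> torch.device:
    # Sanity check: make sure an XPU device is actually visible.
    if not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        raise RuntimeError("No XPU device available")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    # oneCCL is the collective-communication backend used for XPU devices.
    dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)
    device = torch.device(f"xpu:{rank % torch.xpu.device_count()}")
    torch.xpu.set_device(device)
    return device

Accelerate's own wiring differs in the details; the point is only the backend name ("ccl") and the device type ("xpu") that this PR starts to enable.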

@abhilash1910 marked this pull request as draft February 26, 2023 07:29
@HuggingFaceDocBuilderDev commented Mar 2, 2023

The documentation is not available anymore as the PR was closed or merged.

@muellerzr marked this pull request as ready for review March 9, 2023 14:12
@muellerzr marked this pull request as draft March 9, 2023 14:12
@sgugger (Collaborator) left a comment

Thanks for opening this PR. I think it tries to do way too much at the same time and should be split into smaller blocks: we don't need support for DeepSpeed + XPUs (is that even a thing?), Megatron-LM + XPUs (same), or big model inference on XPUs all in the same PR as enabling XPUs for distributed training.

I suggest starting smaller by just pushing support for distributed training, then we can add the other features if there is interest from the community.

Review comments (outdated, resolved) on:
  • src/accelerate/state.py
  • src/accelerate/test_utils/scripts/test_script.py
  • src/accelerate/test_utils/scripts/test_sync.py
  • src/accelerate/utils/modeling.py
  • src/accelerate/utils/imports.py
  • src/accelerate/utils/megatron_lm.py
  • src/accelerate/utils/memory.py
@abhilash1910 (Contributor, Author) commented Mar 20, 2023

Hi @sgugger, thanks for reviewing the initial draft of changes 👍🏻. The PR is still in a debug state; we are working internally to ensure it works functionally. I will make the recommended changes in the meantime, before I request your review.

Review comments (outdated, resolved) on:
  • src/accelerate/accelerator.py
  • src/accelerate/checkpointing.py
@abhilash1910 (Contributor, Author)

@sgugger do you suggest removing Megatron-LM support for XPU? That seems OK, considering the PR first tries to enable distributed training.

Review comments (outdated, resolved) on:
  • src/accelerate/commands/env.py
  • src/accelerate/utils/launch.py
  • src/accelerate/utils/imports.py
  • src/accelerate/hooks.py
@sgugger (Collaborator) commented Apr 14, 2023

As I said before, please keep the changes minimal and progressively enable support in small increments that are easy to review. The first integration should focus only on a small training run on XPU; it shouldn't touch things like big model inference, Megatron-LM, etc.

@muellerzr (Collaborator) commented May 4, 2023

@abhilash1910 we have a number of tests failing:

On single GPU:

 FAILED tests/test_modeling_utils.py::ModelingUtilsTester::test_get_balanced_memory - AssertionError: {0: 215, 1: 300} != {0: 300, 1: 300}
- {0: 215, 1: 300}
?     ^^^

+ {0: 300, 1: 300}
?     ^^^

On multi-GPU:

FAILED tests/test_big_modeling.py::BigModelingTester::test_dispatch_model_bnb - AssertionError: False is not true

And most concerning, when testing the example scripts, it looks like the Accelerator is prioritizing the CPU over the GPU, even when ipex isn't available (which could also explain the prior test failures). This was tested on two T4s running the following:

CUDA_VISIBLE_DEVICES="0" pytest -sv tests/test_examples.py -k test_checkpointing_by_epoch

And in this script (located at examples/by_feature/checkpointing.py) I changed lines 124-130 to:

def training_function(config, args):
    # For testing only
    if os.environ.get("TESTING_MOCKED_DATALOADERS", None) == "1":
        config["num_epochs"] = 2
    # Initialize accelerator
    accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
    # Added line: fail fast if the accelerator did not pick up the GPU
    assert accelerator.device.type == "cuda", f"Device: {accelerator.device}, type: {accelerator.device.type}"
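
For context on the test_get_balanced_memory failure above, here is a rough sketch of the kind of check that test performs. This is illustrative only, not the actual test; it assumes accelerate's public get_balanced_memory utility and uses a stand-in model:

import torch.nn as nn
from accelerate.utils import get_balanced_memory

# Stand-in model; the real test uses its own fixture model.
model = nn.Sequential(nn.Linear(200, 200), nn.Linear(200, 200))

# Ask for a balanced budget across two devices, each capped at 300 bytes.
balanced = get_balanced_memory(model, max_memory={0: 300, 1: 300})

# The test asserts the returned dict equals an expected budget (one of the
# two dicts shown in the failure message above); the PR run produced the
# other one, hence the mismatch.
print(balanced)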

@abhilash1910 (Contributor, Author) commented May 4, 2023

@muellerzr is the failure only in tests with modeling.py? The prioritisation is strange; maybe it's due to checking for XPU before CUDA in some places. I will look into why this fails. Could you also highlight the test scripts that fail? Thanks
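
For illustration, a minimal sketch (a hypothetical helper, not Accelerate's actual code) of the ordering issue being discussed here: if the XPU/CPU branch is checked before CUDA, a CUDA machine without ipex can silently fall back to CPU, so CUDA availability needs to be checked explicitly first:

import torch

def xpu_available() -> bool:
    # torch.xpu is only usable when an XPU build / intel_extension_for_pytorch is present.
    return hasattr(torch, "xpu") and torch.xpu.is_available()

def pick_default_device() -> torch.device:
    # Checking CUDA before the XPU/CPU fallback avoids the reported behaviour
    # where a CUDA machine ends up running on "cpu".
    if torch.cuda.is_available():
        return torch.device("cuda")
    if xpu_available():
        return torch.device("xpu")
    return torch.device("cpu")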

@muellerzr (Collaborator) commented May 4, 2023

@abhilash1910 the two test files are in the trace I gave: test_modeling_utils and test_big_modeling. All the example tests are failing silently because what should take 6 minutes to run to completion takes 30+ minutes (on GPU).

Review comment (outdated, resolved) on: src/accelerate/utils/modeling.py
@abhilash1910 (Contributor, Author)

@muellerzr does this still fail in the slow tests?

@muellerzr (Collaborator)

@abhilash1910 we're now failing our CPU tests (notice that test_checkpoint_step has been running for 2+ hours without passing). Can you try running make test_checkpoint_step to see if it passes? (It should take no more than 2-3 minutes to run.) I'll run the slow tests after we get all green again.

@muellerzr (Collaborator)

@abhilash1910 looks like we're fine now actually, let me go run slow tests

@muellerzr (Collaborator) commented May 8, 2023

@abhilash1910 still not using CUDA on the example tests. Did you try making my modification to the checkpointing script to make sure it was using CUDA in practice? (If you don't have a GPU to use, you could try using Google Colab to test)

@abhilash1910 (Contributor, Author) commented May 9, 2023

Hi @muellerzr, I did some testing on Colab for checkpointing.py (using the CUDA assertion you gave) with the XPU wheel build (1.19.0-dev0 build), and when I specify args.cpu=None it does pick up the CUDA GPU device. I am sharing a snapshot:
[screenshot: the assertion passes and the CUDA device is used]
The Colab is here: https://colab.research.google.com/drive/15NZd13igA_S-jresSiKVcb6p2qXatziI?usp=sharing
This is also seen with the default version 1.19 of accelerate. So with cpu=None, it picks up CUDA and the assertion passes. Could you please check if anything is missing? I will re-check again.

@muellerzr (Collaborator)

@abhilash1910 this also needs to work without the XPU build, which is where I believe we're facing the breaking issue :) It still shows cpu for me.

@abhilash1910 (Contributor, Author)

@muellerzr I had tried the same notebook with the default public accelerate (1.19.0) but saw the same results on Colab (standard public: 1.19.0; XPU wheel: 1.19.0-dev0), as mentioned here: #1118 (comment). Both the public build and the XPU build pick up CUDA properly. I set args.cpu=None so as to use the CUDA GPU device in both cases (since Colab has no XPU). Could you suggest which scripts to check further? Thanks

@muellerzr (Collaborator)

@abhilash1910 you can find a full reproducer here, which includes a wget to grab the correct file with the change: https://gist.github.com/muellerzr/72155ad00fd83c20dab9173a5ce8b79b

@abhilash1910 (Contributor, Author) commented May 10, 2023

Thanks @muellerzr for sharing the test notebook. It was a tricky issue to resolve (it was in the cluster launcher args, hence it was not detected before), and the PR now passes the slow tests (both checkpointing and modeling utils) on a T4 with CUDA getting used. Let me know how it looks.
By the way, is there any issue with the CI tests? They seem to pass on Colab.

@muellerzr (Collaborator)

Great, slow tests are running

@muellerzr (Collaborator) left a comment

Tests are passing fine here! cc @sgugger for a final look

@sgugger (Collaborator) left a comment

Thanks again for all the work on this!

@muellerzr (Collaborator) commented May 12, 2023

@abhilash1910 it seems this has broken something in diffusers, please see this issue: #1420

Any insight towards what could be going wrong here? This is breaking some of our other libraries (this is on CPU)
