[TPU] Call torch._sync(param) during weight loading #9437

WoosukKwon · 2024-10-17T00:10:28Z

During weight loading, we often do something like:

narrowed_tensor = param.data.narrow(0, offset, len)
narrowed_tensor.copy_(real_weight)

expecting narrowed_tensor and param.data to share the same storage. However, on TPUs, the in-place update to narrowed_tensor propagates only lazily to the base tensor, param.data, which leads to redundant memory usage. This sometimes causes OOM errors during model loading.

This PR addresses the problem by adding a post-hook that calls torch._sync(param) after each param's weight loader is called.
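A minimal sketch of the post-hook idea described above, not the code from this PR. The names make_synced_weight_loader, sync_fn, fake_weight_loader, and fake_sync are hypothetical; on a TPU the sync step would be torch._sync(param), but here it is injected as a plain callable so the sketch runs without torch or TPU hardware.

```python
from typing import Any, Callable, List

WeightLoader = Callable[..., None]


def make_synced_weight_loader(
    original_weight_loader: WeightLoader,
    sync_fn: Callable[[Any], None],
) -> WeightLoader:
    """Wrap a weight loader so that sync_fn(param) runs right after it."""

    def synced_weight_loader(param: Any, *args: Any, **kwargs: Any) -> None:
        original_weight_loader(param, *args, **kwargs)
        # Post-hook. Assumption: on TPU, sync_fn would be torch._sync,
        # which materializes the pending in-place update on param.
        sync_fn(param)

    return synced_weight_loader


# Stand-ins (hypothetical) that record the call order.
events: List[str] = []


def fake_weight_loader(param: dict, loaded_weight: list) -> None:
    param["data"] = list(loaded_weight)
    events.append("load")


def fake_sync(param: dict) -> None:
    events.append("sync")


param = {"data": None}
loader = make_synced_weight_loader(fake_weight_loader, fake_sync)
loader(param, [1.0, 2.0])
print(events)
```

Wrapping the loader rather than editing every call site keeps the sync in one place, so each model's weight-loading code can stay unchanged.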

When loading Llama3-8B (bf16) on v5e-8,

Before this PR: 3.4 GB allocated after weight loading
After this PR: 2.0 GB allocated after weight loading
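As a rough sanity check on these numbers, the sketch below estimates the weight footprint. It rests on assumptions that the PR does not state: roughly 8.03 billion parameters for Llama3-8B, weights sharded evenly across the 8 chips of a v5e-8, and the reported figures being per chip.

```python
# Assumptions (not stated in the PR): parameter count, even sharding
# across 8 chips, and that the reported figures are per chip.
NUM_PARAMS = 8.03e9      # approximate Llama3-8B parameter count
BYTES_PER_PARAM = 2      # bf16 uses 2 bytes per value
NUM_CHIPS = 8            # v5e-8


def expected_gb_per_chip(num_params: float, bytes_per_param: int,
                         num_chips: int) -> float:
    """Weight bytes per chip in decimal GB, assuming even sharding."""
    return num_params * bytes_per_param / num_chips / 1e9


expected = expected_gb_per_chip(NUM_PARAMS, BYTES_PER_PARAM, NUM_CHIPS)
extra_before_fix = 3.4 - 2.0  # figures reported in the PR description

print(f"expected weights per chip: {expected:.1f} GB")
print(f"extra allocation before the fix: {extra_before_fix:.1f} GB")
```

Under these assumptions, the 2.0 GB figure after the PR is about what the weights alone should occupy, and the roughly 1.4 GB difference would be the redundant allocation.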

github-actions · 2024-10-17T00:10:40Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

WoosukKwon · 2024-10-17T00:11:21Z

Thanks @JackCaoG for finding the bug and providing the solution.

JackCaoG · 2024-10-17T00:19:46Z

vllm/model_executor/utils.py

@@ -28,4 +29,22 @@ def set_weight_attrs(
    for key, value in weight_attrs.items():
        assert not hasattr(
            weight, key), (f"Overwriting existing tensor attribute: {key}")
+
+        # NOTE(woosuk): For TPU, param.data.copy_(weight) happens lazily,


To be more accurate, this is because in vLLM we do

narrowed_tensor = param.data.narrow(0, offset, len)
narrowed_tensor.copy_(real_weight)

narrowed_tensor and param.data share the same storage. With functionization, the in-place update on narrowed_tensor will lazily propagate to the base tensor, which is param.data.

Thanks for the elaboration. Fixed the comment!

JackCaoG · 2024-10-17T00:19:59Z

lgtm

mgoin

Thanks for referencing the CT issue, LGTM!

WoosukKwon added 2 commits October 16, 2024 23:57

[TPU] Ensure torch._sync(param) is called after param.data.copy_()

bb7c741

yapf

cf842bd

WoosukKwon added the tpu Related to Google TPUs label Oct 17, 2024

JackCaoG reviewed Oct 17, 2024

JackCaoG approved these changes Oct 17, 2024

This was referenced Oct 17, 2024

[Quantization][TPU] compressed-tensors integration for TPU #9301

Closed

[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA #9438

Merged

WoosukKwon changed the title ~~[TPU] Ensure torch._sync(param) is called after param.data.copy_()~~ [TPU] Call torch._sync(param) during weight loading Oct 17, 2024

Update comment

f5d8d91

mgoin approved these changes Oct 17, 2024

WoosukKwon merged commit 8e1cddc into main Oct 17, 2024
30 checks passed

WoosukKwon deleted the tpu-sync branch October 17, 2024 16:00

charlifu pushed a commit to charlifu/vllm that referenced this pull request Oct 23, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

48144b9

Signed-off-by: charlifu <[email protected]>

vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Oct 23, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

ee0a0bf

Signed-off-by: Vinay Damodaran <[email protected]>

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

c2ab3eb

Signed-off-by: Alvant <[email protected]>

garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

aabc4d1

Signed-off-by: Amit Garg <[email protected]>

FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

441ec16

Signed-off-by: qishuai <[email protected]>

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

7c3ddbb

Signed-off-by: Sumit Dubey <[email protected]>

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

485c652

mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

a09844b

Signed-off-by: Maxime Fournioux <[email protected]>

tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024

[TPU] Call torch._sync(param) during weight loading (vllm-project#9437)

6c72813

Signed-off-by: Tyler Michael Smith <[email protected]>
