This repository has been archived by the owner on Oct 19, 2024. It is now read-only.
[FEATURE] A CPU Swapping Runtime #694
The key point for swapping in XLA is that all parameters should be already in GPU when launching an XlaExecutable. To address this:
|
For the infra, you may want to leverage the latest XLA runtime effort.
We already implemented a similar solution in TF/PT with heuristic device/data placement and scheduling. It works well on certain hardware systems.
…On Fri, Sep 9, 2022 at 1:47 PM Yonghao Zhuang ***@***.***> wrote:
The key point for swapping in XLA is that all parameters must already be on the GPU when an XlaExecutable is launched. To address this:
- When the model is not very large, we can split it into more stages so that the parameters for each stage can be prepared before the stage starts.
- When the model is so large that even the parameters of a single transformer layer (or a similar layer) cannot fit in GPU memory at once, we can still split each operator into its own stage, but the auto-sharding pass will be inefficient. Instead, we can:
  - split each operator into a stage, but run auto-sharding over multiple stages. To avoid missing optimization opportunities such as fusion, we can split stages at the optimized HLO level rather than at the JAX level.
  - modify the HloModule: use custom calls to swap in parameters within the HloComputation, and replace all parameters with the outputs of those custom calls.
- When the model is so large that a single GeMM cannot fit in GPU memory, we need a hand-optimized GeMM kernel that computes on one sub-matrix while swapping in another. This kernel will replace the corresponding HloInstruction. Such a kernel also helps in cases that are not extremely memory-intensive, because it overlaps swapping with computation.
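
The last case (computing on one sub-matrix while the next is swapped in) can be illustrated with a toy double-buffering sketch. This is not the proposed hand-optimized kernel: it is plain Python, the names `swap_in`, `matmul_block`, and `swapped_gemm` are hypothetical, and a background thread stands in for an asynchronous host-to-device copy. It only shows the scheduling pattern, in which at most two blocks of `B` are resident at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def swap_in(host_blocks, k):
    """Hypothetical stand-in for a host-to-device copy of the k-th block of B."""
    return [row[:] for row in host_blocks[k]]


def matmul_block(a, b_block):
    """Multiply A (m x n) by one column block of B (n x w); returns an m x w block."""
    inner = len(b_block)
    width = len(b_block[0])
    return [
        [sum(a[i][t] * b_block[t][j] for t in range(inner)) for j in range(width)]
        for i in range(len(a))
    ]


def swapped_gemm(a, host_blocks):
    """Compute A @ B where B lives on the host as a list of column blocks.

    While block k is being multiplied, block k + 1 is copied in the background,
    so only two blocks need to be resident at any time.
    """
    out_blocks = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(swap_in, host_blocks, 0)
        for k in range(len(host_blocks)):
            current = pending.result()  # wait until block k has arrived
            if k + 1 < len(host_blocks):
                # start copying the next block before computing on this one
                pending = copier.submit(swap_in, host_blocks, k + 1)
            out_blocks.append(matmul_block(a, current))
    # concatenate the per-block results column-wise
    return [sum((blk[i] for blk in out_blocks), []) for i in range(len(a))]


a = [[1, 2], [3, 4]]
# B = [[5, 6, 7], [8, 9, 10]] stored as two column blocks
b_blocks = [[[5, 6], [8, 9]], [[7], [10]]]
print(swapped_gemm(a, b_blocks))
```

In a real kernel the copy and the compute would run on separate GPU streams; the thread here only mimics that overlap.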
|
Cpu Compute Runtime
|
I have some similar code in the tpu-support branch |
@ff7250 Sounds good! Could you give us some pointers to the code and usage? |
Hello, I am implementing CPU distributed collectives support for XLA via Gloo. Is there any overlap with this project? |
Background
To train or serve large models with limited GPU memory resources, we can utilize the huge amount of available CPU memory by swapping tensors between CPU and GPU. In this project, we are going to implement a swapping runtime for Alpa. We can start with the easiest case: swapping between 1 CPU and 1 GPU for serving. We can then move to more complicated cases: swapping between distributed CPUs and GPUs for training.
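
As a rough illustration of the easiest case (one CPU, one GPU, serving), the sketch below simulates stage-by-stage swapping under a fixed memory budget. It is a toy model, not Alpa code: `ToyDevice`, `serve`, and the element-count "capacity" are hypothetical, and Python lists stand in for tensors. Each stage's parameters are copied in just before the stage runs and dropped right after, so the total parameter size can exceed the device budget.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU with a fixed memory budget (in elements)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = {}  # parameter name -> copy held "on device"
        self.peak = 0       # highest number of resident elements seen so far

    def used(self):
        return sum(len(v) for v in self.resident.values())

    def swap_in(self, name, host_value):
        if self.used() + len(host_value) > self.capacity:
            raise MemoryError(f"cannot fit {name} on device")
        self.resident[name] = list(host_value)
        self.peak = max(self.peak, self.used())

    def swap_out(self, name):
        # Weights are read-only during serving, so nothing is copied back.
        del self.resident[name]


def serve(stages, host_params, device, x):
    """Run stages in order, keeping only the current stage's parameters resident."""
    for name, fn in stages:
        device.swap_in(name, host_params[name])
        x = fn(device.resident[name], x)
        device.swap_out(name)
    return x


host_params = {"scale": [2.0, 3.0], "shift": [1.0, 1.0]}
stages = [
    ("scale", lambda w, x: [wi * xi for wi, xi in zip(w, x)]),
    ("shift", lambda w, x: [wi + xi for wi, xi in zip(w, x)]),
]
device = ToyDevice(capacity=2)  # holds one stage's parameters, not both
print(serve(stages, host_params, device, [1.0, 2.0]), device.peak)
```

This version swaps synchronously; prefetching the next stage's parameters while the current stage runs would need a larger budget but would hide the copy latency.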
Todo
References
SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping
Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale