Initial FSDP Support for QLoRA Finetuning #970
Conversation
Peft tests
Thank you all for your fine work and for describing it thoroughly. We're very happy with the collaboration. After our talks over the last few days and an initial review of the code, the next step is to merge onto main, as that triggers our daily pre-release CI pipeline on the HF side, running all the HF integration tests and making sure that nothing breaks in Transformers, PEFT + Accelerate. (This is our workaround for not having our own GPU runners, so we're using a pipeline on the HF side, which doesn't yet include the BnB tests itself.)

Have you run the BnB test suite itself? Is everything looking good there so far? We have a few flaky tests that we still need to make more reproducible, but this is still important information. It would be good to paste the output here for review.

The procedure now is that we'll do a preliminary merge, get back to you tomorrow with the integration test results, and then speak about the next steps in our video call, including any potential improvements that we still might want to add, as well as Transformers integration, etc.

Thanks again for your contribution and good work! Really happy to move forward with this.
I ran the test suite with
Note: Previously I've tested `python -m bitsandbytes`
/home/paperspace/workdir/git/bitsandbytes/bitsandbytes/cuda_setup/main.py:108: UserWarning:
================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading CUDA version: BNB_CUDA_VERSION=123
================================================================================
warn((f'\n\n{"="*80}\n'
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/home/paperspace/miniconda3/pkgs/icu-73.1-h6a678d5_0/lib/libicudata.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libc10_cuda.so
/home/paperspace/miniconda3/lib/libicudata.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so
++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs/libcuda.so
+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++
/home/paperspace/workdir/git/bitsandbytes/bitsandbytes/libbitsandbytes_cuda123.so
++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
++++ /home/paperspace/local/cuda-12.3/lib64 CUDA PATHS +++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['8.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Running a quick check that:
+ library is importable
+ CUDA function is callable
WARNING: Please be sure to sanitize sensible info from any such env vars!
SUCCESS!
Installation was successful!
Further testing shows no issues using FSDP Mixed Precision with
Both times (HF libs from source vs pip, with BNB always from source) the same tests are failing in the same way.

This is the same failing test case - but with a different error! - that we got after merging @jph00's commit, which we had to then revert. See the error log of core_single_gpu-Jeremy's_fix.log for reference; I think this might hint that both failures are related to the prevention of "Params4Bit from quantizing already quantized params when transferring from CPU to GPU".

I'll try to look more into this during the day, but just wanted to share the necessary info with everyone asap.

cc @Sourab @younesbelkada @TimDettmers for visibility / collab in fixing this
x-posting @pacman100's message on Discord here for visibility:
amazing! will take a look from the FSDP side to see what can be improved
This PR adds initial FSDP support for training QLoRA models. It enables basic FSDP and CPU offload support, with low-memory training via FSDP's `sync_module_states` option unsupported.

This PR builds off of #840 commit 8278fca and BNB FSDP by @TimDettmers and @Titus-von-Koeller.
An example of using this PR to finetune QLoRA models with FSDP can be found in our demo script: fsdp_qlora.
Rationale
The primary blocker for FSDP QLoRA finetuning is the quantized storage type of uint8: FSDP can only shard floating-point data types. Additionally, when using CPU offloading, every time FSDP moves a `Linear4Bit` layer from CPU to GPU it quantizes the existing data again, even if the data is already quantized.
Changes Made
Selectable Quantization Storage
This PR adds a selectable quantization storage option `quant_storage` to `Linear4Bit` and `Params4Bit`. The quantization storage dtype defaults to `torch.uint8` for backward compatibility with existing code.

While selecting any floating-point storage type will allow FSDP to shard `Linear4Bit` layers, setting the quantization storage dtype to match the rest of the non-LoRA layers' dtype allows `Linear4Bit` layers to be wrapped identically to `Linear` layers in a LoRA wrapping policy, such as the `fsdp_auto_wrap_policy` from llama-recipes. If the quantization storage dtype does not match the rest of the layers' dtype, then the `Linear4Bit` layers will have to be wrapped individually.
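As a rough illustration, here is a minimal sketch of how the new option can be used, assuming the keyword is exposed as `quant_storage` on the layer constructor (spelled `bnb.nn.Linear4bit` in the library); exact names may differ from the final merged API.

```python
# Sketch: build a 4-bit layer whose packed weights are *stored* as bf16 so FSDP can
# shard them like any other bf16 parameter (assumes the quant_storage kwarg from this PR).
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(
    4096, 4096,
    bias=False,
    compute_dtype=torch.bfloat16,   # dtype used for the dequantized matmul
    quant_type="nf4",
    quant_storage=torch.bfloat16,   # storage dtype matches the non-LoRA bf16 layers
)

# Moving the layer to the GPU triggers quantization; with quant_storage=torch.bfloat16
# the packed weight tensor is bf16-typed rather than uint8, so an FSDP auto-wrap
# policy can treat this module the same way it treats a regular nn.Linear.
layer = layer.cuda()
```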
Prevent Multiple Quantization
The PR adds a quantization flag to prevent `Params4Bit` from quantizing already quantized params when transferring from CPU to GPU, for example when training with FSDP CPU offloading.
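A minimal sketch of the idea, not the PR's exact code: a boolean flag on the 4-bit parameter records whether its data has already been quantized, so repeated CPU-to-GPU moves (such as FSDP CPU offload) become plain device transfers. The flag name `bnb_quantized` and the class below are illustrative assumptions.

```python
# Sketch of the re-quantization guard; FlaggedParam4bit stands in for Params4Bit.
import torch

class FlaggedParam4bit(torch.nn.Parameter):
    def __new__(cls, data):
        self = super().__new__(cls, data, requires_grad=False)
        self.bnb_quantized = False  # assumed flag name: set once quantization has run
        return self

    def cuda(self, device=None):
        if self.bnb_quantized:
            # Already quantized (e.g. FSDP offload moving shards back to the GPU):
            # just move the packed data, do not quantize again.
            self.data = self.data.to(device if device is not None else "cuda")
            return self
        # First transfer to GPU: this is where quantization would happen (omitted),
        # after which the flag is set so later moves skip it.
        self.data = self.data.to(device if device is not None else "cuda")
        self.bnb_quantized = True
        return self
```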
Set Quant State during FSDP Forward
FSDP does not copy the `Params4Bit` QuantState dictionary when moving sharded layers. This PR sets the QuantState as a component of `Linear4Bit` and copies it to `Params4Bit` if it no longer exists in `Params4Bit`.
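A hedged sketch of this mechanism, not the PR's exact implementation: the module keeps its own reference to the weight's QuantState and copies it back onto the parameter during the forward pass when FSDP has re-materialized the parameter without it. The class and attribute names below are stand-ins.

```python
# Sketch: restore quant_state on the sharded parameter inside forward().
import torch
import torch.nn as nn

class Linear4bitSketch(nn.Module):
    def __init__(self, weight: nn.Parameter):
        super().__init__()
        self.weight = weight
        # Cache the quant state on the module; FSDP preserves module attributes even
        # when it flattens and rebuilds the underlying parameters.
        self.quant_state = getattr(weight, "quant_state", None)

    def forward(self, x):
        if getattr(self.weight, "quant_state", None) is None and self.quant_state is not None:
            # FSDP dropped the metadata when moving the sharded layer: copy it back.
            self.weight.quant_state = self.quant_state
        # ... the 4-bit dequantize-and-matmul using self.weight and its quant_state
        # would follow here; omitted in this sketch ...
        return x
```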
Testing
This PR adds `quant_storage` testing to the Linear4Bit tests and fixes an issue with the current tests where NF4 wasn't tested.

We also tested these changes against PEFT's QLoRA tests and did not find any regressions from the current bitsandbytes behavior.
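As a rough picture of the kind of `quant_storage` coverage described above, a hypothetical parametrized test might look like the sketch below (test names, dtype list, and assertions are assumptions, and it requires a CUDA device).

```python
# Hypothetical pytest sketch: round-trip a Linear4bit layer over several
# quant_storage dtypes and both quantization types.
import pytest
import torch
import bitsandbytes as bnb

@pytest.mark.parametrize("quant_type", ["fp4", "nf4"])
@pytest.mark.parametrize("quant_storage", [torch.uint8, torch.float16, torch.bfloat16, torch.float32])
def test_linear4bit_quant_storage(quant_type, quant_storage):
    layer = bnb.nn.Linear4bit(
        64, 64, bias=False,
        compute_dtype=torch.float16,
        quant_type=quant_type,
        quant_storage=quant_storage,
    ).cuda()
    # The packed 4-bit weight is expected to be stored in the requested dtype.
    assert layer.weight.dtype == quant_storage
    out = layer(torch.randn(2, 64, dtype=torch.float16, device="cuda"))
    assert out.shape == (2, 64)
```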
We have also tested FSDP Mixed Precision in fp32 and bf16 and noticed no changes in training behavior when setting the `Linear4Bit` and `Params4Bit` `quant_storage` dtype to match the FSDP `MixedPrecision.param_dtype`.

We have successfully finetuned Llama-2, Mistral, and TinyLlama models with FSDP & QLoRA using our demo script.
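For reference, matching the storage dtype to FSDP's mixed-precision param dtype looks roughly like the following; this is a sketch of the standard `torch.distributed.fsdp.MixedPrecision` configuration, not code from this PR.

```python
# Sketch: keep quant_storage and MixedPrecision.param_dtype in agreement (bf16 here).
import torch
from torch.distributed.fsdp import MixedPrecision

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
# Pair this with quant_storage=torch.bfloat16 on every Linear4bit so quantized and
# unquantized parameters share a dtype inside the same FSDP flat parameter.
```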
Downstream Implications
Existing implementations may require some modification to work with this, for example:

- `load_in_4bit` will need a way to set `quant_storage` (the demo script uses custom model loading)
- `prepare_model_for_kbit_training` upcasts all non-uint8 params to float32 under the assumption that the base (quantized) weights are stored in uint8, which is now no longer guaranteed
- `get_nb_trainable_parameters` multiplies the number of parameters from `Params4Bit` by two, which is only valid if `quant_storage` is uint8 (see the counting sketch after this list)
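A small sketch of the counting correction implied by the last bullet; `packed_4bit_numel` is a hypothetical helper, not an existing PEFT or bitsandbytes function.

```python
# Each storage element holds element_size() bytes = element_size() * 2 packed 4-bit
# values, so uint8 storage doubles the count while bf16 storage quadruples it.
import torch

def packed_4bit_numel(storage_tensor: torch.Tensor) -> int:
    return storage_tensor.numel() * storage_tensor.element_size() * 2
```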
Future Work
Currently, QLoRA finetuning using FSDP's low-memory loading via the `sync_module_states` option doesn't work. Enabling this will require a future PR.
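For context, the low-memory pattern in question is the usual FSDP rank-0 broadcast initialization, sketched below with standard PyTorch APIs (not code from this PR); per the note above, this path does not yet work with QLoRA.

```python
# Sketch: rank 0 holds real weights, other ranks allocate empty shells, and
# sync_module_states=True broadcasts rank 0's weights during FSDP wrapping.
# Must be run under torchrun with a process group already initialized.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_low_memory(model: torch.nn.Module) -> FSDP:
    rank = dist.get_rank()
    return FSDP(
        model,
        device_id=torch.cuda.current_device(),
        sync_module_states=True,  # broadcast rank 0's (already loaded) weights
        param_init_fn=None if rank == 0 else (
            lambda module: module.to_empty(device=torch.device("cuda"), recurse=False)
        ),
    )
```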