
Anyone try fine-tuning 13B model? #28

Open
gururise opened this issue Mar 16, 2023 · 28 comments

Comments

@gururise
Contributor

Training the 7B model takes about 18GB of VRAM.

I tried training the 13B model and ran out of VRAM on my 24GB card. I suspect it will need at least 32GB of VRAM.

Has anyone else been successful with fine-tuning a 13B model?

@tloen
Owner

tloen commented Mar 16, 2023

#14 (comment)

@ItsLogic what's your experience been like?

@ItsLogic

ItsLogic commented Mar 16, 2023

It has been quite good, although understandably slow to train. I have a 4090, so a 24GB card is enough. The only thing you need to change to make it work is MICRO_BATCH_SIZE, which I have set to 2 (see the sketch at the end of this comment).
The whole time I was training I was within a few hundred MB of an OOM, so you might need to close a few background tasks when you decide to train.
Training time is ~10 hours for the full three epochs. I trained a single epoch (406 steps) in 3 hours 15 minutes and got these results on 13B:

13B with LoRA:

[screenshots of sample generations]

13B normal:

[screenshot of sample generation]

Just a heads up: the provided export_state_dict_checkpoint.py has its parameters set for 7B, so you will need to change them to match the 13B params before you can use it.
It also only outputs one file at the end, but the LLaMA-to-HF conversion script works fine as long as you change the 13B shard count to 1, if you plan on using transformers.
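
For reference, a minimal sketch of the batch-size change mentioned above, assuming the constants that sat near the top of finetune.py at the time (names and defaults may differ in your checkout):

```python
# Smaller per-device micro-batch so 13B fits in 24GB of VRAM; gradient
# accumulation keeps the effective batch size unchanged. Names/values are
# from memory of that version of finetune.py and may differ in yours.
MICRO_BATCH_SIZE = 2
BATCH_SIZE = 128
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
```

And a hedged sketch of the export change: the script hard-codes the 7B shapes (dim=4096, n_heads=32, n_layers=32), which need to be swapped for the standard LLaMA-13B values; the exact variable names in export_state_dict_checkpoint.py may differ:

```python
# Standard LLaMA-13B architecture values (the 7B script ships with 4096/32/32).
params = {
    "dim": 5120,
    "n_heads": 40,
    "n_layers": 40,
}
# The LLaMA-to-HF conversion script normally expects 13B weights split across
# two shards, so set its 13B shard count to 1 to match the single checkpoint
# file this script writes (as noted above).
```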

@baleksey

@ItsLogic Great results! Can you share how you set it up as a chat with memorized history? Do you include the whole chat history in the "### Input:" field with the next prompt, or what is the idea? It would be great if you could share your chat logic code!

Btw, how long a chat history can it handle with decent memorization and the ability to recover info from previous messages?

@ItsLogic

@ItsLogic Great results! Can you share how you set it up as a chat with memorized history? Do you include the whole chat history in the "### Input:" field with the next prompt, or what is the idea? It would be great if you could share your chat logic code!

I'm just using https://github.com/oobabooga/text-generation-webui with the --cai-chat launch arg as my GUI, and it handles everything.

Btw, how long a chat history can it handle with decent memorization and the ability to recover info from previous messages?

Honestly not sure. I haven't had any super long conversations, so I can't speak for its memory.

@shawnanastasio

I'm just using https://github.com/oobabooga/text-generation-webui with the --cai-chat launch arg as my GUI, and it handles everything.

It's my understanding that you have to format your prompt in a specific way for this model, as is done here. I don't think text-generation-webui does that (yet).
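
For reference, this is roughly the Alpaca prompt template the LoRA was trained against (paraphrased from the repo's generate.py; the exact wording in your checkout may differ slightly):

```python
# Alpaca-style prompt builder: instruction with an optional input field.
def build_prompt(instruction: str, input_text: str = "") -> str:
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )
```

A chat UI that doesn't wrap turns in this template will still generate text, but outputs tend to be worse than with the format the model was tuned on.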

@baleksey

baleksey commented Mar 17, 2023

@tloen @ItsLogic Guys, do you remember what the minimum loss was when you stopped training?
And do you know (from papers or experience) what loss is considered good for this kind of model/training goal?

Fine-tuning my 13B at the moment and guessing when it's "enough" :)

@0xbitches

@baleksey Trained a 13B LoRA for one epoch as well; my lowest loss was around 0.75. Text-gen results don't feel much different from 7B to me, though.

@devilismyfriend

Crazy to think we can even fine-tune the 13B on a 4090. Next step is to train the 33B and convert it to int4 :D

@ItsLogic

@ItsLogic Guys, do you remember what the minimum loss was when you stopped training?

I hovered around the low 0.8s and high 0.7s from about 20% of the way through the first epoch. It ended just above 0.8.
I just finished epoch 3 and got 0.78. It took 9 hours and 20 minutes.

@baleksey

baleksey commented Mar 17, 2023

Thanks! Just finished my 13B as well, ending at 0.79.

Tried to run fine-tuning of the 33B on an A6000. It uses 48.4GB out of 49GB and says that 1 epoch will take about 12 hours. Maybe I'll do it later.

@mastoca

mastoca commented Mar 17, 2023

You may need to also add max_memory={0: 18} to LlamaForCausalLM.from_pretrained if you run into OOM errors when fine-tuning the 13B model.
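
A hedged sketch of what that might look like, using the string form of max_memory that transformers/accelerate document (the bare int shorthand above may or may not behave the same depending on version); the checkpoint name and limits are illustrative:

```python
import torch
from transformers import LlamaForCausalLM

# Cap GPU 0 at ~18GiB and let accelerate offload the rest to CPU.
# device_map is required for max_memory to take effect.
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",      # illustrative 13B checkpoint
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "18GiB", "cpu": "48GiB"},
)
```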

@justinh-rahb

justinh-rahb commented Mar 17, 2023

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

@ItsLogic

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

I uploaded my epoch 1 and epoch 3 LoRAs:
https://huggingface.co/Draff/llama-alpaca-stuff/tree/main/Alpaca-Loras

@justinh-rahb

justinh-rahb commented Mar 17, 2023

@ItsLogic Forgive my ignorance, but how do I download this? from_pretrained() doesn't support subdirectories.

@ItsLogic

@ItsLogic Forgive my ignorance, but how do I download this? from_pretrained() doesn't support subdirectories.

Just download adapter_config.json and adapter_model.bin manually, put them in a folder, and then edit the path in generate.py to point to that folder.
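
A minimal sketch of doing that programmatically with huggingface_hub (the subfolder name here is a guess based on the link above, so adjust it to the actual layout of the repo):

```python
from huggingface_hub import hf_hub_download

# Download the two adapter files; both end up in the same local snapshot
# directory, which is the folder to point generate.py's LoRA path at.
paths = [
    hf_hub_download(
        repo_id="Draff/llama-alpaca-stuff",
        filename=fname,
        subfolder="Alpaca-Loras/13b-epoch-3",  # hypothetical subfolder name
    )
    for fname in ("adapter_config.json", "adapter_model.bin")
]
print(paths)  # the parent directory of these files is what generate.py needs
```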

@gururise
Contributor Author

I uploaded my epoch 1 and epoch 3 LoRAs

Any noticeable difference between epoch 1 and epoch 3? I'm going to train a model on my cleaned up dataset. There are a lot of issues with the current dataset.

@ItsLogic

Any noticeable difference between epoch 1 and epoch 3?

Haven't done any in-depth testing yet, but from my usage so far they feel about the same.

@0xbitches

Can report the same as @ItsLogic: e1 and e3 feel roughly the same, probably because the loss is ~0.78 for both.

Somewhat related: when I was trying the model out with textgen-webui, the outputs were incredibly short. I don't know if it's a problem with the webui or the model itself.

@gururise
Contributor Author

gururise commented Mar 17, 2023

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

Not so sure the 13B model is gonna perform much better than the 7B right now; the Stanford dataset has a ton of issues. I've been going through trying to fix them.

I've done a first best-effort pass to resolve the issues, and I'm training a new 7B model right now, but my GPU is a potato, so I won't have anything to show until tomorrow.

@rohvani

rohvani commented Mar 17, 2023

Any hero out there got Alpaca-13B distributed yet? For those of us that lack a 24GB GPU 😓

Not so sure the 13B model is gonna perform much better than the 7B right now; the Stanford dataset has a ton of issues. I've been going through trying to fix them.

I've done a first best-effort pass to resolve the issues, and I'm training a new 7B model right now, but my GPU is a potato, so I won't have anything to show until tomorrow.

@gururise Could you please elaborate on the issues you are seeing with the Stanford Alpaca dataset?

Ahh, never mind -- I just saw this: #32

@gvijqb

gvijqb commented Mar 28, 2023

Hi @gururise

but my GPU is a potato

I'm a co-founder of qblocks.cloud. We would love to offer you some GPU credits to help with your research and experimentation on alpaca / lora. Can we connect somehow?

@gururise
Contributor Author

gururise commented Mar 28, 2023

Hi @gururise

but my GPU is a potato

I'm a co-founder of qblocks.cloud. We would love to offer you some GPU credits to help with your research and experimentation on alpaca / lora. Can we connect somehow?

Would love to take you up on your offer of GPU credits to generate some fine-tuned Alpaca models using my cleaned dataset. I've sent you an email.

@kizunasunhy

kizunasunhy commented Apr 7, 2023

@tloen @ItsLogic Guys, do you remember what the minimum loss was when you stopped training? And do you know (from papers or experience) what loss is considered good for this kind of model/training goal?

Fine-tuning my 13B at the moment and guessing when it's "enough" :)

[Loss curve figure for the 13B run]
This is my loss curve for the 13B model. All parameters are the same as for the 7B model. Training time was 7.5 hours on a 3090. Wondering if others see the same.

@mirekphd

mirekphd commented May 1, 2023

Training time was 7.5 hours on a 3090.

I suppose the Kaggle grandmasters' lesson of using 5-fold cross-validation even for DL models is out of the window then :)

@pGit1

pGit1 commented May 24, 2023

@ItsLogic Can you show what your trainer args and hyperparams are for the 13B training run? My models seem to take WAY longer than 10 hours to train on a 3090 Ti. Like a couple of days at a time. :(

@pGit1

pGit1 commented May 24, 2023

@ItsLogic Never mind. The longer training time definitely stemmed from the cutoff length going from 256 to 512.
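
For anyone comparing run times, this is the knob in question, sketched as the constant it was in finetune.py at the time (later revisions expose it as a cutoff_len argument; names may differ in your checkout):

```python
# Max token length each training example is truncated/padded to.
# Going from 256 to 512 roughly doubles per-step compute and activation
# memory, which is why the run above took so much longer.
CUTOFF_LEN = 256
```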

@pmudgal-Intel

@ItsLogic I get the error below while fine-tuning 13B on two 24GB Quadros. Setting micro_batch_size to 2 does not help either. What are your training params? The same code works fine for distributed fine-tuning of the 7B model, so I suspect I would have to reduce something, but I'm not sure what.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.65 GiB total capacity; 21.88 GiB already allocated; 8.31 MiB free; 22.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "finetune.py", line 317, in
fire.Fire(train)
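
Not a fix for the allocation itself, but for reference, a minimal sketch of the allocator hint the error message suggests (the split size is just an example value):

```python
import os

# Must be set before torch initializes CUDA; alternatively export
# PYTORCH_CUDA_ALLOC_CONF in the shell before launching finetune.py.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the env var so the setting takes effect
```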

@pmudgal-Intel

Update: bitsandbytes==0.37.2 solved the problem. I had bitsandbytes==0.39.0 previously.
