LORA fine-tuning with openlm-research/open_llama_7b as a plugin replacement for decapoda-research/llama-7b-hf #63
Comments
It's not just the tokenizer config. The tokenizer vocabs seem to be very different. See #40 and https://github.com/openlm-research/open_llama/discussions/41. Therefore, any direct comparison of loss values needs to be taken with a grain of salt. |
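One way to make losses from two different vocabularies more comparable is to normalise by characters instead of tokens, since the same text splits into a different number of tokens under each tokenizer. A minimal sketch, assuming you know the mean per-token loss and the token count each tokenizer produces for the same text. The function name and the numbers are hypothetical and purely illustrative, not measurements from either model:

```python
def loss_per_char(mean_token_loss: float, n_tokens: int, n_chars: int) -> float:
    """Convert a mean per-token loss (in nats) into nats per character."""
    # Total negative log-likelihood of the whole text under the model.
    total_nll = mean_token_loss * n_tokens
    return total_nll / n_chars


# Hypothetical numbers: the same 1000-character text under two tokenizers.
n_chars = 1000
model_a = loss_per_char(mean_token_loss=2.0, n_tokens=250, n_chars=n_chars)
model_b = loss_per_char(mean_token_loss=1.6, n_tokens=300, n_chars=n_chars)
print(f"A: {model_a:.3f} nats/char, B: {model_b:.3f} nats/char")
```

In this made-up case the per-token losses differ by 25% while the per-character losses differ by only about 4%, which is why a raw per-token comparison across vocabularies can mislead.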
Would that imply that if the normalised train and eval losses are approximately similar between models, the quality of fine-tuning is similar too? |
Looking at the perplexity scores in the discussion in #41, you have ppl=6.5 (loss=1.87) for Open and ppl=5.4 (loss=1.68) for Meta's. That works out to a difference of around 11% in loss values, which is very similar to what you report here in your fine-tuning experiments. |
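The conversion used above is just loss = ln(perplexity), assuming the loss is the mean cross-entropy in nats. A short sketch that reproduces the rough 11% figure from the two rounded perplexities quoted in this thread (the variable names are my own):

```python
import math


def ppl_to_loss(ppl: float) -> float:
    """Perplexity is exp(mean cross-entropy), so the loss is its natural log."""
    return math.log(ppl)


open_loss = ppl_to_loss(6.5)  # Open Llama perplexity quoted above
meta_loss = ppl_to_loss(5.4)  # Meta Llama perplexity quoted above

# Relative gap between the two losses.
rel_diff = open_loss / meta_loss - 1
print(f"open={open_loss:.2f} meta={meta_loss:.2f} diff={rel_diff:.1%}")
```

Because the perplexities are rounded to one decimal place, the second decimal of the recovered losses is not reliable.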
So maybe the tokenizer is simply 10% worse across a wide range of semi-uncurated text such as I am using for fine-tuning or perplexity, but for many custom hand-curated evaluation sets the tokenizer isn't having much effect, e.g. #40 is unlikely to affect scores when using cleanly formatted single-spaced evaluation data. |
I'm also having a hard time reproducing models that use the Facebook Llama as the base model. Using a slightly modified version of the Qlora code and the dataset here, I can't get results comparable to this. Has anyone been able to successfully use OpenLLama as a drop-in replacement for Llama yet? It seems like tokenization is a big problem, as I get noisy/repetitive outputs, as if the model has a hard time generating stop tokens correctly. |
What are your losses looking like when fine-tuning FB llama versus Open Llama base models? Below are my normalised loss plots. Data starts at step 500 to remove outliers. The FB model train and eval losses are normalised by subtracting the mean and dividing by the standard deviation. The two Open Llama train and eval losses are grouped and normalised as one data set, as they should be an apples-to-apples loss comparison both pre- and post-normalisation. 1. Default FB llama 7B model, default tokenizer from HF, and baize data set. Model is underfitting the data. |
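The normalisation described above can be sketched roughly as follows. This is my reading of it, not the actual plotting code: the helper names are hypothetical, the loss values are invented, and I am assuming the population standard deviation is used:

```python
from statistics import mean, pstdev


def drop_warmup(points, first_step=500):
    """Keep only the losses logged at or after first_step."""
    return [loss for step, loss in points if step >= first_step]


def zscore(values, mu=None, sigma=None):
    """Subtract the mean and divide by the standard deviation."""
    mu = mean(values) if mu is None else mu
    sigma = pstdev(values) if sigma is None else sigma
    return [(v - mu) / sigma for v in values]


def zscore_grouped(*series):
    """Normalise several series with one shared mean and std (pooled)."""
    pooled = [v for s in series for v in s]
    mu, sigma = mean(pooled), pstdev(pooled)
    return [zscore(s, mu, sigma) for s in series]


# Invented (step, loss) pairs, for illustration only.
fb_train = drop_warmup([(400, 3.0), (500, 1.2), (520, 1.1), (540, 1.0)])
fb_z = zscore(fb_train)

# Two Open Llama series normalised together so they stay comparable.
open_train_z, open_eval_z = zscore_grouped([1.5, 1.4, 1.3], [1.6, 1.5, 1.4])
```

Pooling keeps the offset between the two Open Llama curves intact, whereas normalising each series separately would centre both on zero and hide it.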
Sorry I don't have evaluation or loss curves to share, but generally OpenLLama models for me have repetition problems, repeating the same or very similar sequence until it reaches the token limit. How do you configure your tokenizer in run 2? I'd like to try with the tokenizer initialization you mentioned. |
The If you can attach any I tried qlora with EDIT: Fixed some typos above and should add that I am evaluating every 20 steps as it helps in better plotting given the noise in training loss, at the cost of about 20% increase in run time. |
Thanks! Here's a trainer_state.json from a ~4-epoch training run on a length-filtered dataset, with any input sequences of >768 tokens filtered out. I'll start one with https://huggingface.co/decapoda-research/llama-13b-hf today! |
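For anyone wanting to plot from a trainer_state.json themselves, a minimal sketch of pulling the train and eval losses out of it. It assumes the usual Hugging Face Trainer layout, where log_history is a list of dicts carrying "step" plus either "loss" or "eval_loss". The function name and the inline sample values are made up:

```python
import json


def split_losses(trainer_state: dict):
    """Return ([(step, train_loss)], [(step, eval_loss)]) from a trainer state."""
    train, evals = [], []
    for entry in trainer_state.get("log_history", []):
        if "loss" in entry:
            train.append((entry["step"], entry["loss"]))
        if "eval_loss" in entry:
            evals.append((entry["step"], entry["eval_loss"]))
    return train, evals


# Invented sample; in practice: state = json.load(open("trainer_state.json"))
sample = json.loads("""
{"log_history": [
  {"step": 10, "loss": 1.9, "epoch": 0.01},
  {"step": 20, "loss": 1.7, "epoch": 0.02},
  {"step": 20, "eval_loss": 1.8, "epoch": 0.02}
]}
""")

train_points, eval_points = split_losses(sample)
print(train_points, eval_points)
```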
Ideally I need the Here's how I'm running
I notice that decapoda-research/llama-13b-hf/blob/main/config.json had a lot more defines in it so you might have better luck than I did with the llama-7b. |
I've rerun that run with eval_steps = 20. I'm currently running using https://huggingface.co/huggyllama/llama-13b since I ran into the same CUDA errors as you mentioned; this should be done in a few hours. Also, what would you suggest doing with these lines in the QLORA code when training OpenLLama?

    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name_or_path,
        # cache_dir=args.cache_dir,
        padding_side="right",
        use_fast=False, # Fast tokenizer giving issues.
        tokenizer_type='llama' if 'llama' in args.model_name_or_path else None, # Needed for HF name change
        use_auth_token=args.use_auth_token,
    )
    if tokenizer._pad_token is None:
        smart_tokenizer_and_embedding_resize(
            special_tokens_dict=dict(pad_token=DEFAULT_PAD_TOKEN),
            tokenizer=tokenizer,
            model=model,
        )
    if 'llama' in args.model_name_or_path or isinstance(tokenizer, LlamaTokenizer):
        # LLaMA tokenizer may not have correct special tokens set.
        # Check and add them if missing to prevent them from being parsed into different tokens.
        # Note that these are present in the vocabulary.
        # Note also that `model.config.pad_token_id` is 0 which corresponds to `<unk>` token.
        print('Adding special tokens.')
        tokenizer.add_special_tokens({
            "eos_token": tokenizer.convert_ids_to_tokens(model.config.eos_token_id),
            "bos_token": tokenizer.convert_ids_to_tokens(model.config.bos_token_id),
            "unk_token": tokenizer.convert_ids_to_tokens(
                model.config.pad_token_id if model.config.pad_token_id != -1 else tokenizer.pad_token_id
            ),
        })

I've left them as is for the first run, as well as for the FB Llama run. |
Using the defaults for OpenLlama 7B was a bit of an overfitting disaster. See below, but note it is not an apples-to-apples comparison. I needed to revert to 4-bit QLora as I got a CUDA out-of-memory error with 8-bit on my 24GB GPU, which is extremely frustrating. The data set is different (OAssist) and the method is 4-bit Qlora instead of 8-bit Baize Lora. Since the datasets differ in size, the ratio of epochs to steps also differs, so I plotted per step and added the epoch in black text. |
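To line up runs on datasets of different sizes, the step-to-epoch conversion can be sketched like this. The function names, batch settings and dataset sizes are hypothetical, and the ceiling is an approximation; a given trainer may round the number of steps per epoch differently:

```python
import math


def steps_per_epoch(n_examples, per_device_batch, grad_accum=1, n_devices=1):
    """Approximate optimizer steps needed for one pass over the data."""
    effective_batch = per_device_batch * grad_accum * n_devices
    return math.ceil(n_examples / effective_batch)


def step_to_epoch(step, n_examples, per_device_batch, grad_accum=1, n_devices=1):
    """Fractional epoch reached after `step` optimizer steps."""
    return step / steps_per_epoch(n_examples, per_device_batch, grad_accum, n_devices)


# Hypothetical sizes: the same step count lands at different epochs.
small = step_to_epoch(1000, n_examples=10_000, per_device_batch=16)
large = step_to_epoch(1000, n_examples=50_000, per_device_batch=16)
print(f"small dataset: epoch {small:.2f}, large dataset: epoch {large:.2f}")
```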
EDIT: Misunderstood what model you used: Your data is
Looks like a Qlora issue? |
Adding similar /huggyllama/llama-13b data to the above plot will confirm it is Qlora, and not Open Llama. |
Using FB Llama I get the same overfitting issues. The default LORA R value is 64 for QLORA. I've been running with r=32 but maybe this is causing the overfitting. The LORA paper (IIRC) uses r=2? What rank value are you using to finetune? Edit: LORA paper uses r=4, baize uses r=8. I'm trying a run with r=4 and will update in a few hours. |
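For a sense of how much capacity each rank adds: a LoRA adapter on a d_in x d_out weight matrix adds r * (d_in + d_out) trainable parameters. A rough sketch, assuming a 7B-sized model with hidden size 4096 and 32 layers, and adapters on only two square attention projections per layer. That last choice is my assumption; implementations that adapt more modules will have more trainable parameters:

```python
def lora_param_count(r: int, d_in: int, d_out: int, n_matrices: int) -> int:
    """Trainable parameters added by rank-r LoRA on n_matrices weight matrices."""
    # Each adapter is two low-rank factors: (d_in x r) and (r x d_out).
    return r * (d_in + d_out) * n_matrices


hidden, layers = 4096, 32      # assumed 7B-sized configuration
adapted_per_layer = 2          # assumption: two attention projections per layer

counts = {
    r: lora_param_count(r, hidden, hidden, adapted_per_layer * layers)
    for r in (4, 8, 32, 64)
}
for r, n in counts.items():
    print(f"r={r:>2}: {n:,} trainable parameters")
```

The count scales linearly with r, so going from r=64 down to r=4 cuts the adapter size by 16x under these assumptions.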
Still overfitting with r=4. Are you using Baize with input/output data format? Would you be able to share the changes you've made to the baize code? It seems you are probably correct about it being a qlora issue as the divergence seems consistent. |
Also my dataset is about 27k sequences after filtering out inputs with >768 tokens. I did this because of memory constraints on QLORA, but it seems that Baize is actually using less memory per GPU (with DDP via torchrun) in 8-bit than QLORA in 4-bit, which seems odd to me. |
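The length filter mentioned above amounts to something like the following sketch. The function name is hypothetical, and the whitespace split is only a stand-in so the example runs without downloading a model; in a real run you would pass the model's own tokenizer:

```python
from typing import Callable, Iterable, List, Sequence


def filter_by_token_length(
    texts: Iterable[str],
    tokenize: Callable[[str], Sequence],
    max_tokens: int = 768,
) -> List[str]:
    """Keep only texts whose tokenized length is at most max_tokens."""
    return [t for t in texts if len(tokenize(t)) <= max_tokens]


# Stand-in tokenizer (assumption): split on whitespace.
whitespace_tokenize = str.split

examples = ["short prompt", "a much longer prompt with many more words in it"]
kept = filter_by_token_length(examples, whitespace_tokenize, max_tokens=5)
print(kept)
```

Since the two models tokenize differently, the same text can fall on different sides of the 768-token cutoff depending on which tokenizer does the counting.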
Unscaled, below. I assume the last run was with Open Llama 13B? The plots look suspiciously similar, but I confirmed I didn't accidentally duplicate one of your
Here's my fork: gjmulder/baize-chatbot. I haven't changed much from the Baize defaults, except added a
My checkpoints are then associated with a git commit, and I can easily revert to any model run that in hindsight was my best to date. This all came about as the original Alpaca Lora code caused a bug with WandB. I've found it a lot more flexible than WandB as I can code any plot and comparison on the fly. The R code is written in such a way that it continually finds the latest checkpoint per run, so as the run proceeds I can semi-interactively see how it is performing relative to prior runs.
Baize uses
The original Alpaca Lora code memory usage was always stable. Likewise with Baize. I've tried a few implementations of Qlora, including and I keep on seeing memory leaks when the checkpoints are being written. Another reason to checkpoint often, as that means I get CUDA memory errors that much earlier. There's a script |
Thank you for this. I was running into the memory issues while checkpointing this weekend. I'm going to try implementing FSDP instead of DDP and then continue training! |
@eschaffn Added you to my Baize repo fork if you'd like to collaborate. I looked at using the WizardLM uncensored data set with Alpaca Lora, but after reviewing the code decided to find a better implementation. So far Baize looks to be a cleaner code base and doesn't OOM, but it had a lot of hyperparams hard coded, hence the refactoring I've done. If you don't code in R I can get ChatGPT-4 to translate my R code to the Python equivalent for our analysis. Or move to WandB, which is likely the better solution long term. |
Sure, I'm taking a bit of a break from the training runs, maybe a couple of days, but I'd be happy to collaborate! I can send trainer_state.json files to you manually but cannot hook up my machine to wandb. |
@eschaffn feel free to branch and push your |
Hi. Thanks for the open sourced models! This is a major step forward to the democratization of LLMs.
I'm trying to fine tune openlm-research/open_llama_7b using LORA. I first tried the code and data at alpaca-lora but was getting evaluation losses about 10% higher than decapoda-research/llama-7b-hf.
Given that alpaca-lora was a very early attempt at LORA training I then tried the code and data at baize-chatbot. However, I'm still getting evaluation losses about 10% higher than decapoda-research/llama-7b-hf.
Assuming both models are approximately equivalent in terms of generative ability, I am wondering if it is the tokenizer? I am using the flag use_fast=False, but I notice there are additional significant differences in the tokenizer_config.json:
decapoda-research/llama-7b-hf:
openlm-research/open_llama_7b: