[BUG] Cannot Reproduce H2O Prediction Output #450

Closed
diogobragaswogo opened this issue Oct 16, 2023 · 9 comments
Labels
type/bug Bug in code

Comments

@diogobragaswogo

🐛 Bug

I trained a model based on circulus/Llama-2-7b-orca-v1 and exported:

  • Model
  • Validation predictions
  • Logs

However, I'm currently having trouble reproducing the exact output in the validation predictions CSV. Specifically, the model uses greedy search, and I've tried loading it in the following ways:

  1. Using vLLM
  2. Using HF TGI
  3. Using the sample code of the model card.

vLLM and HF TGI give the same output when running with greedy search, but it differs from the one in the prediction file.
The sample code from the model card, also run with greedy search, gives an output that differs from vLLM, HF TGI, and the prediction file, although it comes closer to the file than vLLM and HF TGI do.

Considering the above, I'm not sure if I'm loading the model correctly (and passing the intended generation params), or if there is an issue. The prediction params are as follows:

prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 1024
    metric: BLEU
    metric_gpt_model: gpt-3.5-turbo-0301
    min_length_inference: 2
    num_beams: 1
    num_history: 4
    repetition_penalty: 1.2
    stop_tokens: ''
    temperature: 0.3
    top_k: 0
    top_p: 1.0
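
For comparison purposes, one way to carry these settings over to a Hugging Face `generate` call might look like the sketch below. The name mapping (for example `max_length_inference` to `max_new_tokens`) and the helper `to_generate_kwargs` are assumptions made for illustration, not something confirmed from the LLM Studio code. The point is that with `do_sample: false` the sampling knobs (`temperature`, `top_k`, `top_p`) do not take part in decoding, while `repetition_penalty` still does, so it has to be passed to every backend being compared.

```python
# Hypothetical helper: map the `prediction` settings to generate() kwargs.
# ASSUMPTION: the key-to-argument mapping is a guess for illustration only.
def to_generate_kwargs(prediction: dict) -> dict:
    kwargs = {
        "do_sample": prediction["do_sample"],
        "num_beams": prediction["num_beams"],
        "max_new_tokens": prediction["max_length_inference"],
        "min_new_tokens": prediction["min_length_inference"],
        # Applied as a logits processor, so it also affects greedy search.
        "repetition_penalty": prediction["repetition_penalty"],
    }
    # Sampling parameters only matter when sampling is enabled.
    if prediction["do_sample"]:
        kwargs["temperature"] = prediction["temperature"]
        kwargs["top_p"] = prediction["top_p"]
        if prediction["top_k"] > 0:
            kwargs["top_k"] = prediction["top_k"]
    return kwargs


prediction = {
    "do_sample": False,
    "max_length_inference": 1024,
    "min_length_inference": 2,
    "num_beams": 1,
    "repetition_penalty": 1.2,
    "temperature": 0.3,
    "top_k": 0,
    "top_p": 1.0,
}

generate_kwargs = to_generate_kwargs(prediction)
print(generate_kwargs)
```

The resulting dictionary could then be unpacked into `model.generate(**inputs, **generate_kwargs)`; the equivalent parameters would need to be set separately for vLLM and TGI.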

To Reproduce

Just train the circulus/Llama-2-7b-orca-v1 (or another model) on custom data and check whether the validation prediction data can be reproduced (i.e. the outputs) using greedy search.

LLM Studio version

  • Nightly version of LLM Studio (as obtained by running docker run ... gcr.io/vorvan/h2oai/h2o-llmstudio:nightly).
@diogobragaswogo diogobragaswogo added the type/bug Bug in code label Oct 16, 2023
@maxjeblick
Contributor

maxjeblick commented Oct 16, 2023

Thank you for the detailed description! Would it be possible for you to share the training configuration YAML file?

Regarding the discrepancies between validation predictions and hosted inference using vLLM and HF TGI, one or more of the following could be a potential explanation:

  • LORA: If LORA is merged after training, the resulting model may produce slightly different logits.
  • Quantization/Model Data Type. This may be different during training and inference. Note that we also have additional logic to cast certain params to fp32 for model stability during training.
  • Implementation Differences between HF and vLLM/TGI: An issue related to this has been documented here.

Additionally, we've observed discrepancies in validation and chat outputs attributed to mismatches in tokenizer configurations, as detailed here.
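
To illustrate the LORA point, here is a minimal toy sketch (pure Python, no real model) of why merging adapter weights can shift the outputs slightly. It assumes the usual LoRA formulation with scaling `alpha / r` and assumes the merged weights are stored in fp16; the sizes and random values are made up for illustration.

```python
import random
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


rng = random.Random(0)
d, r, alpha = 8, 4, 16  # toy sizes, chosen only for illustration
scale = alpha / r

# Base weights stored in fp16; adapter matrices A (d x r) and B (r x d).
W = [[to_fp16(rng.uniform(-0.05, 0.05)) for _ in range(d)] for _ in range(d)]
A = [[rng.uniform(-0.05, 0.05) for _ in range(r)] for _ in range(d)]
B = [[rng.uniform(-0.05, 0.05) for _ in range(d)] for _ in range(r)]
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]

# Adapter update: delta = scale * A @ B
delta = [
    [scale * sum(A[i][k] * B[k][j] for k in range(r)) for j in range(d)]
    for i in range(d)
]

# Unmerged path: base and adapter contributions are added at full precision.
unmerged = [sum(x[i] * (W[i][j] + delta[i][j]) for i in range(d)) for j in range(d)]

# Merged path: the summed weights are rounded back to fp16 before use.
merged_W = [[to_fp16(W[i][j] + delta[i][j]) for j in range(d)] for i in range(d)]
merged = [sum(x[i] * merged_W[i][j] for i in range(d)) for j in range(d)]

max_diff = max(abs(u - m) for u, m in zip(unmerged, merged))
print(f"max |unmerged - merged| = {max_diff:.2e}")
```

The difference is tiny per layer, but it is not zero, and it accumulates across layers and decoding steps.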

@diogobragaswogo
Author

Hey @maxjeblick

Thank you for your swift response. From my investigation, it may be due to LORA and Quantization (I'm still in the process of assessing whether tokenizer configurations would come into play). However, just to be sure, here's the training configuration YAML used:

training:
    batch_size: 1
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 1.0
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 0.0001
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_best_checkpoint: false
    schedule: Cosine
    train_validation_data: false
    warmup_epochs: 0.0
    weight_decay: 0.0

If it turns out to be due to LORA/Quantization, is there any recommended literature I could refer to (specific to H2O LLM Studio) for the best way to handle these, as to minimize differences between training and prediction?
My goal here is to avoid falling into a "rabbit-hole" of constantly training the model to increase the BLEU score, only to have the corresponding BLEU in prediction vary randomly (no problem if it doesn't increase as much as in training, as long as it still increases).

@maxjeblick
Contributor

Do your predictions differ completely, is one method consistently producing worse results, or do the predictions differ only after some words?

as to minimize differences between training and prediction?

Switching to fp16 and inference batch size 1 (see discussion here) should reduce differences.
Apart from that, slight differences in logit output are expected between using the model for inference during the training process, using the (merged) model for inference, and using a potentially different implementation (vLLM/TGI). If predictions are completely different or worse for one method, there is probably a bug somewhere.
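
To check which of these cases applies, a small helper along the following lines could report where two predictions first diverge. This is only a sketch: `first_divergence` is a hypothetical name, the comparison is on whitespace tokens rather than model tokens, and the two example strings are made up.

```python
from typing import Optional


def first_divergence(reference: str, candidate: str) -> Optional[int]:
    """Index of the first whitespace token that differs, or None if identical."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    for index, (ref_tok, cand_tok) in enumerate(zip(ref_tokens, cand_tokens)):
        if ref_tok != cand_tok:
            return index
    # One text is a strict prefix of the other.
    if len(ref_tokens) != len(cand_tokens):
        return min(len(ref_tokens), len(cand_tokens))
    return None


# Made-up example strings, for illustration only.
h2o_pred = '{"name": "a", "value": 1}'
model_pred = '{"name": "a", "value": 2}'
print(first_divergence(h2o_pred, model_pred))
```

A divergence index that is consistently late in the text points towards accumulated numerical differences; an index at or near zero is more suggestive of a prompt or tokenizer mismatch.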

@diogobragaswogo
Author

The differences seem to occur only in a subset, although this subset represents around 20% of validation prompts. The methods outside of H2O tend to produce worse results than the ones found in the validation file, but not always.

The prompt and output consist of JSON values (among other things), which tend to make the prediction vary past the initial 100 tokens or so.

Considering your previous explanation, I wouldn't say that they are completely different (or worse) to the point of a bug necessarily existing. However, I'll continue doing some testing with the new information you've given me and will come back with my findings! Thank you for the help so far.

@diogobragaswogo
Author

diogobragaswogo commented Oct 25, 2023

@maxjeblick Sorry for the long hiatus - I'm back with some results and they are... Confusing.

I generated the outputs for the validation set of my model against these three different implementations:

  • Text Generation Inference.
  • Hugging Face Basic transformers pipeline.
  • vLLM

All of the above are using generation config parameters to make them as deterministic as possible (i.e. temp=0.1, do_sample=False, etc.), since in LLM Studio, the validation used deterministic generation. Here's what I found:

Text Generation Inference

  • BLEU score (truth vs. h2o_pred): 80.88295765859095
  • BLEU score (truth vs. model_pred): 28.665971262286216
  • BLEU score (h2o_pred vs. model_pred): 29.20357323737969

Hugging Face Basic Transformers Pipeline

  • BLEU score (truth vs. h2o_pred): 80.88295765859095
  • BLEU score (truth vs. model_pred): 26.181086090582695
  • BLEU score (h2o_pred vs. model_pred): 28.04616682539429

vLLM

  • BLEU score (truth vs. h2o_pred): 80.88295765859095
  • BLEU score (truth vs. model_pred): 21.74056099054717
  • BLEU score (h2o_pred vs. model_pred): 20.213948909024584

On manual inspection, some of the results are not that bad, but they are substantially different from H2O's, even though all of the implementations are run deterministically.

Based on this, I'm not sure if there's a bug, but this definitely makes it harder to improve the model's quality in production. For reference, here's the source code used to get the BLEU scores:

import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Load the CSV file into a DataFrame
df = pd.read_csv("./vllm_infer_results.csv")

# Initialize the BLEU smoothing function
smooth = SmoothingFunction().method1


# Function to compute BLEU score
def compute_bleu(reference, candidate):
    reference_tokens = reference.split()
    candidate_tokens = candidate.split()
    return sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smooth)


# Compute BLEU scores for "truth" vs. "h2o_pred"
bleu_truth_vs_h2o = df.apply(lambda row: compute_bleu(row['truth'], row['h2o_pred']), axis=1)

# Compute BLEU scores for "truth" vs. "model_pred"
bleu_truth_vs_model = df.apply(lambda row: compute_bleu(row['truth'], row['model_pred']), axis=1)

# Compute BLEU scores for "h2o_pred" vs. "model_pred"
bleu_h2o_vs_model = df.apply(lambda row: compute_bleu(row['h2o_pred'], row['model_pred']), axis=1)

# BLEU scores are stored in the respective Series
# You can access the scores for individual rows as needed
print(f"BLEU score (truth vs. h2o_pred): {bleu_truth_vs_h2o.mean()}")
print(f"BLEU score (truth vs. model_pred): {bleu_truth_vs_model.mean()}")
print(f"BLEU score (h2o_pred vs. model_pred): {bleu_h2o_vs_model.mean()}")

@maxjeblick
Contributor

maxjeblick commented Oct 25, 2023

Thanks for the detailed description! I think this recent thread (and corresponding issue) should be relevant. We'll monitor the issue mentioned and implement/support potential fixes.

As your issue is hard to debug remotely, I suggest:

  • Avoid left-padding during inference (inference batch size=1) and disable the KV cache (add the use_cache=False argument here). We may add use_cache to the configuration (and hence the UI) in the future.
  • To elaborate on the post: train a smaller model in fp32 (without LORA if possible) and see whether the models are better aligned (preferably on short texts). If so, the discrepancy is likely a result of rounding errors.
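
As a toy illustration of how rounding alone can change a greedy decision, consider two nearly tied logits. The values below are made up, and the tie-breaking rule (lowest index wins) is an assumption of this sketch; real kernels may behave differently.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def greedy_pick(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Two nearly tied logits (made-up values).
logits_full = [1.0, 1.0003]
# In fp16 the spacing just above 1.0 is about 0.001, so both round to 1.0.
logits_half = [to_fp16(v) for v in logits_full]

pick_full = greedy_pick(logits_full)
pick_half = greedy_pick(logits_half)
print(pick_full, pick_half)
```

Once a single token differs, every later step is conditioned on a different prefix, so the two generations can drift apart quickly even though both are deterministic.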

In general, I'd also suggest using the GPT metric instead of BLEU for incremental model finetuning, as the GPT metric is much better aligned with model quality. Otherwise, Perplexity should also be fine to use.
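
For reference, perplexity in its textbook form is the exponential of the mean negative log-likelihood per token. The sketch below assumes per-token log-probabilities are already available; it is not necessarily how LLM Studio computes the metric internally, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the given tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up example: every reference token was assigned probability 0.5.
logprobs = [math.log(0.5)] * 4
ppl = perplexity(logprobs)
print(ppl)
```

Because it is computed on the reference tokens rather than on generated text, perplexity does not suffer from the divergence-after-one-token effect that makes BLEU on greedy outputs so sensitive.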

@diogobragaswogo
Author

That issue does sound like the scenario that is happening. I'll do a round of testing with those suggestions and will get back here with the results! Thanks for your time and help so far @maxjeblick

@diogobragaswogo
Author

Again, sorry for the long time between replies, but I have some new findings.

This time, I only tested with vLLM and Hugging Face Basic Transformers Pipeline. What I tried:

  • Fine-tuning with int8 weights
  • Fine-tuning with float16 weights
  • Fine-tuning with float32 weights (required deepspeed)

All of the above still had LORA enabled (I need to find a way to re-train without LORA).

The findings:

vLLM int8

  • BLEU score (truth vs. h2o_pred): 81.745937661239095
  • BLEU score (truth vs. model_pred): 70.9285796075405
  • BLEU score (h2o_pred vs. model_pred): 63.29405626127132

vLLM float16

  • BLEU score (truth vs. h2o_pred): 82.1443564857712
  • BLEU score (truth vs. model_pred): 71.02544444704684
  • BLEU score (h2o_pred vs. model_pred): 63.98552556438229

Hugging Face Basic Transformers Pipeline int8

  • BLEU score (truth vs. h2o_pred): 86.38065571613316
  • BLEU score (truth vs. model_pred): 70.19465206463109
  • BLEU score (h2o_pred vs. model_pred): 72.98347839674122

Hugging Face Basic Transformers Pipeline float16

  • BLEU score (truth vs. h2o_pred): 84.68455141870595
  • BLEU score (truth vs. model_pred): 65.59330490822121
  • BLEU score (h2o_pred vs. model_pred): 70.46537890906082

Based on the above, there's a substantial difference in prediction when one goes from int4 to int8/float16.

I haven't been able to export the float32 model yet, as it does not seem to fit on the GPUs (I had to use deepspeed). Is there a workaround for downloading the model weights when the model only fits using deepspeed?

I still need to validate with a float32 model and without LORA, but it does seem that the issue referenced in the thread you've provided @maxjeblick could be the root cause.

@psinger
Collaborator

psinger commented Nov 13, 2023

You can select cpu as device when exporting if it does not fit the GPU.

@psinger psinger closed this as not planned Jan 15, 2024