[BUG] Cannot Reproduce H2O Prediction Output #450
Comments
Thank you for the detailed description! Would it be possible for you to share the training configuration YAML file? Regarding the discrepancies between validation predictions and hosted inference using vLLM and HF TGI, one or more of the following could be a potential explanation:
Additionally, we've observed discrepancies in validation and chat outputs attributed to mismatches in tokenizer configurations, as detailed here.
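(Not part of the original comment.) A minimal sketch of how such a tokenizer-configuration comparison could look; the exported checkpoint path and the prompt text are placeholders:

```python
from transformers import AutoTokenizer

# Hypothetical path: replace with the exported H2O LLM Studio checkpoint.
exported = AutoTokenizer.from_pretrained("path/to/exported-model")
base = AutoTokenizer.from_pretrained("circulus/Llama-2-7b-orca-v1")

# Special tokens and padding behaviour are common sources of drift.
print("special tokens:", exported.special_tokens_map, "vs", base.special_tokens_map)
print("padding side:", exported.padding_side, "vs", base.padding_side)

# Tokenize the same prompt with both tokenizers and check whether the ids match.
prompt = "Example prompt text"  # placeholder
print(exported(prompt).input_ids == base(prompt).input_ids)
```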
Hey @maxjeblick, thank you for your swift response. From my investigation, it may be due to LoRA and quantization (I'm still in the process of assessing whether tokenizer configurations come into play). However, just to be sure, here's the training configuration YAML used:
If it turns out to be due to LoRA/quantization, is there any recommended literature I could refer to (specific to H2O LLM Studio) for the best way to handle these, so as to minimize differences between training and prediction?
Do your predictions differ completely, is one method consistently producing worse results, or do the predictions only differ after some words?
Switching to fp16 and inference batch size 1 (see discussion here) should reduce differences.
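(Not part of the original reply.) A minimal sketch of what fp16 plus inference batch size 1 could look like with plain Transformers; the model path is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/exported-model"  # placeholder for the exported checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Greedy decoding, one prompt at a time (batch size 1) to avoid padding effects.
inputs = tokenizer("example prompt", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```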
They seem to only occur in a subset, although this subset represents around 20% of the validation prompts. The methods outside of H2O tend to produce worse results than the ones found in the validation file, but not always. The prompt and output consist of JSON values (among other things), which tends to make the predictions vary past the initial 100 tokens or so. Considering your previous explanation, I wouldn't say that they are completely different (or worse) to the point where a bug necessarily exists. However, I'll continue doing some testing with the new information you've given me and will come back with my findings! Thank you for the help so far.
@maxjeblick Sorry for the long hiatus - I'm back with some results and they are... confusing. I generated the outputs for the validation set of my model against these three different implementations:
- Text Generation Inference
- Hugging Face Basic Transformers Pipeline
- vLLM

All of the above use generation config parameters to make them as deterministic as possible (i.e., greedy search).
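(The exact settings aren't shown in the thread; the following is only a sketch of what "as deterministic as possible" might look like for vLLM and the Transformers pipeline, with placeholder model paths and prompts.)

```python
import torch
from transformers import pipeline
from vllm import LLM, SamplingParams

model_id = "path/to/exported-model"  # placeholder

# vLLM: temperature 0 collapses sampling to greedy decoding.
llm = LLM(model=model_id, dtype="float16")
vllm_out = llm.generate(["example prompt"], SamplingParams(temperature=0.0, max_tokens=256))

# Transformers pipeline: do_sample=False selects greedy search.
pipe = pipeline("text-generation", model=model_id, torch_dtype=torch.float16, device_map="auto")
hf_out = pipe("example prompt", do_sample=False, max_new_tokens=256)
```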
Although, from a manual qualitative review, some of the results are not that bad, they are definitely substantially different from H2O's, even though all of them are deterministic. Based on this, I'm not sure if there's a bug, but it definitely makes it harder to improve the model's quality in production. For reference, here's the source code used to get the BLEU scores:
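(The original snippet isn't reproduced above; this is only a sketch of how per-implementation BLEU scores against the validation predictions could be computed, assuming sacrebleu and hypothetical file/column names.)

```python
import pandas as pd
import sacrebleu

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("validation_predictions.csv")
references = df["h2o_prediction"].tolist()

for column in ["tgi_prediction", "hf_pipeline_prediction", "vllm_prediction"]:
    score = sacrebleu.corpus_bleu(df[column].tolist(), [references])
    print(f"{column}: BLEU = {score.score:.2f}")
```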
Thanks for the detailed description! I think this recent thread (and corresponding issue) should be relevant. We'll monitor the issue mentioned and implement/support potential fixes. As your issue is hard to debug remotely, I suggest:
In general, I'd also suggest using the GPT metric in favor of BLEU for incremental model finetuning, as the GPT metric is much better aligned with model quality. Otherwise, Perplexity should also be fine to use.
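(Not part of the original reply.) A short sketch of how validation perplexity could be computed with plain Transformers, assuming a placeholder checkpoint path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/exported-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def perplexity(text: str) -> float:
    # Perplexity is the exponential of the mean cross-entropy loss over the sequence.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("example validation prompt and answer"))
```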
That issue does sound like the scenario that is happening. I'll do a round of testing with those suggestions and will report back here with the results! Thanks for your time and help so far @maxjeblick
Again, sorry for the long time between replies, but I have some new findings. This time, I only tested with vLLM and Hugging Face Basic Transformers Pipeline. What I tried:
All of the above still had LoRA activated (I need to find a way to re-train without LoRA). The findings:
- vLLM int8
- vLLM float16
- Hugging Face Basic Transformers Pipeline int8
- Hugging Face Basic Transformers Pipeline float16
Based on the above, there's a substantial difference in prediction when one goes from int4 to int8/float16. Additionally, I haven't been able to export the float32 model, as it seems it does not fit on the GPUs (I had to use DeepSpeed). Is there a workaround for downloading the model weights when the model only fits using DeepSpeed? I still need to validate with a float32 model and without LoRA, but it does seem that the issue referenced in the thread you've provided, @maxjeblick, could be the root cause.
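(For reference, not from the thread.) A sketch of how the int8 and float16 Transformers variants listed above might differ in how the weights are loaded; bitsandbytes is assumed for int8, and the checkpoint path is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "path/to/exported-model"  # placeholder

# float16: weights kept in half precision.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# int8: weights quantized on load via bitsandbytes; the rounding can shift logits
# enough to change greedy-decoded outputs.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```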
You can select
🐛 Bug
I trained a model based on circulus/Llama-2-7b-orca-v1 and exported:
However, I'm currently having trouble reproducing the exact output obtained in the validation predictions CSV. Specifically, the model is using greedy search and I've tried loading the model in the following ways:
vLLM and HF TGI give the same output when running with greedy search, but it differs from the one in the prediction file.
The sample code of the model gives a different output from those of vLLM and HF TGI and from the one in the prediction file when using greedy search (although it comes closer than vLLM and HF TGI do).
Considering the above, I'm not sure if I'm loading the model correctly (and passing the intended generation params), or if there is an issue. The prediction params are as follows:
To Reproduce
Just train circulus/Llama-2-7b-orca-v1 (or another model) on custom data and check whether the validation prediction outputs can be reproduced using greedy search.
LLM Studio version