Fail to reproduce the perplexity of Llama-2 7B on wikitext #2301
Comments
Hi! Can you provide a source? I'll check.
Thanks for your reply! The source for the Llama-2 7B number on wikitext2 is many SOTA quantization works:
On my side, I use the GPTQ or AWQ codebase to run the wikitext evaluation.
I think the wikitext2 baseline in GPTQ or AWQ is correct. However, lm-eval already contains most of the datasets, so I would like to use it for wikitext2 as well. The result is confusing: the other tasks, like the common-sense benchmarks, match, and only wikitext2 fails to align with the other repos.
Is it possible that the wikitext2 metric here is different from what the other codebases use? Almost all papers report the same FP16 baseline for Llama-2 7B on wikitext2 (5.47).
Indeed. AWQ splits the wikitext evaluation out of lm-eval, and so does QuaRot; see AWQ and QuaRot. That way you can reproduce the PPL results in most papers (5.47 for Llama-2 7B and 5.69 for Llama 7B). For the other tasks, lm-eval is fine.
Thanks. Do you mean manually loading the dataset and doing the PPL calculation for wikitext and C4 myself (i.e., changing the code in lm-eval)?
Yes. Just refer to the implementations in AWQ and QuaRot. Maybe there is a better way :)
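For reference, here is a minimal sketch of the style of wikitext2 PPL evaluation those quantization repos use (not their exact code): concatenate the raw test split, tokenize it once, split it into fixed 2048-token windows, and exponentiate the average token-level NLL. The model name, window size, and dtype below are illustrative, and the per-window loss is treated as covering the full window, which is the usual approximation.

```python
# Sketch of a GPTQ/AWQ-style wikitext2 perplexity evaluation (token-normalized).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative
seqlen = 2048                          # context window used by most papers

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()
model.eval()

# Concatenate the raw test split and tokenize it once.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.cuda()

n_chunks = input_ids.size(1) // seqlen
nlls = []
with torch.no_grad():
    for i in range(n_chunks):
        chunk = input_ids[:, i * seqlen : (i + 1) * seqlen]
        # .loss is the mean token-level negative log-likelihood over the chunk
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
print(f"wikitext2 token-level perplexity: {ppl.item():.2f}")
```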
Thanks a lot! I'll try that.
Hi @baberabb, did you get a chance to look at this? I ran into similar problems. We (PyTorch/ExecuTorch) use lm_eval as well but observed discrepancies in the wikitext2 perplexity for Llama models.
Hi! Sorry about the delay. Looked into it, and there are a couple of differences:
Made a fork here to report token-normalized perplexity (for
Hope this helps!
Thanks @baberabb for investigating this! This is really helpful and makes a lot of sense. I looked at your fork; it looks like it would require the tokenizer if we want to report perplexity normalized by token count. Going forward, it seems that we won't be able to eliminate the difference discussed in this issue, since it is caused by two different implementations and both are valid. Is it possible to let lm_eval report perplexity normalized by token count?
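If patching lm-eval is not an option, a rough post-hoc conversion is possible. Assuming the wikitext task's word_perplexity is exp(total NLL / word count) with an approximately whitespace-based word count (treat both as assumptions), the total NLL is the same quantity under either normalization, so token_ppl = word_ppl ** (n_words / n_tokens). The word_ppl value below is a placeholder, not a measurement.

```python
# Hypothetical post-hoc conversion from lm-eval's word_perplexity to a
# token-normalized perplexity. Word count here is a whitespace-split
# approximation of what lm-eval uses, so expect small deviations.
import math
from datasets import load_dataset
from transformers import AutoTokenizer

word_ppl = 8.79  # placeholder: substitute the word_perplexity lm_eval reports

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
text = "\n\n".join(test["text"])

n_words = len(text.split())                 # approximate word count
n_tokens = len(tokenizer(text).input_ids)   # model-specific token count

# total_nll = n_words * ln(word_ppl) = n_tokens * ln(token_ppl)
token_ppl = math.exp(math.log(word_ppl) * n_words / n_tokens)
print(f"token-normalized perplexity ~= {token_ppl:.2f}")
```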
Hi, when I use the following command to evaluate Llama-2 7B on wikitext2:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks wikitext --device cuda:0 --batch_size 1
The result is:
However, the FP16 result I see in many papers is 5.47. Another confusing point: for the other tasks, like piqa, winogrande, arc-e, arc-c, etc., I get exactly the same results as the papers report. Thanks!
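For anyone who wants to see every metric the wikitext task produces in one place (word- and byte-level perplexity, bits_per_byte), the same run can also be driven from lm-eval's Python API. This is a sketch against the lm-eval 0.4.x API; exact metric key names may differ across versions.

```python
# Sketch: same evaluation through lm-eval's Python API (lm-eval 0.4.x style).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-hf,dtype=float16",
    tasks=["wikitext"],
    device="cuda:0",
    batch_size=1,
)
# The wikitext entry holds word_perplexity, byte_perplexity, and bits_per_byte
# (key names may carry a ",none" filter suffix depending on the version).
print(results["results"]["wikitext"])
```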