
Fail to reproduce the perplexity of Llama-2 7B on wikitext #2301

Open · Yonghao-Tan opened this issue Sep 15, 2024 · 13 comments

@Yonghao-Tan

Hi, when I use the following command to evaluate Llama-2 7B on wikitext2:
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks wikitext --device cuda:0 --batch_size 1
The result is
[screenshot of lm-eval output: word_perplexity ≈ 8.7071]
However, the FP16 result I have seen in many papers is 5.47. Another confusing point is that for the other tasks, such as piqa, winogrande, arc-e, arc-c, ..., I get exactly the same results as the papers report. Thanks!

@baberabb
Contributor

Hi! Can you provide a source? I'll check.

@Yonghao-Tan
Author

Thanks for your reply! The sources for Llama-2 7B on wikitext2 are several SOTA quantization works:
https://arxiv.org/pdf/2306.00978 (page 7)
https://github.com/qwopqwop200/GPTQ-for-LLaMa (Llama-1 only)
https://arxiv.org/pdf/2308.13137 (page 7)
They all report PPL 5.68 for Llama-1-7B and 5.47 for Llama-2-7B as the FP16 baseline, which is far from the 8.7071 I got with lm-eval.
Thank you in advance.

@lonleyodd

lonleyodd commented Sep 26, 2024

(quoting @Yonghao-Tan's original post above)

Hello, how did you test zero-shot tasks like piqa, arc, ...? Here are my results; compared with the results in the SpinQuant paper, something seems wrong. By the way, what is the difference between acc and acc_norm? I don't know which one to compare against.
[screenshots: zero-shot accuracy results]

@huweim

huweim commented Sep 27, 2024

For my part, I use the GPTQ or AWQ codebase to run the wikitext evaluation.

@Yonghao-Tan
Author

I think the wikitext2 baseline in GPTQ or AWQ is correct. However, lm-eval covers most of the datasets, so I want to use it for wikitext2 as well. The result is confusing because the other tasks, such as the common-sense benchmarks, match; only wikitext2 fails to align with the other repos.

@Yonghao-Tan
Author

Is it possible that the wikitext2 metric here differs from the one used in the other codebases? Nearly every paper reports the same FP16 baseline for Llama-2-7B on wikitext2 (5.47).

@huweim

huweim commented Sep 27, 2024

(quoting @Yonghao-Tan's comment above)

Indeed. AWQ keeps the wikitext evaluation separate from lm-eval, and so does QuaRot (see the AWQ and QuaRot repos). That way you can reproduce the PPL results in most papers (5.47 for Llama-2-7B and 5.69 for Llama-7B).
Therefore, I chose to load the dataset manually and run the PPL calculation on wikitext and C4 myself.
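
For reference, here is a minimal sketch of that style of evaluation, assuming the meta-llama/Llama-2-7b-hf checkpoint and a 2048-token window. It follows the concatenate-then-chunk recipe used by the GPTQ/AWQ evaluations rather than lm-eval's own pipeline, so treat it as illustrative, not as either repo's exact code:

# Minimal sketch of a GPTQ/AWQ-style wikitext2 perplexity loop:
# concatenate the test split, tokenize it once, and score
# non-overlapping 2048-token windows. Checkpoint and window size
# are assumptions, not fixed requirements.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
seqlen = 2048

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()

# Join every document in the test split into one long string and tokenize it once.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(model.device)

nlls = []
n_chunks = input_ids.size(1) // seqlen
with torch.no_grad():
    for i in range(n_chunks):
        chunk = input_ids[:, i * seqlen : (i + 1) * seqlen]
        # labels == inputs: the model shifts internally and returns the mean token NLL.
        loss = model(input_ids=chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
print(f"wikitext2 token-level perplexity: {ppl.item():.4f}")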

For other tasks, lm-eval is good.

@Yonghao-Tan
Author

(quoting @huweim's reply above)

Thanks. Do you mean loading the dataset manually and running the PPL calculation on wikitext and C4 within lm-eval (i.e., changing the lm-eval code)?

@huweim

huweim commented Sep 27, 2024

(quoting the question above)

Yes. Just refer to the implementation of AWQ and QuaRot.

Maybe there is a better way :)

@Yonghao-Tan
Author

Thanks a lot! I'll try that

@helunwencser

Hi @baberabb, did you get a chance to look at this? I ran into similar problems. We (PyTorch/ExecuTorch) use lm_eval as well, but we observed discrepancies in the wikitext2 perplexity for Llama models.

@baberabb
Contributor

baberabb commented Oct 9, 2024

Hi! Sorry about the delay. I looked into it, and there are a couple of differences:

  1. Normalization: we normalize the perplexity by word count:
    _words = len(re.split(r"\s+", doc["page"]))

    and not by token count as used, for example, by llm-awq. Both are valid, but lm-eval has traditionally reported tokenizer-agnostic metrics; for details, see Appendix A3. (A small sketch contrasting the two normalizations follows this list.)
  2. The wikitext dataset used here is document-level (EleutherAI/wikitext_document_level), while AWQ and QuaRot use an aggregated dataset. This means:
    • Our implementation creates non-overlapping chunks of size model_length for each document separately and then reports an aggregate measure of these document-level perplexities.
    • The other implementation concatenates all the text in the dataset into a single long sequence before creating the chunks.
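
To make the difference in point 1 concrete, here is a small sketch contrasting the two normalizations; the function and variable names are illustrative, not lm-eval internals:

import math
import re

def word_perplexity(total_nll: float, text: str) -> float:
    # lm-eval convention: exponentiate the total NLL divided by the
    # whitespace-split word count, which is tokenizer-agnostic.
    num_words = len(re.split(r"\s+", text))
    return math.exp(total_nll / num_words)

def token_perplexity(total_nll: float, num_tokens: int) -> float:
    # GPTQ/AWQ convention: divide the same total NLL by the number of
    # tokens, so the value depends on the tokenizer used.
    return math.exp(total_nll / num_tokens)

# Both exponentiate the same summed NLL, so one can be converted to the other:
#   token_ppl == word_ppl ** (num_words / num_tokens)

This is part of why the 8.71 word-level number and the 5.47 token-level number can both be correct for the same model; the dataset difference in point 2 accounts for the rest of the gap.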

I made a fork here that reports token-normalized perplexity (for Llama-2-7b-hf) instead of word-level, and uses the aggregated dataset. The result is 5.4775, matching the other sources. The command used was:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,max_length=2048 --tasks wikitext --batch_size 8

Hope this helps!

@helunwencser

Thanks @baberabb for investigating this! It is really helpful and makes a lot of sense.

I looked at your fork; it looks like it would require the tokenizer if we want to report perplexity normalized by token count. Going forward, it seems we won't be able to eliminate the difference discussed in this issue, since it comes from two different implementations and both are valid. Is it possible to have lm_eval report perplexity normalized by token count?
