
Fail to reproduce the perplexity of Llama-2 7B on wikitext #2301

Open · Yonghao-Tan opened this issue Sep 15, 2024 · 13 comments

@Yonghao-Tan

Hi, when I use the following command to evaluate Llama-2 7B on wikitext2:
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks wikitext --device cuda:0 --batch_size 1
The result is
[screenshot of lm-eval output: word_perplexity ≈ 8.7071]
However, the FP16 result I have seen in many papers is 5.47. Another confusing point is that for the other tasks, such as piqa, winogrande, arc-e, arc-c, ..., I get exactly the same results as the papers report. Thanks!

@baberabb
Contributor

Hi! Can you provide a source? I'll check.

@Yonghao-Tan
Author

Thanks for your reply! The sources for Llama-2 7B on wikitext2 are several SOTA quantization works:
https://arxiv.org/pdf/2306.00978 (page 7)
https://github.com/qwopqwop200/GPTQ-for-LLaMa (Llama-1 only)
https://arxiv.org/pdf/2308.13137 (page 7)
They all report PPL 5.68 for Llama-1-7B and 5.47 for Llama-2-7B as the FP16 baseline, which is far from the 8.7071 I got with lm-eval.
Thank you in advance.

@lonleyodd

lonleyodd commented Sep 26, 2024

(quoting @Yonghao-Tan's original post above)

Hello, how did you test zero-shot tasks like piqa, arc, ...? Here are my results; compared with the results in the SpinQuant paper, something seems wrong. By the way, what is the difference between acc and acc_norm? I don't know which one to compare against.
[screenshots: zero-shot accuracy results]

@huweim

huweim commented Sep 27, 2024

For my part, I use the GPTQ or AWQ codebase to run the wikitext evaluation.

@Yonghao-Tan
Author

I think the wikitext2 baseline in GPTQ or AWQ is correct. However, lm-eval covers most of the datasets, so I want to use it for wikitext2 as well. The result is confusing because the other tasks, such as the common-sense benchmarks, match; only wikitext2 fails to align with the other repos.

@Yonghao-Tan
Author

Is it possible that the wikitext2 metric here differs from the one used in the other codebases? Nearly every paper reports the same FP16 baseline for Llama-2-7B on wikitext2 (5.47).

@huweim

huweim commented Sep 27, 2024

(quoting @Yonghao-Tan's comment above)

Indeed. AWQ keeps the wikitext evaluation separate from lm-eval, and so does QuaRot (see the AWQ and QuaRot repos). That way you can reproduce the PPL results in most papers (5.47 for Llama-2-7B and 5.69 for Llama-7B).
Therefore, I chose to load the dataset manually and run the PPL calculation on wikitext and C4 myself.
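
For reference, here is a minimal sketch of that style of evaluation, assuming the meta-llama/Llama-2-7b-hf checkpoint and a 2048-token window. It follows the concatenate-then-chunk recipe used by the GPTQ/AWQ evaluations rather than lm-eval's own pipeline, so treat it as illustrative, not as either repo's exact code:

# Minimal sketch of a GPTQ/AWQ-style wikitext2 perplexity loop:
# concatenate the test split, tokenize it once, and score
# non-overlapping 2048-token windows. Checkpoint and window size
# are assumptions, not fixed requirements.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
seqlen = 2048

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.eval()

# Join every document in the test split into one long string and tokenize it once.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(model.device)

nlls = []
n_chunks = input_ids.size(1) // seqlen
with torch.no_grad():
    for i in range(n_chunks):
        chunk = input_ids[:, i * seqlen : (i + 1) * seqlen]
        # labels == inputs: the model shifts internally and returns the mean token NLL.
        loss = model(input_ids=chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
print(f"wikitext2 token-level perplexity: {ppl.item():.4f}")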

For other tasks, lm-eval is good.

@Yonghao-Tan
Author

(quoting @huweim's reply above)

Thanks. Do you mean loading the dataset manually and running the PPL calculation on wikitext and C4 within lm-eval (i.e., changing the lm-eval code)?

@huweim

huweim commented Sep 27, 2024

(quoting the question above)

Yes. Just refer to the implementation of AWQ and QuaRot.

Maybe there is a better way :)

@Yonghao-Tan
Author

Thanks a lot! I'll try that

@helunwencser

Hi @baberabb, did you get a chance to look at this? I ran into similar problems. We (PyTorch/ExecuTorch) use lm_eval as well, but we observed discrepancies in the wikitext2 perplexity for Llama models.

@baberabb
Contributor

baberabb commented Oct 9, 2024

Hi! Sorry about the delay. I looked into it, and there are a couple of differences:

  1. Normalization: we normalize the perplexity by word count:
    _words = len(re.split(r"\s+", doc["page"]))

    and not by token count as used, for example, by llm-awq. Both are valid, but lm-eval has traditionally reported tokenizer-agnostic metrics; for details, see Appendix A3. (A small sketch contrasting the two normalizations follows this list.)
  2. The wikitext dataset used here is document-level (EleutherAI/wikitext_document_level), while AWQ and QuaRot use an aggregated dataset. This means:
    • Our implementation creates non-overlapping chunks of size model_length for each document separately and then reports an aggregate measure of these document-level perplexities.
    • The other implementation concatenates all the text in the dataset into a single long sequence before creating the chunks.
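
To make the difference in point 1 concrete, here is a small sketch contrasting the two normalizations; the function and variable names are illustrative, not lm-eval internals:

import math
import re

def word_perplexity(total_nll: float, text: str) -> float:
    # lm-eval convention: exponentiate the total NLL divided by the
    # whitespace-split word count, which is tokenizer-agnostic.
    num_words = len(re.split(r"\s+", text))
    return math.exp(total_nll / num_words)

def token_perplexity(total_nll: float, num_tokens: int) -> float:
    # GPTQ/AWQ convention: divide the same total NLL by the number of
    # tokens, so the value depends on the tokenizer used.
    return math.exp(total_nll / num_tokens)

# Both exponentiate the same summed NLL, so one can be converted to the other:
#   token_ppl == word_ppl ** (num_words / num_tokens)

This is part of why the 8.71 word-level number and the 5.47 token-level number can both be correct for the same model; the dataset difference in point 2 accounts for the rest of the gap.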

I made a fork here that reports token-normalized perplexity (for Llama-2-7b-hf) instead of word-level, and uses the aggregated dataset. The result is 5.4775, matching the other sources. The command used was:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf,max_length=2048 --tasks wikitext --batch_size 8

Hope this helps!

@helunwencser

Thanks @baberabb for investigating this! It is really helpful and makes a lot of sense.

I looked at your fork; it looks like it would require the tokenizer if we want to report perplexity normalized by token count. Going forward, it seems we won't be able to eliminate the difference discussed in this issue, since it comes from two different implementations and both are valid. Is it possible to have lm_eval report perplexity normalized by token count?
