CLongEval is a Chinese benchmark for evaluating long-context LLMs, which is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating to models with context windows size from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels.
The image below presents statistics for each task. We stratify the benchmark into small, medium, and large subsets. The small set primarily includes test data with lengths ranging from 1K to 16K tokens, the medium set mainly encompasses lengths from 16K to 50K tokens, and the large set primarily extends from 50K to 100K tokens.
The tables below display model scores across three subsets using automated evaluation metrics. Evaluations for GLM-4-128K were conducted up to the cut-off date of February 21, 2024, whereas those for other models were completed by February 15, 2024.
LStQA | LCvMem | LStSum | StNLab | StTDet | KpRet | TblQry | |
---|---|---|---|---|---|---|---|
Zh-LLAMA2-7B-64K | 29.34 | 41.40 | 10.29 | 0.59 | 0 | 2.86 | 7.50 |
Zh-Alpaca2-7B-64K | 35.52 | 29.34 | 14.29 | 4.97 | 0.09 | 6.39 | 9.75 |
Qwen-7B-32K | 31.94 | 47.71 | 11.20 | 4.31 | 0 | 11.18 | 6.64 |
ChatGLM3-6B-32K | 49.36 | 53.40 | 16.37 | 0.46 | 0.91 | 33.67 | 22.60 |
InternLM2-7B-32K | 49.55 | 58.34 | 17.29 | 16.46 | 2.27 | 21.87 | 20.75 |
InternLM2-20B-32K | 53.82 | 57.41 | 17.00 | 11.16 | 0.91 | 34.97 | 17.25 |
GLM-4-128K | 52.74 | 46.74 | 20.29 | 87.93 | 17.40 | 81.47 | 73.25 |
Mooshot-v1-32K | 60.21 | 51.76 | 21.56 | 89.01 | 25.36 | 86.74 | 66.50 |
GPT-4-Turbo-128K | 66.19 | 63.42 | 21.96 | 79.70 | 38.35 | 84.24 | 82.35 |
LStQA | LCvMem | LStSum | StNLab | StTDet | KpRet | TblQry | |
---|---|---|---|---|---|---|---|
Zh-LLAMA2-7B-64K | 16.90 | 26.30 | 7.74 | 0 | 0 | 1.21 | N/A |
Zh-Alpaca2-7B-64K | 18.41 | 22.45 | 8.56 | 0 | 0 | 0.93 | N/A |
InternLM2-7B-200K | 29.59 | 32.07 | 8.13 | 0 | 0 | 1.45 | 4.50 |
InternLM2-20B-200K | 25.13 | 36.84 | 13.99 | 0 | 0 | 1.64 | 6.25 |
Moonshot-v1-128K | 51.20 | 38.29 | 18.81 | 86.30 | 11.33 | 78.64 | 66.50 |
GPT-4-Turbo-128K | 52.63 | 54.18 | 17.38 | 37.40 | 9.32 | 22.34 | 52.76 |
LStQA | LCvMem | LStSum | StNLab | StTDet | KpRet | TblQry | |
---|---|---|---|---|---|---|---|
InternLM2-7B-200K | 19.03 | 18.16 | 2.36 | 0 | 0 | 0.89 | 2.67 |
InternLM2-20B-200K | 15.62 | 28.39 | 8.31 | 0 | 0 | 0.51 | 0.67 |
Moonshot-v1-128K | 41.52 | 32.59 | 16.38 | 78.48 | 4.33 | 51.50 | 52.00 |
We have uploaded CLongEval to Hugging Face. The files can be downloaded from this link and manually put into the data
directory.
We use the lmdeploy
framework for InternLM2 series inference, huggingface's native methods for Qwen-7B-32K inference, and the vllm
framework for other open-source model inference. Please modify the model path in config/model2path.json
before performing inference to ensure proper loading of the models from the local path. Our code is adapted and modified based on LongBench.
Take InternLM2-7B as an example, For single-GPU inference, use the following command:
python inference.py --model_name internlm2-7b-200k --size small.jsonl --dataset_name long_story_qa --gpu_memory_utilization 0.8 --tensor_parallel_size 1 --gpus "0"
For multi-GPU inference, use the following command:
python inference.py --model_name internlm2-7b-200k --size small.jsonl --dataset_name long_story_qa --gpu_memory_utilization 0.8 --tensor_parallel_size 2 --gpus "0,1"
You can check inference_example.sh
that shows the complete commands to run the Inference for InternLM2-7B and obtain all results. After the above command is complete, the inference results will be saved in inference_results/internlm2-7b-200k/
. Additionally, we also provide the inference results of GPT-4-Turbo and Moonshot-v1 in inference_results/
.
Use the following command to obtain the model's performance on a specific dataset:
python eval.py --model_name internlm2-7b-200k --datasets long_story_qa
The evaluated scores will be saved into the eval_results/intenlm2-7b-200k/
.
-
To reproduce Figure 2 (Performance w.r.t Answer Position) of the paper, please refer to the code and instructions provided in the
lost_in_the_middle
folder. -
To reproduce Figure 3 (Performance change analysis of GPT-4-Turbo on StNLab) and Figure 4 (Performance change analysis of Moonshot-v1 on StNLab) of the paper, please refer to the code and instructions provided in the
heat_map
folder.
If you find CLongEval useful in your research, please consider citing:
@misc{qiu2024clongeval,
title={CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models},
author={Zexuan Qiu and Jingjing Li and Shijue Huang and Wanjun Zhong and Irwin King},
year={2024},
eprint={2403.03514},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- We are grateful for the support from Moonshot AI for providing a complimentary token budget, enabling us to utilize Moonshot-v1-128K for testing on both the Medium and Large Sets.