# LongBench
LongBench is a bilingual, multi-task benchmark for comprehensively assessing the long-context understanding capabilities of large language models. This project evaluates the relevant models on the LongBench dataset.

Below we describe how to run predictions on LongBench. Users can also refer to our Colab notebook:
Set up the environment according to `requirements.txt`, which has been copied to `scripts/longbench`:

```bash
pip install -r scripts/longbench/requirements.txt
```
The inference script will automatically download the dataset from 🤗 Datasets.
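To illustrate what "the tasks to predict" means in practice, here is a hypothetical sketch of how a `--predict_on` value such as `zh` or `en,code` could expand into LongBench subset names. The `zh` grouping matches the five Chinese tasks reported in the result file shown later in this page; the `en` and `code` groupings are assumptions based on the official LongBench subset list, and the script's actual mapping may differ.

```python
# Hypothetical mapping from a --predict_on class to LongBench subset names.
# "zh" follows the Chinese tasks reported below; "en" and "code" are
# assumptions based on the official LongBench subset list.
DATA_CLASSES = {
    "zh": ["dureader", "multifieldqa_zh", "vcsum", "lsht",
           "passage_retrieval_zh"],
    "en": ["narrativeqa", "qasper", "multifieldqa_en", "hotpotqa",
           "2wikimqa", "musique", "gov_report", "qmsum", "multi_news",
           "trec", "triviaqa", "samsum", "passage_count",
           "passage_retrieval_en"],
    "code": ["lcc", "repobench-p"],
}

def subsets_for(predict_on: str) -> list[str]:
    """Expand a comma-separated class list (e.g. "en,zh") into subsets."""
    return [name
            for cls in predict_on.split(",")
            for name in DATA_CLASSES[cls.strip()]]
```

For example, `subsets_for("zh,code")` yields the five Chinese subsets followed by the two code subsets.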
Run the following script:

```bash
model_path=path/to/chinese_llama2_or_alpaca2
output_dir=path/to/output_dir
data_class=zh
with_inst="true" # or "false" or "auto"
max_length=3584

cd scripts/longbench
python pred_llama2.py \
    --model_path ${model_path} \
    --predict_on ${data_class} \
    --output_dir ${output_dir} \
    --with_inst ${with_inst} \
    --max_length ${max_length}
```
- `--model_path ${model_path}`: Path to the model to be evaluated (a full Chinese-LLaMA-2 or Chinese-Alpaca-2 model, not a LoRA).
- `--predict_on ${data_class}`: The tasks to predict on. Possible values are `en`, `zh`, `code`, or a combination such as `en,zh,code`.
- `--output_dir ${output_dir}`: Output directory for the predictions and logs.
- `--max_length ${max_length}`: Maximum length of the instructions. Note that the lengths of the system prompt and the task-related prompt are not included.
- `--with_inst ${with_inst}`: Whether to use the system prompt and template of Chinese-Alpaca-2 when constructing the instructions:
  - `true`: use the system prompt and template on all tasks
  - `false`: use the system prompt and template on none of the tasks
  - `auto`: use the system prompt and template on some tasks (the default strategy of the official LongBench code)

  We suggest setting `--with_inst` to `auto` when testing Alpaca, and to `false` when testing LLaMA.
- `--gpus ${gpus}`: Specify the GPUs to use, such as `0,1`.
- `--alpha ${alpha}`: The scaling factor of the NTK method, usually set to `sequence_length / model_context_length * 2 - 1`, or simply to `auto`.
- `--e`: Predict on the LongBench-E dataset. See the official LongBench documentation for details on LongBench-E.
- `--use_flash_attention_2`: Use Flash-Attention 2 to speed up inference.
- `--use_ntk`: Use dynamic NTK to extend the context window. Does not work with the 64K version of the long-context model.
When the script finishes, the prediction files are stored under `${output_dir}/pred/` or `${output_dir}/pred_e/` (depending on whether you are testing on LongBench-E). Run the following command to compute metrics:
```bash
python eval.py --output_dir ${output_dir}
```
If testing on LongBench-E, pass `-e` when computing metrics:

```bash
python eval.py --output_dir ${output_dir} -e
```
The results are stored in `${output_dir}/result.json` or `${output_dir}/pred_e/result.json`. For example, the results of Chinese-Alpaca-2-7B on the LongBench Chinese tasks (`--predict_on zh`) are:
```json
{
  "lsht": 20.5,
  "multifieldqa_zh": 32.74,
  "passage_retrieval_zh": 4.5,
  "vcsum": 11.52,
  "dureader": 16.59
}
```
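To summarize a run with a single headline number, the per-task scores in `result.json` can be averaged. A small sketch using the figures above; note the plain macro-average is our own convention here, not an official LongBench aggregation:

```python
import json

# The zh result file reproduced from above.
result_json = """{
  "lsht": 20.5,
  "multifieldqa_zh": 32.74,
  "passage_retrieval_zh": 4.5,
  "vcsum": 11.52,
  "dureader": 16.59
}"""

scores = json.loads(result_json)
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # macro-average over the five Chinese tasks: 17.17
```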