Tasks | Dataset | Checkpoints | Paper | Citation | License
This repository contains resources for accessing the official benchmark, code, and checkpoints of the paper: Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension.
The paper was accepted by the GenBench Workshop at EMNLP 2023! 🎉
Orca has the following salient features:

1. Data in Orca were collected from November 2021 to November 2022 on Weibo, one of the most popular social media platforms in China. The collected data therefore reflect real human interests, are quite recent, and have never been included in earlier benchmarks, posing a challenge to existing language models. Moreover, good results on Orca are of practical interest.
2. We carefully annotate conversations across 33 domains. In contrast, among the commonly used datasets, CoQA covers only 7 domains and DoQA only 3. The variety of data domains makes Orca closer to real scenarios and better suited to evaluating the generalization of CMRC models.
3. Answers at each turn in a conversation are natural and informative responses written by human annotators rather than spans extracted from the provided passage. This way, we can evaluate both the comprehension ability and the generation ability of models.
Concretely, we store our dataset in JSON files:
"0": {
"topic": "邓伦资本版图",
"domain": "人物",
"context": {
"0": {
"query": "邓论是谁?",
"response": "邓伦,1992年10月21日出生于河北省石家庄市,中国内地影视男演员,毕业于上海戏剧学院表演系。",
"query-type": "Causal",
"passage": "邓伦,1992年10月21日出生于河北省石家庄市,中国内地影视男演员,毕业于上海戏剧学院表演系。天眼查App显示,邓伦名下有2家公司,分别为邓伦(上海)影视文化工作室和舟山邓伦影视文化工作室,2家工作室均为邓伦个人独资企业,分别成立于2016年和2018年。"
},
...
"4": {
"query": "他有哪些代言",
"response": "邓伦有宝格丽、雪花秀、欧莱雅、联合利华等代言",
"query-type": "List",
"passage": "邓伦代言汇总: 1、宝格丽:品牌代言人 2、雪花秀:品牌亚太区代言人 3、欧莱雅:彩妆品牌代言人 4、联合利华:中国区洗护发代言人,清扬净爽代言人"
}
}
},
Each conversation in the dataset is keyed by a unique number and contains:

- Turn_no: the number of turns in the conversation
- topic: the hot topic the conversation centers on, unique within a conversation
- domain: the domain the topic belongs to, unique within a conversation

Each turn in a conversation contains:

- query: the question asked at this turn
- response: the human-annotated natural response
- query-type: the type of the question (e.g., Causal, List)
- passage: the supporting passage, unique within each turn
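As a minimal sketch, the dataset can be loaded and iterated in Python like this (the file name orca_support.json is an assumption; adjust it to your local copy):

```python
import json

# Load one split of Orca (file name is hypothetical; use your local file).
with open("orca_support.json", encoding="utf-8") as f:
    conversations = json.load(f)

for conv_id, conv in conversations.items():
    print(f"Conversation {conv_id}: topic={conv['topic']}, domain={conv['domain']}")
    # Turns are keyed by their index ("0", "1", ...) within the conversation.
    for turn_id in sorted(conv["context"], key=int):
        turn = conv["context"][turn_id]
        print(f"  Turn {turn_id} [{turn['query-type']}]")
        print(f"    Q: {turn['query']}")
        print(f"    A: {turn['response']}")
```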
- Support Set
- Test Set: please contact [email protected] for access to the test set.
Here we report the automatic and human evaluation results of four baselines in our paper.
| Model | zero-shot | 5-shot | 10-shot | 200-shot |
|---|---|---|---|---|
| T5 | T5-base | T5-base | T5-base | T5-base |
| BART | BART-Large | BART-Large | BART-Large | BART-Large |
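Assuming the released checkpoints are standard Hugging Face seq2seq checkpoints (an assumption; adapt this to the actual release format), they can be loaded for generation like this, with a hypothetical local path:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Path to a downloaded checkpoint, e.g. the 5-shot BART-Large one (path is hypothetical).
ckpt = "checkpoints/bart-large-5shot"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

# Encode a question (optionally concatenated with its passage) and generate a response.
inputs = tokenizer("邓论是谁?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```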
We provide the inference code for BART; please refer to utils/inference_bart.py.
```bash
python3 utils/inference_bart.py --type <type> --model_path <model_path> --output_path <output_path> --bsz <batch_size>
```
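For example, a 5-shot run might look like the following (all argument values here are hypothetical; check inference_bart.py for the accepted values):

```bash
python3 utils/inference_bart.py --type 5-shot --model_path checkpoints/bart-large-5shot --output_path outputs/5-shot --bsz 16
```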
The script then generates two files: labels.txt and prediction.txt.
Of note, to ease inference, we reformulate the test set for running this script. Please contact [email protected] for the corresponding file.
```bash
python3 utils/evaluate_metrics.py --predict_path <prediction_file> --labels_path <labels_file>
```
Running this script computes the automatic metrics for the model.
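As an illustration of the kind of comparison involved (not the exact metrics used in the paper; those are computed by utils/evaluate_metrics.py), a minimal character-level F1 between the two output files could be computed like this:

```python
# Minimal sketch: character-level F1 between prediction.txt and labels.txt,
# assuming one example per line in each file.
from collections import Counter

def char_f1(pred: str, label: str) -> float:
    # Count overlapping characters between prediction and reference.
    common = Counter(pred) & Counter(label)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(label)
    return 2 * precision * recall / (precision + recall)

with open("prediction.txt", encoding="utf-8") as p, open("labels.txt", encoding="utf-8") as l:
    scores = [char_f1(x.strip(), y.strip()) for x, y in zip(p, l)]
print(f"Average char-level F1: {sum(scores) / len(scores):.4f}")
```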
```bibtex
@inproceedings{Chen2023NaturalRG,
  title={Natural Response Generation for Chinese Reading Comprehension},
  author={Nuo Chen and Hongguang Li and Yinan Bao and Baoyuan Wang and Jia Li},
  year={2023}
}

@article{Chen2023OrcaAF,
  title={Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension},
  author={Nuo Chen and Hongguang Li and Yinan Bao and Junqing He and Xinshi Lin and Qi Yang and Jianfeng Liu and Ruyi Gan and Jiaxing Zhang and Baoyuan Wang and Jia Li},
  journal={ArXiv},
  year={2023},
  volume={abs/2302.13619}
}
```