# Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions
This repository contains the data and code for the TACL 2023 paper *Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions*. In this study, we explore the problem of contrast consistency in open-domain question answering by collecting Minimally Edited Questions (MEQs) as challenging contrast sets for the popular Natural Questions (NQ) benchmark, in addition to its standard test set. Our experiments show that the widely used dense passage retrieval (DPR) model performs poorly at distinguishing training questions from their minimally edited contrast-set counterparts. Moving a step forward, we improve the contrast consistency of the DPR model via data augmentation and a query-side contrastive learning objective.
Data can be found at this Google Drive link. It includes:

- `dataset/json`: the MEQ contrast sets in text form. `Q1` is the original question from the NQ training set; `Q2` is the corresponding MEQ, either retrieved from AmbigQA or generated by InstructGPT. Their answers are `A1` and `A2`, respectively.
- `dataset/retrieval`: the same data as the json files, but in the format of DPR retrieval input. Questions are listed in pairs: odd lines are the original NQ training questions and even lines are the corresponding MEQs.
- `dataset/ranking`: data used for ranking evaluation. We provide both the original training questions and their corresponding MEQ contrast sets to test the contrast consistency of the retrieval model (i.e., to compare performance between the original question and the MEQ). `ambigqa-ranking.json` contains 623 examples from MEQ-AmbigQA with their gold evidence passages; `gpt-ranking.json` contains 1229 examples from MEQ-GPT with their gold evidence passages. Files named `nq-train` are the corresponding training questions from NQ.
- `dataset/train`: the training data and the batch data indices for the model. `nq-contrastive-augment-train-dpr.jsonl` is the data used to train the model, including the original NQ data and augmented MEQs from PAQ. `contrastive-augment-33k-train-batches64_idx.jsonl` holds the pre-computed data indices for each batch during training, used to carefully schedule the positive and negative questions in the query-side contrastive loss; it supports training on 1, 2, or 4 GPUs. If you are using 8 GPUs, use `contrastive-augment-33k-train-batches64_idx-8gpu.jsonl` instead (a re-arranged version of the same indices).
- `dataset/dev`: the dev set used when training the model.
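As a sketch of how the json examples relate to the paired-line retrieval format, assuming the `Q1`/`Q2`/`A1`/`A2` field names described above (the actual JSON schema and example content may differ):

```python
# A hypothetical MEQ example with the Q1/Q2/A1/A2 fields described above;
# the question pair and answers here are illustrative, not from the dataset.
example = {
    "Q1": "who won the fifa world cup in 2014",  # original NQ-style question
    "A1": ["Germany"],
    "Q2": "who won the fifa world cup in 2018",  # minimally edited question
    "A2": ["France"],
}

def to_retrieval_lines(examples):
    """Flatten MEQ examples into the paired-line retrieval input:
    odd lines hold the original questions, even lines the MEQs."""
    lines = []
    for ex in examples:
        lines.append(ex["Q1"])
        lines.append(ex["Q2"])
    return lines

print("\n".join(to_retrieval_lines([example])))
```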
Besides the data, this repo also contains the code for training the improved DPR model described in the paper, which is equipped with additional augmented data from PAQ and a query-side contrastive learning objective. The full pipeline (training the model, generating Wikipedia passage embeddings, retrieving passages from Wikipedia, and evaluating the retrieval results) is in `scripts/train_dpr.sh`.
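A single invocation might look like the following command fragment; the environment variables are the knobs referenced in the checkpoint descriptions further down, and the values here are illustrative:

```shell
# Illustrative run of the full pipeline from a checkout of this repo.
# These values mirror one contrastive-loss configuration; adjust as needed.
QUESTION_LOSS=contrastive \
HINGE_MARGIN=0 \
CONTRAST_LOSS=0.5 \
CONTRAST_START_EPOCH=5 \
bash scripts/train_dpr.sh
```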
The Python environment mainly follows the one used by the original DPR repo.
- Install PyTorch: `pip install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111`
- Install the other dependencies: `pip install -r requirements.txt`
Checkpoints of the models used in the paper can be found at this Box link. It contains the following checkpoints:

- `DPR_base_contrastive_33k_batch64_lr1e-5_epoch40_loss0.5_start5`: the best DPR-base checkpoint in ranking evaluation. This model is trained with the InfoNCE loss with weight 0.5, and the contrastive loss starts at epoch 5. This corresponds to setting `QUESTION_LOSS=contrastive HINGE_MARGIN=0 CONTRAST_LOSS=0.5 CONTRAST_START_EPOCH=5` in `scripts/train_dpr.sh`.
- `DPR_base_dot_33k_batch64_lr1e-5_epoch40_loss0.03`: the best DPR-base checkpoint in retrieval and QA evaluation. This model is trained with the dot-product loss with weight 0.03. This corresponds to setting `QUESTION_LOSS=dot HINGE_MARGIN=0 CONTRAST_LOSS=0.03 CONTRAST_START_EPOCH=0` in `scripts/train_dpr.sh`.
- `DPR_large_PAQ_dot_33k_batch32_lr1e-5_epoch40_loss0.003`: the best DPR-large checkpoint in ranking evaluation. This model is trained with the dot-product loss with weight 0.003. This corresponds to setting `QUESTION_LOSS=dot HINGE_MARGIN=0 CONTRAST_LOSS=0.003 CONTRAST_START_EPOCH=0` in `scripts/train_dpr.sh`. Add the option `encoder.pretrained_model_cfg=bert-large-uncased` to switch to the BERT-large encoder.
- `DPR_large_PAQ_dot_33k_batch32_lr1e-5_epoch40_loss0.03`: the best DPR-large checkpoint in retrieval and QA evaluation. This model is trained with the dot-product loss with weight 0.03. This corresponds to setting `QUESTION_LOSS=dot HINGE_MARGIN=0 CONTRAST_LOSS=0.03 CONTRAST_START_EPOCH=0` in `scripts/train_dpr.sh`. Add the option `encoder.pretrained_model_cfg=bert-large-uncased` to switch to the BERT-large encoder.
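To illustrate the `QUESTION_LOSS=dot` option, one plausible reading is a penalty on the similarity between each original question embedding and its MEQ embedding, scaled by `CONTRAST_LOSS` and added to the standard retrieval loss. This is a minimal sketch under that assumption, in plain Python rather than torch tensors, and not the paper's exact formulation:

```python
def dot(u, v):
    """Dot product of two embedding vectors (plain lists stand in for tensors)."""
    return sum(a * b for a, b in zip(u, v))

def query_dot_penalty(orig_embs, meq_embs):
    """Mean dot product between paired original-question and MEQ embeddings.
    Minimizing this pushes each question away from its minimally edited
    counterpart in embedding space; during training it would be weighted
    by CONTRAST_LOSS and added to the usual passage-retrieval loss."""
    assert len(orig_embs) == len(meq_embs)
    return sum(dot(q, m) for q, m in zip(orig_embs, meq_embs)) / len(orig_embs)

# Two question/MEQ pairs with toy 2-d embeddings:
print(query_dot_penalty([[1.0, 0.0], [0.0, 1.0]],
                        [[1.0, 0.0], [0.0, 0.0]]))  # → 0.5
```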
If you use our data or code, please cite our paper:
@article{zhang2023exploring,
  author    = {Zhihan Zhang and Wenhao Yu and Zheng Ning and Mingxuan Ju and Meng Jiang},
  title     = {Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions},
  journal   = {Transactions of the Association for Computational Linguistics},
  volume    = {11},
  year      = {2023},
  publisher = {MIT Press}
}