This directory contains the code to reproduce the experiments that calculate the QA metrics and correlations on the manually labeled subset of the Fabbri (2020) data (Tables 2 and 3).
Required data:

- `data/fabbri2020/{summaries,metrics}.jsonl`

Required environments:

- `QAEVAL_ENV`
- `TRANSFORMERS_ENV`
- `MOVERSCORE_ENV`
- `APES_ENV`
To recalculate the numbers, run:

```
sh experiments/question-answering/fabbri2020/run.sh
```
Because the question generation model assigns question IDs randomly, rerunning question generation will produce IDs that do not match the IDs we used for the manual annotation. Therefore, to recalculate the scores reported in the paper, we had to include the intermediate outputs of the script, and `run.sh` only finishes the processing from those outputs. However, it still contains the commented-out code that we used to run the full pipeline.
We did the annotation in batches, so you may see batch numbers 1 and 2 in the output directories and code.
Included data:

- `output/{1,2}/expert/mturk-output.csv`: The manually annotated answers to the questions for each batch.
- `output/all/{expert,model}/answers.jsonl`: The post-processed answers for batches 1 and 2 combined and the QAEval answers for the same subset. We had to include these because converting the MTurk output to `answers.jsonl` requires the questions, and the ID mapping is off. (A quick way to inspect these files is sketched after this list.)
- `output/all/answer-verification/labeled-answers.csv`: The manual annotations of whether the system/human answers are verified as being correct or incorrect.
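As a quick sanity check, the JSON Lines answer files can be inspected with a short script like the one below. This is only a minimal sketch: it assumes nothing about the record schema beyond the JSON Lines format, and the path is simply one of the files listed above.

```python
import json

# Path to one of the post-processed answer files listed above.
path = "output/all/expert/answers.jsonl"

with open(path, "r") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # Print only the top-level keys of the first few records,
        # since the exact schema is not documented here.
        print(sorted(record.keys()))
        if i >= 2:
            break
```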
After the `run.sh` script finishes, the metric correlations (Table 3) will be written to the following locations (a short sketch for loading them follows the list):

- `output/baselines/correlations/rouge-X.json`: The summary-level ROUGE correlations on the labeled subset
- `output/baselines/correlations/MoverScore.json`: The summary-level MoverScore correlations on the labeled subset
- `output/baselines/correlations/APES.json`: The summary-level APES correlations on the labeled subset
- `output/all/model/correlations/f1.json`: The summary-level QAEval correlations with model answers and F1 answer verification
- `output/all/expert/correlations/f1.json`: The summary-level QAEval correlations with human answers and F1 answer verification
- `output/all/answer-verification/model/correlations/human-is-correct.json`: The summary-level QAEval correlations with model answers and human answer verification
- `output/all/answer-verification/expert/correlations/human-is-correct.json`: The summary-level QAEval correlations with human answers and human answer verification
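The correlation files are plain JSON, so they can be loaded with a few lines of Python. The snippet below is a minimal sketch; the key structure inside each file depends on how the correlation script writes its output, so it simply pretty-prints whatever is there.

```python
import json
from pathlib import Path

# The correlation files listed above (QAEval variants plus baselines).
# The ROUGE correlation files follow the rouge-X.json pattern shown above.
correlation_files = [
    "output/baselines/correlations/MoverScore.json",
    "output/baselines/correlations/APES.json",
    "output/all/model/correlations/f1.json",
    "output/all/expert/correlations/f1.json",
    "output/all/answer-verification/model/correlations/human-is-correct.json",
    "output/all/answer-verification/expert/correlations/human-is-correct.json",
]

for path in correlation_files:
    # Pretty-print each file; no schema is assumed here.
    contents = json.loads(Path(path).read_text())
    print(path)
    print(json.dumps(contents, indent=2))
```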
The QA metrics (Table 2) will be written to `output/all/squad-metrics.json` and `output/all/answer-verification/log.txt`. The first file contains the is-answerable F1 (`is_answerable -> unweighted -> f1`) and the EM/F1 scores on just the subset of the data which is answerable (`is-answerable-only -> squad -> exact-match/f1`). The second file contains the human-labeled answer accuracy given that the question is answerable (the line `Accuracy given ground-truth question is answerable 0.8629737609329446`).
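To pull out the Table 2 numbers programmatically, a sketch like the following should work. The nested key paths for `squad-metrics.json` follow the arrow notation above, and the accuracy is recovered by scanning `log.txt` for the quoted line; treat the exact key spellings as assumptions if the output format changes.

```python
import json

# Extract the is-answerable F1 and the EM/F1 on the answerable subset,
# following the key paths described above
# (e.g. is_answerable -> unweighted -> f1).
with open("output/all/squad-metrics.json", "r") as f:
    metrics = json.load(f)

is_answerable_f1 = metrics["is_answerable"]["unweighted"]["f1"]
answerable_em = metrics["is-answerable-only"]["squad"]["exact-match"]
answerable_f1 = metrics["is-answerable-only"]["squad"]["f1"]
print(is_answerable_f1, answerable_em, answerable_f1)

# Recover the human-labeled answer accuracy from the log file by scanning
# for the line quoted above.
with open("output/all/answer-verification/log.txt", "r") as f:
    for line in f:
        if line.startswith("Accuracy given ground-truth question is answerable"):
            print(float(line.rsplit(" ", 1)[1]))
```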