This dataset contains expert and Turker annotations for summaries on the CNN/DailyMail dataset as collected in [1]. The setup command saves the summaries and references for all of the systems, along with their corresponding annotations and input documents. See this GitHub repository for more details.
```bash
sacrerouge setup-dataset fabbri2020 <output-dir>
```
The output files are the following:
- `summaries.jsonl`: The model output summaries with their input documents and the ground-truth references
- `summaries-with-crowd.jsonl`: The model output summaries with their input documents and the ground-truth and ten crowdsourced references
- `metrics.jsonl`: The expert and Turker annotations that correspond to `summaries.jsonl` and `summaries-with-crowd.jsonl`
- `all-summaries-preproc-refs.jsonl.gz`: All of the model outputs across the entire CNN/DM test dataset. Each model output keeps the reference it was originally paired with, which is some preprocessed version of the reference that appears in `summaries.jsonl`. That is, the outputs are grouped by `instance_id`, but each `instance_id` may have many different references due to differences in the models' preprocessing.
- `all-summaries-orig-refs.jsonl.gz`: All of the model outputs across the entire CNN/DM test dataset. This version uses the documents and references as extracted by the Huggingface CNN/DM scripts, so the documents and references should be common across the same `instance_id`.
For `all-summaries-preproc-refs.jsonl.gz` and `all-summaries-orig-refs.jsonl.gz`, the aligned system outputs contain duplicate instances. We keep only the first occurrence of any instance and ensure that the summary which was judged is the one selected.
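For reference, here is a minimal sketch of how these files could be read and joined. It assumes only that each line is a JSON object and that the annotations in `metrics.jsonl` share the `instance_id` and `summarizer_id` keys with `summaries.jsonl`; any other field names are illustrative assumptions rather than part of the documented format.

```python
import gzip
import json

def load_jsonl(path):
    """Read a .jsonl or .jsonl.gz file into a list of dicts."""
    open_fn = gzip.open if path.endswith(".gz") else open
    with open_fn(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

summaries = load_jsonl("summaries.jsonl")
annotations = load_jsonl("metrics.jsonl")

# Join the human judgments to the judged summaries on the shared keys.
by_key = {(m["instance_id"], m["summarizer_id"]): m for m in annotations}
for summary in summaries:
    key = (summary["instance_id"], summary["summarizer_id"])
    judgments = by_key.get(key)  # expert/Turker annotations for this summary, if any
```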
Notes:
- The raw data does not identify which reference summary is the original ground-truth reference, but after checking a handful of instances, it appears to always be the first reference in the list of references. That first reference is the one included in `summaries.jsonl`. (Confirmed)
- To make the crowd summaries distinct, each is given a `summarizer_id` of `turker-` followed by a number from 1 to 10. It is not necessarily the case that the summaries identified by `turker-i` were all written by the same person, so they should not be treated as such. (See the sketch below.)
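As an illustration of the naming convention above, the crowdsourced references could be separated from the original one by checking the `summarizer_id` prefix. This assumes each reference entry is a dict that carries its own `summarizer_id`, which is an assumption about the file layout rather than something stated here.

```python
def split_references(references):
    """Separate the original ground-truth reference from the turker-written ones.

    Assumes each reference is a dict with a "summarizer_id" field; the crowd
    references use ids of the form "turker-1" through "turker-10".
    """
    crowd = [r for r in references
             if str(r.get("summarizer_id", "")).startswith("turker-")]
    original = [r for r in references
                if not str(r.get("summarizer_id", "")).startswith("turker-")]
    return original, crowd
```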
Here are the correlations of some of the metrics implemented in this library to the responsiveness scores in this dataset. The columns report Pearson's r, Spearman's ρ, and Kendall's τ.
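As a rough illustration of what the summary-level and system-level numbers mean (not SacreROUGE's exact implementation), the two levels could be computed as sketched below, assuming `metric_scores[system][instance]` and `responsiveness[system][instance]` are dicts of per-summary scores; the `p` and `k` columns would swap `pearsonr` for `spearmanr` and `kendalltau`.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr

def summary_level(metric_scores, responsiveness, corr=pearsonr):
    """For each input document, correlate metric and human scores across the
    systems, then average the per-document correlations."""
    systems = list(metric_scores)
    instances = metric_scores[systems[0]]
    values = []
    for inst in instances:
        x = [metric_scores[sys][inst] for sys in systems]
        y = [responsiveness[sys][inst] for sys in systems]
        values.append(corr(x, y)[0])
    return float(np.mean(values))

def system_level(metric_scores, responsiveness, corr=pearsonr):
    """Average each system's scores over all documents, then correlate the
    per-system averages across systems."""
    systems = list(metric_scores)
    x = [np.mean(list(metric_scores[sys].values())) for sys in systems]
    y = [np.mean(list(responsiveness[sys].values())) for sys in systems]
    return corr(x, y)[0]
```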
Single-reference, summary-level
| | Fabbri2020 | | |
| --- | --- | --- | --- |
| | r | p | k |
R1-P | 0.13 | 0.12 | 0.09 |
R1-R | 0.31 | 0.28 | 0.23 |
R1-F1 | 0.28 | 0.26 | 0.20 |
R2-P | 0.15 | 0.13 | 0.09 |
R2-R | 0.26 | 0.23 | 0.18 |
R2-F1 | 0.23 | 0.19 | 0.14 |
BERTScore-P | 0.17 | 0.17 | 0.13 |
BERTScore-R | 0.37 | 0.35 | 0.27 |
BERTScore-F1 | 0.29 | 0.28 | 0.22 |
MoverScore | 0.28 | 0.24 | 0.18 |
QAEval-EM | 0.23 | 0.23 | 0.19 |
QAEval-F1 | 0.30 | 0.29 | 0.22 |
Single-reference, system-level
| | Fabbri2020 | | |
| --- | --- | --- | --- |
| | r | p | k |
R1-P | 0.29 | 0.15 | 0.03 |
R1-R | 0.55 | 0.56 | 0.42 |
R1-F1 | 0.61 | 0.62 | 0.50 |
R2-P | 0.49 | 0.41 | 0.25 |
R2-R | 0.65 | 0.78 | 0.57 |
R2-F1 | 0.64 | 0.60 | 0.43 |
BERTScore-P | 0.18 | 0.11 | 0.02 |
BERTScore-R | 0.84 | 0.91 | 0.75 |
BERTScore-F1 | 0.54 | 0.40 | 0.28 |
MoverScore | 0.56 | 0.54 | 0.42 |
QAEval-EM | 0.80 | 0.91 | 0.77 |
QAEval-F1 | 0.82 | 0.91 | 0.77 |
Multi-reference, summary-level
| | Fabbri2020 | | |
| --- | --- | --- | --- |
| | r | p | k |
R1-P | 0.13 | 0.14 | 0.10 |
R1-R | 0.33 | 0.29 | 0.23 |
R1-F1 | 0.36 | 0.33 | 0.25 |
R2-P | 0.20 | 0.21 | 0.16 |
R2-R | 0.34 | 0.31 | 0.24 |
R2-F1 | 0.33 | 0.29 | 0.22 |
BERTScore-P | 0.18 | 0.19 | 0.14 |
BERTScore-R | 0.42 | 0.38 | 0.29 |
BERTScore-F1 | 0.31 | 0.31 | 0.24 |
MoverScore | 0.33 | 0.27 | 0.21 |
QAEval-EM | 0.33 | 0.29 | 0.22 |
QAEval-F1 | 0.40 | 0.35 | 0.27 |
Multi-reference, system-level
| | Fabbri2020 | | |
| --- | --- | --- | --- |
| | r | p | k |
R1-P | 0.03 | 0.08 | 0.02 |
R1-R | 0.38 | 0.30 | 0.23 |
R1-F1 | 0.55 | 0.77 | 0.58 |
R2-P | 0.34 | 0.26 | 0.13 |
R2-R | 0.41 | 0.29 | 0.23 |
R2-F1 | 0.57 | 0.64 | 0.43 |
BERTScore-P | 0.13 | 0.14 | 0.05 |
BERTScore-R | 0.80 | 0.85 | 0.70 |
BERTScore-F1 | 0.41 | 0.48 | 0.38 |
MoverScore | 0.46 | 0.36 | 0.30 |
QAEval-EM | 0.60 | 0.58 | 0.43 |
QAEval-F1 | 0.62 | 0.65 | 0.48 |
[1] Alexander R. Fabbri, Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. "SummEval: Re-evaluating Summarization Evaluation". 2020.