# Decomposed ROUGE

Decomposed ROUGE is a collection of ROUGE-based metrics calculated separately for several categories of content, such as noun phrases, named entities, or nominal subjects (nsubj). It was included in the analysis described in [1]. The category-level metrics provide a better understanding of the differences between two summarization systems, for instance, by demonstrating that one system's F1 score on (subject, verb, object) tuples improved over another's.

We have only included the ROUGE-based decomposition from [1], not the BERTScore-based one, since the latter is more complicated and slower to compute. Please see the paper's experiment repository if you want the BERTScore decomposition.
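
To make the decomposition concrete, here is a minimal sketch (not the library's implementation) of a category-restricted ROUGE-1: unigram precision, recall, and F1 are computed only over tokens that belong to one category, here nouns tagged by spaCy. The function name and matching rules are illustrative assumptions; the real metric follows the decomposition in [1].

```python
from collections import Counter

import spacy

# Requires the en_core_web_sm model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def category_rouge1(summary: str, reference: str, pos: str = "NOUN") -> dict:
    """Illustrative category-restricted ROUGE-1: unigram overlap counted only
    over tokens whose part-of-speech tag equals `pos`."""
    sum_counts = Counter(t.text.lower() for t in nlp(summary) if t.pos_ == pos)
    ref_counts = Counter(t.text.lower() for t in nlp(reference) if t.pos_ == pos)

    # Clipped unigram matches, as in standard ROUGE-1.
    matches = sum((sum_counts & ref_counts).values())
    precision = matches / max(sum(sum_counts.values()), 1)
    recall = matches / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(category_rouge1(
    "The company reported strong quarterly earnings.",
    "Quarterly earnings at the company were strong.",
))
```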

## Setting Up

Decomposed ROUGE has a dependency on ROUGE's data, so set up the ROUGE metric first, then the DecomposedRouge metric:

```
sacrerouge setup-metric rouge
sacrerouge setup-metric decomposed-rouge
```
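
Once both are set up, the metric can also be used from Python. The sketch below is an untested illustration that assumes `DecomposedRouge` is importable from `sacrerouge.metrics` with a no-argument constructor and follows the standard SacreROUGE `score(summary, references)` interface; check the package for the exact signature and constructor options.

```python
from sacrerouge.metrics import DecomposedRouge

# Assumes the default constructor; see sacrerouge.metrics for available options.
metric = DecomposedRouge()

summary = "The quick brown fox jumped over the lazy dog."
references = ["A quick brown fox jumps over a lazy dog."]

# Standard SacreROUGE reference-based interface: one summary, a list of references.
# The returned dict should contain the category-level ROUGE scores; its exact
# structure may differ from this sketch.
scores = metric.score(summary, references)
print(scores)
```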

## Correlations

Here are the correlations of the different category-specific metrics to the "overall responsiveness" scores on the TAC datasets. In the tables below, r, p, and k denote the Pearson, Spearman, and Kendall correlation coefficients, respectively. The correlations were calculated using spaCy version 2.3.3 and model version 2.3.1.

Summary-level, peers only:

| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| dep-dobj | 0.23 | 0.23 | 0.19 | 0.39 | 0.35 | 0.28 | 0.31 | 0.31 | 0.26 | 0.27 | 0.27 | 0.22 |
| dep-nsubj | 0.15 | 0.15 | 0.12 | 0.34 | 0.28 | 0.22 | 0.29 | 0.27 | 0.22 | 0.24 | 0.20 | 0.16 |
| dep-root | 0.11 | 0.11 | 0.10 | 0.25 | 0.18 | 0.15 | 0.21 | 0.21 | 0.18 | 0.24 | 0.23 | 0.19 |
| dep-verb+dobj | 0.23 | 0.23 | 0.21 | 0.38 | 0.32 | 0.28 | 0.28 | 0.29 | 0.26 | 0.29 | 0.29 | 0.25 |
| dep-verb+nsubj | 0.22 | 0.22 | 0.20 | 0.35 | 0.30 | 0.26 | 0.18 | 0.18 | 0.16 | 0.23 | 0.24 | 0.21 |
| dep-verb+nsubj+dobj | 0.15 | 0.16 | 0.15 | 0.28 | 0.21 | 0.19 | 0.12 | 0.13 | 0.12 | 0.15 | 0.16 | 0.14 |
| ner | 0.32 | 0.30 | 0.24 | 0.40 | 0.35 | 0.27 | 0.42 | 0.38 | 0.31 | 0.40 | 0.32 | 0.26 |
| np-chunks | 0.45 | 0.44 | 0.35 | 0.54 | 0.48 | 0.38 | 0.65 | 0.62 | 0.50 | 0.57 | 0.46 | 0.37 |
| pos-adj | 0.26 | 0.24 | 0.20 | 0.30 | 0.26 | 0.21 | 0.42 | 0.42 | 0.34 | 0.35 | 0.31 | 0.25 |
| pos-adv | 0.06 | 0.07 | 0.06 | 0.16 | 0.14 | 0.12 | 0.13 | 0.12 | 0.11 | 0.14 | 0.16 | 0.14 |
| pos-noun | 0.37 | 0.35 | 0.28 | 0.48 | 0.42 | 0.33 | 0.60 | 0.56 | 0.45 | 0.53 | 0.44 | 0.35 |
| pos-num | 0.23 | 0.23 | 0.20 | 0.24 | 0.20 | 0.17 | 0.25 | 0.27 | 0.23 | 0.29 | 0.29 | 0.24 |
| pos-propn | 0.34 | 0.34 | 0.27 | 0.45 | 0.37 | 0.29 | 0.47 | 0.43 | 0.35 | 0.42 | 0.34 | 0.28 |
| pos-verb | 0.29 | 0.29 | 0.23 | 0.42 | 0.36 | 0.28 | 0.46 | 0.44 | 0.36 | 0.45 | 0.40 | 0.32 |
| rouge-1 | 0.49 | 0.48 | 0.39 | 0.54 | 0.47 | 0.38 | 0.66 | 0.65 | 0.53 | 0.59 | 0.52 | 0.42 |
| stopwords | 0.24 | 0.23 | 0.18 | 0.36 | 0.28 | 0.21 | 0.46 | 0.38 | 0.30 | 0.48 | 0.33 | 0.26 |

Summary-level, peers + references:

| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| dep-dobj | 0.29 | 0.27 | 0.22 | 0.36 | 0.35 | 0.28 | 0.36 | 0.35 | 0.28 | 0.28 | 0.29 | 0.23 |
| dep-nsubj | 0.25 | 0.21 | 0.17 | 0.37 | 0.32 | 0.25 | 0.36 | 0.33 | 0.26 | 0.25 | 0.22 | 0.18 |
| dep-root | 0.26 | 0.20 | 0.17 | 0.31 | 0.25 | 0.20 | 0.33 | 0.29 | 0.25 | 0.32 | 0.29 | 0.24 |
| dep-verb+dobj | 0.27 | 0.26 | 0.22 | 0.32 | 0.33 | 0.27 | 0.30 | 0.31 | 0.27 | 0.26 | 0.28 | 0.23 |
| dep-verb+nsubj | 0.28 | 0.26 | 0.23 | 0.29 | 0.31 | 0.26 | 0.26 | 0.24 | 0.22 | 0.21 | 0.23 | 0.20 |
| dep-verb+nsubj+dobj | 0.19 | 0.19 | 0.18 | 0.20 | 0.20 | 0.18 | 0.17 | 0.16 | 0.15 | 0.12 | 0.14 | 0.12 |
| ner | 0.33 | 0.32 | 0.25 | 0.39 | 0.35 | 0.28 | 0.43 | 0.40 | 0.32 | 0.36 | 0.30 | 0.24 |
| np-chunks | 0.51 | 0.48 | 0.39 | 0.53 | 0.51 | 0.41 | 0.66 | 0.65 | 0.53 | 0.54 | 0.47 | 0.37 |
| pos-adj | 0.30 | 0.27 | 0.22 | 0.29 | 0.27 | 0.21 | 0.40 | 0.41 | 0.33 | 0.30 | 0.28 | 0.22 |
| pos-adv | 0.08 | 0.08 | 0.08 | 0.09 | 0.10 | 0.08 | 0.14 | 0.14 | 0.12 | 0.12 | 0.13 | 0.11 |
| pos-noun | 0.46 | 0.41 | 0.32 | 0.46 | 0.44 | 0.34 | 0.64 | 0.60 | 0.49 | 0.53 | 0.45 | 0.36 |
| pos-num | 0.33 | 0.29 | 0.24 | 0.33 | 0.28 | 0.23 | 0.38 | 0.36 | 0.30 | 0.33 | 0.32 | 0.26 |
| pos-propn | 0.38 | 0.37 | 0.29 | 0.46 | 0.40 | 0.31 | 0.49 | 0.45 | 0.37 | 0.42 | 0.35 | 0.28 |
| pos-verb | 0.36 | 0.34 | 0.27 | 0.43 | 0.40 | 0.31 | 0.52 | 0.49 | 0.40 | 0.44 | 0.41 | 0.33 |
| rouge-1 | 0.56 | 0.54 | 0.44 | 0.55 | 0.53 | 0.42 | 0.69 | 0.70 | 0.58 | 0.58 | 0.55 | 0.45 |
| stopwords | 0.25 | 0.25 | 0.20 | 0.35 | 0.31 | 0.24 | 0.44 | 0.39 | 0.31 | 0.46 | 0.35 | 0.28 |

System-level, peers only:

| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| dep-dobj | 0.69 | 0.72 | 0.53 | 0.81 | 0.82 | 0.64 | 0.83 | 0.77 | 0.63 | 0.70 | 0.54 | 0.38 |
| dep-nsubj | 0.50 | 0.52 | 0.37 | 0.69 | 0.61 | 0.43 | 0.85 | 0.76 | 0.59 | 0.64 | 0.33 | 0.24 |
| dep-root | 0.42 | 0.49 | 0.35 | 0.56 | 0.49 | 0.35 | 0.42 | 0.56 | 0.41 | 0.63 | 0.47 | 0.33 |
| dep-verb+dobj | 0.78 | 0.79 | 0.60 | 0.72 | 0.79 | 0.63 | 0.79 | 0.77 | 0.62 | 0.86 | 0.78 | 0.59 |
| dep-verb+nsubj | 0.75 | 0.72 | 0.52 | 0.65 | 0.79 | 0.61 | 0.69 | 0.64 | 0.47 | 0.75 | 0.67 | 0.52 |
| dep-verb+nsubj+dobj | 0.49 | 0.42 | 0.30 | 0.55 | 0.74 | 0.57 | 0.42 | 0.46 | 0.33 | 0.69 | 0.66 | 0.51 |
| ner | 0.80 | 0.81 | 0.61 | 0.83 | 0.75 | 0.59 | 0.92 | 0.86 | 0.69 | 0.91 | 0.71 | 0.55 |
| np-chunks | 0.78 | 0.79 | 0.60 | 0.85 | 0.79 | 0.61 | 0.92 | 0.89 | 0.78 | 0.90 | 0.71 | 0.54 |
| pos-adj | 0.79 | 0.75 | 0.57 | 0.72 | 0.57 | 0.43 | 0.93 | 0.82 | 0.67 | 0.92 | 0.77 | 0.59 |
| pos-adv | 0.46 | 0.46 | 0.32 | 0.65 | 0.52 | 0.36 | 0.83 | 0.85 | 0.67 | 0.68 | 0.46 | 0.33 |
| pos-noun | 0.73 | 0.72 | 0.52 | 0.83 | 0.71 | 0.51 | 0.90 | 0.86 | 0.74 | 0.88 | 0.66 | 0.49 |
| pos-num | 0.73 | 0.69 | 0.51 | 0.75 | 0.76 | 0.59 | 0.62 | 0.64 | 0.50 | 0.78 | 0.50 | 0.36 |
| pos-propn | 0.76 | 0.78 | 0.58 | 0.83 | 0.74 | 0.58 | 0.89 | 0.83 | 0.67 | 0.89 | 0.68 | 0.52 |
| pos-verb | 0.80 | 0.75 | 0.56 | 0.79 | 0.72 | 0.56 | 0.87 | 0.83 | 0.67 | 0.85 | 0.68 | 0.48 |
| rouge-1 | 0.80 | 0.80 | 0.60 | 0.83 | 0.78 | 0.60 | 0.90 | 0.95 | 0.84 | 0.91 | 0.79 | 0.59 |
| stopwords | 0.48 | 0.54 | 0.37 | 0.71 | 0.58 | 0.43 | 0.74 | 0.72 | 0.52 | 0.85 | 0.50 | 0.37 |

System-level, peers + references:

| | TAC2008 | | | TAC2009 | | | TAC2010 | | | TAC2011 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | r | p | k | r | p | k | r | p | k | r | p | k |
| dep-dobj | 0.79 | 0.79 | 0.61 | 0.66 | 0.86 | 0.68 | 0.87 | 0.85 | 0.71 | 0.68 | 0.62 | 0.45 |
| dep-nsubj | 0.80 | 0.66 | 0.50 | 0.74 | 0.72 | 0.54 | 0.84 | 0.82 | 0.66 | 0.60 | 0.49 | 0.36 |
| dep-root | 0.87 | 0.66 | 0.49 | 0.68 | 0.65 | 0.49 | 0.81 | 0.72 | 0.56 | 0.80 | 0.64 | 0.48 |
| dep-verb+dobj | 0.73 | 0.81 | 0.63 | 0.52 | 0.80 | 0.65 | 0.86 | 0.84 | 0.68 | 0.65 | 0.74 | 0.56 |
| dep-verb+nsubj | 0.85 | 0.81 | 0.62 | 0.46 | 0.83 | 0.65 | 0.88 | 0.78 | 0.61 | 0.49 | 0.59 | 0.44 |
| dep-verb+nsubj+dobj | 0.63 | 0.47 | 0.36 | 0.31 | 0.75 | 0.57 | 0.71 | 0.62 | 0.46 | 0.22 | 0.40 | 0.32 |
| ner | 0.71 | 0.83 | 0.64 | 0.64 | 0.78 | 0.61 | 0.82 | 0.89 | 0.73 | 0.59 | 0.62 | 0.47 |
| np-chunks | 0.79 | 0.85 | 0.68 | 0.69 | 0.85 | 0.67 | 0.84 | 0.93 | 0.82 | 0.66 | 0.76 | 0.59 |
| pos-adj | 0.76 | 0.82 | 0.64 | 0.59 | 0.65 | 0.50 | 0.82 | 0.85 | 0.70 | 0.61 | 0.68 | 0.50 |
| pos-adv | 0.52 | 0.49 | 0.35 | 0.11 | 0.29 | 0.21 | 0.71 | 0.76 | 0.62 | 0.31 | 0.31 | 0.22 |
| pos-noun | 0.82 | 0.80 | 0.62 | 0.69 | 0.79 | 0.60 | 0.87 | 0.91 | 0.79 | 0.72 | 0.74 | 0.57 |
| pos-num | 0.80 | 0.77 | 0.59 | 0.85 | 0.83 | 0.67 | 0.86 | 0.78 | 0.62 | 0.82 | 0.66 | 0.50 |
| pos-propn | 0.75 | 0.83 | 0.63 | 0.72 | 0.81 | 0.64 | 0.84 | 0.87 | 0.72 | 0.69 | 0.68 | 0.53 |
| pos-verb | 0.88 | 0.83 | 0.65 | 0.70 | 0.80 | 0.63 | 0.90 | 0.89 | 0.75 | 0.76 | 0.76 | 0.57 |
| rouge-1 | 0.86 | 0.86 | 0.69 | 0.72 | 0.85 | 0.68 | 0.85 | 0.97 | 0.87 | 0.71 | 0.87 | 0.69 |
| stopwords | 0.48 | 0.56 | 0.39 | 0.58 | 0.70 | 0.51 | 0.59 | 0.72 | 0.52 | 0.61 | 0.61 | 0.47 |
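
Correlations of this form are plain Pearson, Spearman, and Kendall coefficients between the metric scores and the human judgments for each summary (or, at the system level, each system average). Here is a minimal sketch with made-up numbers, not the library's evaluation pipeline:

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Made-up example values: one metric score and one responsiveness judgment per summary.
metric_scores = [0.42, 0.31, 0.55, 0.28, 0.47]
responsiveness = [3.0, 2.0, 4.0, 2.0, 3.5]

r, _ = pearsonr(metric_scores, responsiveness)    # "r" column
p, _ = spearmanr(metric_scores, responsiveness)   # "p" column (Spearman's rho)
k, _ = kendalltau(metric_scores, responsiveness)  # "k" column (Kendall's tau)
print(f"pearson={r:.2f}  spearman={p:.2f}  kendall={k:.2f}")
```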

## Contributions

Here are the contributions of each category to the overall ROUGE score. These numbers are the percentage of token matches that can be explained by the corresponding category. We believe they differ slightly from the results in the paper because they were computed with spaCy's en_core_web_sm model version 2.2.5, whereas the paper used version 2.1.0.

| | TAC2008 | TAC2009 | TAC2010 | TAC2011 |
|---|---|---|---|---|
| dep-dobj | 1.99 | 2.14 | 1.50 | 2.32 |
| dep-nsubj | 3.92 | 3.91 | 3.01 | 3.32 |
| dep-root | 1.13 | 1.25 | 1.40 | 1.46 |
| dep-verb+dobj | 1.22 | 1.73 | 0.82 | 1.69 |
| dep-verb+nsubj | 0.83 | 1.14 | 0.53 | 1.22 |
| dep-verb+nsubj+dobj | 0.26 | 0.43 | 0.15 | 0.52 |
| ner | 13.40 | 12.49 | 9.17 | 8.26 |
| np-chunks | 58.98 | 57.67 | 54.02 | 54.66 |
| pos-adj | 3.86 | 3.53 | 3.51 | 3.63 |
| pos-adv | 0.36 | 0.43 | 0.50 | 0.55 |
| pos-noun | 17.61 | 15.61 | 17.30 | 22.15 |
| pos-num | 1.50 | 1.27 | 1.78 | 2.50 |
| pos-propn | 15.38 | 14.74 | 11.13 | 9.81 |
| pos-verb | 4.77 | 5.66 | 4.73 | 5.70 |
| stopwords | 54.68 | 57.35 | 58.69 | 50.07 |
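
As a rough illustration of the quantity being reported (using hypothetical token lists and helper names), a category's contribution can be computed as the fraction of all clipped unigram matches that fall within that category:

```python
from collections import Counter

# Hypothetical pre-tokenized summary and reference with per-token categories
# (in the real metric, categories come from spaCy's tagger, parser, and NER).
summary_tokens = [("earnings", "NOUN"), ("rose", "VERB"), ("sharply", "ADV")]
reference_tokens = [("earnings", "NOUN"), ("increased", "VERB"), ("sharply", "ADV")]

def matches(summ, ref, category=None):
    """Clipped unigram matches, optionally restricted to a single category."""
    keep = lambda toks: Counter(w for w, c in toks if category is None or c == category)
    return sum((keep(summ) & keep(ref)).values())

total = matches(summary_tokens, reference_tokens)                  # all token matches: 2
noun = matches(summary_tokens, reference_tokens, category="NOUN")  # matches explained by nouns: 1
print(f"pos-noun contribution: {100 * noun / total:.1f}%")         # prints 50.0%
```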

## References

[1] Daniel Deutsch and Dan Roth. Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries. 2020.