Decomposed ROUGE is a collection of ROUGE-based metrics, each calculated for a different category, such as noun phrases, named entities, or nsubj dependencies. It was part of the analysis described in [1]. The category-level metrics give a finer-grained view of the differences between two summarization systems, for instance by showing that the F1 score on (subject, verb, object) tuples improved.
We have included only the ROUGE decomposition from [1], not the BERTScore decomposition, because the latter is more complicated and slower. Please see the paper's experiment repository if you want the BERTScore decomposition.
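As a rough illustration of what a category-level metric looks like, the sketch below computes a ROUGE-1-style precision, recall, and F1 restricted to tokens that carry a given category label. It is a simplification under stated assumptions, not the implementation from [1] or from this library: the function name `category_rouge1` and the pre-tagged `(token, categories)` input format are hypothetical, and in practice the labels would come from a parser such as spaCy.

```python
from collections import Counter


def category_rouge1(summary, reference, category):
    """ROUGE-1-style precision/recall/F1 restricted to one category.

    `summary` and `reference` are lists of (token, categories) pairs, where
    `categories` is a set of labels. This input format is hypothetical.
    """
    summ = Counter(tok for tok, cats in summary if category in cats)
    ref = Counter(tok for tok, cats in reference if category in cats)
    # Clipped unigram overlap: each token counts at most as often as it
    # appears in both the summary and the reference.
    matches = sum((summ & ref).values())
    precision = matches / sum(summ.values()) if summ else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


# Hypothetical, hand-tagged example data.
summary = [
    ("obama", {"ner", "pos-propn"}),
    ("visited", {"pos-verb"}),
    ("paris", {"ner", "pos-propn"}),
]
reference = [
    ("obama", {"ner", "pos-propn"}),
    ("toured", {"pos-verb"}),
    ("france", {"ner", "pos-propn"}),
]

print(category_rouge1(summary, reference, "ner"))
print(category_rouge1(summary, reference, "pos-verb"))
```

In this toy example the two summaries share one of two named entities but no verbs, so the "ner" score is nonzero while the "pos-verb" score is zero, which is the kind of difference a single overall ROUGE score would hide.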
Decomposed ROUGE has a dependency on ROUGE's dataset, so set up the ROUGE metric, then the DecomposedRouge metric:
```
sacrerouge setup-metric rouge
sacrerouge setup-metric decomposed-rouge
```
Here are the correlations of the category-specific metrics to the "overall responsiveness" scores on the TAC data. They were calculated using spaCy version 2.3.3 and model version 2.3.1. In the tables, r, p, and k denote the Pearson, Spearman, and Kendall correlation coefficients.
Summary-level, peers only:
Metric | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
---|---|---|---|---|---|---|---|---|---|---|---|---|
dep-dobj | 0.23 | 0.23 | 0.19 | 0.39 | 0.35 | 0.28 | 0.31 | 0.31 | 0.26 | 0.27 | 0.27 | 0.22 |
dep-nsubj | 0.15 | 0.15 | 0.12 | 0.34 | 0.28 | 0.22 | 0.29 | 0.27 | 0.22 | 0.24 | 0.20 | 0.16 |
dep-root | 0.11 | 0.11 | 0.10 | 0.25 | 0.18 | 0.15 | 0.21 | 0.21 | 0.18 | 0.24 | 0.23 | 0.19 |
dep-verb+dobj | 0.23 | 0.23 | 0.21 | 0.38 | 0.32 | 0.28 | 0.28 | 0.29 | 0.26 | 0.29 | 0.29 | 0.25 |
dep-verb+nsubj | 0.22 | 0.22 | 0.20 | 0.35 | 0.30 | 0.26 | 0.18 | 0.18 | 0.16 | 0.23 | 0.24 | 0.21 |
dep-verb+nsubj+dobj | 0.15 | 0.16 | 0.15 | 0.28 | 0.21 | 0.19 | 0.12 | 0.13 | 0.12 | 0.15 | 0.16 | 0.14 |
ner | 0.32 | 0.30 | 0.24 | 0.40 | 0.35 | 0.27 | 0.42 | 0.38 | 0.31 | 0.40 | 0.32 | 0.26 |
np-chunks | 0.45 | 0.44 | 0.35 | 0.54 | 0.48 | 0.38 | 0.65 | 0.62 | 0.50 | 0.57 | 0.46 | 0.37 |
pos-adj | 0.26 | 0.24 | 0.20 | 0.30 | 0.26 | 0.21 | 0.42 | 0.42 | 0.34 | 0.35 | 0.31 | 0.25 |
pos-adv | 0.06 | 0.07 | 0.06 | 0.16 | 0.14 | 0.12 | 0.13 | 0.12 | 0.11 | 0.14 | 0.16 | 0.14 |
pos-noun | 0.37 | 0.35 | 0.28 | 0.48 | 0.42 | 0.33 | 0.60 | 0.56 | 0.45 | 0.53 | 0.44 | 0.35 |
pos-num | 0.23 | 0.23 | 0.20 | 0.24 | 0.20 | 0.17 | 0.25 | 0.27 | 0.23 | 0.29 | 0.29 | 0.24 |
pos-propn | 0.34 | 0.34 | 0.27 | 0.45 | 0.37 | 0.29 | 0.47 | 0.43 | 0.35 | 0.42 | 0.34 | 0.28 |
pos-verb | 0.29 | 0.29 | 0.23 | 0.42 | 0.36 | 0.28 | 0.46 | 0.44 | 0.36 | 0.45 | 0.40 | 0.32 |
rouge-1 | 0.49 | 0.48 | 0.39 | 0.54 | 0.47 | 0.38 | 0.66 | 0.65 | 0.53 | 0.59 | 0.52 | 0.42 |
stopwords | 0.24 | 0.23 | 0.18 | 0.36 | 0.28 | 0.21 | 0.46 | 0.38 | 0.30 | 0.48 | 0.33 | 0.26 |
Summary-level, peers + references:
Metric | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
---|---|---|---|---|---|---|---|---|---|---|---|---|
dep-dobj | 0.29 | 0.27 | 0.22 | 0.36 | 0.35 | 0.28 | 0.36 | 0.35 | 0.28 | 0.28 | 0.29 | 0.23 |
dep-nsubj | 0.25 | 0.21 | 0.17 | 0.37 | 0.32 | 0.25 | 0.36 | 0.33 | 0.26 | 0.25 | 0.22 | 0.18 |
dep-root | 0.26 | 0.20 | 0.17 | 0.31 | 0.25 | 0.20 | 0.33 | 0.29 | 0.25 | 0.32 | 0.29 | 0.24 |
dep-verb+dobj | 0.27 | 0.26 | 0.22 | 0.32 | 0.33 | 0.27 | 0.30 | 0.31 | 0.27 | 0.26 | 0.28 | 0.23 |
dep-verb+nsubj | 0.28 | 0.26 | 0.23 | 0.29 | 0.31 | 0.26 | 0.26 | 0.24 | 0.22 | 0.21 | 0.23 | 0.20 |
dep-verb+nsubj+dobj | 0.19 | 0.19 | 0.18 | 0.20 | 0.20 | 0.18 | 0.17 | 0.16 | 0.15 | 0.12 | 0.14 | 0.12 |
ner | 0.33 | 0.32 | 0.25 | 0.39 | 0.35 | 0.28 | 0.43 | 0.40 | 0.32 | 0.36 | 0.30 | 0.24 |
np-chunks | 0.51 | 0.48 | 0.39 | 0.53 | 0.51 | 0.41 | 0.66 | 0.65 | 0.53 | 0.54 | 0.47 | 0.37 |
pos-adj | 0.30 | 0.27 | 0.22 | 0.29 | 0.27 | 0.21 | 0.40 | 0.41 | 0.33 | 0.30 | 0.28 | 0.22 |
pos-adv | 0.08 | 0.08 | 0.08 | 0.09 | 0.10 | 0.08 | 0.14 | 0.14 | 0.12 | 0.12 | 0.13 | 0.11 |
pos-noun | 0.46 | 0.41 | 0.32 | 0.46 | 0.44 | 0.34 | 0.64 | 0.60 | 0.49 | 0.53 | 0.45 | 0.36 |
pos-num | 0.33 | 0.29 | 0.24 | 0.33 | 0.28 | 0.23 | 0.38 | 0.36 | 0.30 | 0.33 | 0.32 | 0.26 |
pos-propn | 0.38 | 0.37 | 0.29 | 0.46 | 0.40 | 0.31 | 0.49 | 0.45 | 0.37 | 0.42 | 0.35 | 0.28 |
pos-verb | 0.36 | 0.34 | 0.27 | 0.43 | 0.40 | 0.31 | 0.52 | 0.49 | 0.40 | 0.44 | 0.41 | 0.33 |
rouge-1 | 0.56 | 0.54 | 0.44 | 0.55 | 0.53 | 0.42 | 0.69 | 0.70 | 0.58 | 0.58 | 0.55 | 0.45 |
stopwords | 0.25 | 0.25 | 0.20 | 0.35 | 0.31 | 0.24 | 0.44 | 0.39 | 0.31 | 0.46 | 0.35 | 0.28 |
System-level, peers only:
Metric | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
---|---|---|---|---|---|---|---|---|---|---|---|---|
dep-dobj | 0.69 | 0.72 | 0.53 | 0.81 | 0.82 | 0.64 | 0.83 | 0.77 | 0.63 | 0.70 | 0.54 | 0.38 |
dep-nsubj | 0.50 | 0.52 | 0.37 | 0.69 | 0.61 | 0.43 | 0.85 | 0.76 | 0.59 | 0.64 | 0.33 | 0.24 |
dep-root | 0.42 | 0.49 | 0.35 | 0.56 | 0.49 | 0.35 | 0.42 | 0.56 | 0.41 | 0.63 | 0.47 | 0.33 |
dep-verb+dobj | 0.78 | 0.79 | 0.60 | 0.72 | 0.79 | 0.63 | 0.79 | 0.77 | 0.62 | 0.86 | 0.78 | 0.59 |
dep-verb+nsubj | 0.75 | 0.72 | 0.52 | 0.65 | 0.79 | 0.61 | 0.69 | 0.64 | 0.47 | 0.75 | 0.67 | 0.52 |
dep-verb+nsubj+dobj | 0.49 | 0.42 | 0.30 | 0.55 | 0.74 | 0.57 | 0.42 | 0.46 | 0.33 | 0.69 | 0.66 | 0.51 |
ner | 0.80 | 0.81 | 0.61 | 0.83 | 0.75 | 0.59 | 0.92 | 0.86 | 0.69 | 0.91 | 0.71 | 0.55 |
np-chunks | 0.78 | 0.79 | 0.60 | 0.85 | 0.79 | 0.61 | 0.92 | 0.89 | 0.78 | 0.90 | 0.71 | 0.54 |
pos-adj | 0.79 | 0.75 | 0.57 | 0.72 | 0.57 | 0.43 | 0.93 | 0.82 | 0.67 | 0.92 | 0.77 | 0.59 |
pos-adv | 0.46 | 0.46 | 0.32 | 0.65 | 0.52 | 0.36 | 0.83 | 0.85 | 0.67 | 0.68 | 0.46 | 0.33 |
pos-noun | 0.73 | 0.72 | 0.52 | 0.83 | 0.71 | 0.51 | 0.90 | 0.86 | 0.74 | 0.88 | 0.66 | 0.49 |
pos-num | 0.73 | 0.69 | 0.51 | 0.75 | 0.76 | 0.59 | 0.62 | 0.64 | 0.50 | 0.78 | 0.50 | 0.36 |
pos-propn | 0.76 | 0.78 | 0.58 | 0.83 | 0.74 | 0.58 | 0.89 | 0.83 | 0.67 | 0.89 | 0.68 | 0.52 |
pos-verb | 0.80 | 0.75 | 0.56 | 0.79 | 0.72 | 0.56 | 0.87 | 0.83 | 0.67 | 0.85 | 0.68 | 0.48 |
rouge-1 | 0.80 | 0.80 | 0.60 | 0.83 | 0.78 | 0.60 | 0.90 | 0.95 | 0.84 | 0.91 | 0.79 | 0.59 |
stopwords | 0.48 | 0.54 | 0.37 | 0.71 | 0.58 | 0.43 | 0.74 | 0.72 | 0.52 | 0.85 | 0.50 | 0.37 |
System-level, peers + references:
Metric | TAC2008 r | TAC2008 p | TAC2008 k | TAC2009 r | TAC2009 p | TAC2009 k | TAC2010 r | TAC2010 p | TAC2010 k | TAC2011 r | TAC2011 p | TAC2011 k |
---|---|---|---|---|---|---|---|---|---|---|---|---|
dep-dobj | 0.79 | 0.79 | 0.61 | 0.66 | 0.86 | 0.68 | 0.87 | 0.85 | 0.71 | 0.68 | 0.62 | 0.45 |
dep-nsubj | 0.80 | 0.66 | 0.50 | 0.74 | 0.72 | 0.54 | 0.84 | 0.82 | 0.66 | 0.60 | 0.49 | 0.36 |
dep-root | 0.87 | 0.66 | 0.49 | 0.68 | 0.65 | 0.49 | 0.81 | 0.72 | 0.56 | 0.80 | 0.64 | 0.48 |
dep-verb+dobj | 0.73 | 0.81 | 0.63 | 0.52 | 0.80 | 0.65 | 0.86 | 0.84 | 0.68 | 0.65 | 0.74 | 0.56 |
dep-verb+nsubj | 0.85 | 0.81 | 0.62 | 0.46 | 0.83 | 0.65 | 0.88 | 0.78 | 0.61 | 0.49 | 0.59 | 0.44 |
dep-verb+nsubj+dobj | 0.63 | 0.47 | 0.36 | 0.31 | 0.75 | 0.57 | 0.71 | 0.62 | 0.46 | 0.22 | 0.40 | 0.32 |
ner | 0.71 | 0.83 | 0.64 | 0.64 | 0.78 | 0.61 | 0.82 | 0.89 | 0.73 | 0.59 | 0.62 | 0.47 |
np-chunks | 0.79 | 0.85 | 0.68 | 0.69 | 0.85 | 0.67 | 0.84 | 0.93 | 0.82 | 0.66 | 0.76 | 0.59 |
pos-adj | 0.76 | 0.82 | 0.64 | 0.59 | 0.65 | 0.50 | 0.82 | 0.85 | 0.70 | 0.61 | 0.68 | 0.50 |
pos-adv | 0.52 | 0.49 | 0.35 | 0.11 | 0.29 | 0.21 | 0.71 | 0.76 | 0.62 | 0.31 | 0.31 | 0.22 |
pos-noun | 0.82 | 0.80 | 0.62 | 0.69 | 0.79 | 0.60 | 0.87 | 0.91 | 0.79 | 0.72 | 0.74 | 0.57 |
pos-num | 0.80 | 0.77 | 0.59 | 0.85 | 0.83 | 0.67 | 0.86 | 0.78 | 0.62 | 0.82 | 0.66 | 0.50 |
pos-propn | 0.75 | 0.83 | 0.63 | 0.72 | 0.81 | 0.64 | 0.84 | 0.87 | 0.72 | 0.69 | 0.68 | 0.53 |
pos-verb | 0.88 | 0.83 | 0.65 | 0.70 | 0.80 | 0.63 | 0.90 | 0.89 | 0.75 | 0.76 | 0.76 | 0.57 |
rouge-1 | 0.86 | 0.86 | 0.69 | 0.72 | 0.85 | 0.68 | 0.85 | 0.97 | 0.87 | 0.71 | 0.87 | 0.69 |
stopwords | 0.48 | 0.56 | 0.39 | 0.58 | 0.70 | 0.51 | 0.59 | 0.72 | 0.52 | 0.61 | 0.61 | 0.47 |
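The difference between the summary-level and system-level numbers above can be sketched as follows. This is a minimal illustration that assumes the common definitions (summary-level: correlate across systems for each input, then average; system-level: average each system's scores over inputs, then correlate once). It is not the library's own code, it shows only the Pearson coefficient, and the score matrices are made-up values.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def summary_level(metric, human):
    """Average, over inputs, of the per-input correlation across systems."""
    return mean(pearson(m_row, h_row) for m_row, h_row in zip(metric, human))


def system_level(metric, human):
    """Correlation between the per-system average scores."""
    metric_means = [mean(col) for col in zip(*metric)]
    human_means = [mean(col) for col in zip(*human)]
    return pearson(metric_means, human_means)


# Rows are inputs (topics), columns are systems. Values are hypothetical.
metric = [[0.2, 0.4, 0.6],
          [0.5, 0.3, 0.7]]
human = [[1, 2, 3],
         [2, 3, 4]]

print(summary_level(metric, human))
print(system_level(metric, human))
```

Because system-level correlation averages out per-summary noise before correlating, it is usually higher than the summary-level value, which matches the pattern in the tables. Swapping `pearson` for a rank-based coefficient would give the Spearman or Kendall variants.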
Here are the contributions of each category to the overall ROUGE score. Each number is the percentage of token matches that can be explained by the corresponding category. A token match can belong to more than one category, so the columns do not sum to 100. We believe that these numbers differ slightly from the results in the paper because they were computed with the spaCy en_core_web_sm model version 2.2.5, whereas the paper used 2.1.0.
Category | TAC2008 | TAC2009 | TAC2010 | TAC2011 |
---|---|---|---|---|
dep-dobj | 1.99 | 2.14 | 1.50 | 2.32 |
dep-nsubj | 3.92 | 3.91 | 3.01 | 3.32 |
dep-root | 1.13 | 1.25 | 1.40 | 1.46 |
dep-verb+dobj | 1.22 | 1.73 | 0.82 | 1.69 |
dep-verb+nsubj | 0.83 | 1.14 | 0.53 | 1.22 |
dep-verb+nsubj+dobj | 0.26 | 0.43 | 0.15 | 0.52 |
ner | 13.40 | 12.49 | 9.17 | 8.26 |
np-chunks | 58.98 | 57.67 | 54.02 | 54.66 |
pos-adj | 3.86 | 3.53 | 3.51 | 3.63 |
pos-adv | 0.36 | 0.43 | 0.50 | 0.55 |
pos-noun | 17.61 | 15.61 | 17.30 | 22.15 |
pos-num | 1.50 | 1.27 | 1.78 | 2.50 |
pos-propn | 15.38 | 14.74 | 11.13 | 9.81 |
pos-verb | 4.77 | 5.66 | 4.73 | 5.70 |
stopwords | 54.68 | 57.35 | 58.69 | 50.07 |
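The percentages above can be read as "matches labeled with this category, divided by all matches". The sketch below shows that arithmetic on made-up data. The function name `category_contributions` and the input format (one set of category labels per ROUGE token match) are assumptions for illustration, not the library's API.

```python
from collections import Counter


def category_contributions(matches):
    """Percent of token matches that carry each category label.

    `matches` is a list of category sets, one per ROUGE token match.
    This input format is hypothetical.
    """
    total = len(matches)
    counts = Counter(cat for cats in matches for cat in cats)
    return {cat: 100.0 * n / total for cat, n in counts.items()}


# Four hypothetical token matches, each tagged with its categories.
matches = [
    {"np-chunks", "pos-noun"},
    {"stopwords"},
    {"np-chunks", "stopwords"},
    {"pos-verb"},
]
contributions = category_contributions(matches)
for category in sorted(contributions):
    print(f"{category}: {contributions[category]:.2f}")
```

Since one match can carry several labels, the percentages in this example add up to more than 100, just as np-chunks and stopwords together exceed 100 in the table.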
[1] Daniel Deutsch and Dan Roth. Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries. 2020.