Decomposed ROUGE

Decomposed ROUGE is a collection of ROUGE-based metrics, each calculated over a different category of content, such as noun phrases, named entities (NER), or nominal subjects (nsubj). It was used in the analysis described in [1]. The category-level metrics give a better understanding of how two summarization systems differ, for instance, by showing that the F1 score on (subject, verb, object) tuples improved.
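To make the idea of a category-level score concrete, here is a minimal sketch. It assumes the summary and reference have already been reduced to the units of a single category, for example only the nouns, or (subject, verb, object) tuples from a dependency parse. It then scores them with clipped counts, as in ROUGE-1. The function name `category_rouge` is hypothetical and is not the sacrerouge API. The implementation used in [1] may extract and align units differently.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def category_rouge(
    summary_units: Iterable[Hashable],
    reference_units: Iterable[Hashable],
) -> Tuple[float, float, float]:
    """ROUGE-style precision, recall, and F1 over the units of one category.

    Units can be tokens (e.g. only nouns) or tuples such as
    (subject, verb, object). This is a hypothetical helper, not sacrerouge's API.
    """
    summary_counts = Counter(summary_units)
    reference_counts = Counter(reference_units)
    # Counter & Counter keeps min(count in summary, count in reference),
    # so each unit is matched at most as often as it appears in both.
    matches = sum((summary_counts & reference_counts).values())
    n_summary = sum(summary_counts.values())
    n_reference = sum(reference_counts.values())
    precision = matches / n_summary if n_summary else 0.0
    recall = matches / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with summary nouns `["storm", "city", "storm"]` and reference nouns `["storm", "coast", "city", "damage"]`, two units match, giving precision 2/3, recall 1/2, and F1 4/7. Passing lists of (subject, verb, object) tuples instead gives a tuple-level F1 in the same way.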

We have included only the ROUGE decomposition from [1], not the BERTScore decomposition, because the latter is more complicated and slower to run. If you need the BERTScore decomposition, please see the paper's experiment repository.

Setting Up

Decomposed ROUGE depends on the data installed by the ROUGE setup, so set up the ROUGE metric first and then the DecomposedRouge metric:

sacrerouge setup-metric rouge
sacrerouge setup-metric decomposed-rouge

Correlations

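Below are the correlations between the category-specific metrics and the "overall responsiveness" scores on TAC2008 through TAC2011. In each table, r, p, and k are the Pearson, Spearman, and Kendall correlation coefficients. "Peers only" includes just the automatic summarization systems, while "peers + references" also includes the human-written reference summaries. The correlations were calculated using spaCy version 2.3.3 and model version 2.3.1.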
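The tables distinguish summary-level from system-level correlations. The sketch below shows the common definitions under an assumed data layout, `scores[system][instance]`. At the summary level, scores are correlated across systems separately for each input, and the results are then averaged. At the system level, each system's scores are first averaged over all inputs, and those averages are then correlated. All names here are hypothetical. sacrerouge's own implementation may differ in details, such as how it handles inputs where the correlation is undefined.

```python
import math
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Hypothetical layout: scores[system][instance] is the score of one system's
# summary for one input document set.
Scores = Dict[str, Dict[str, float]]
CorrFn = Callable[[Sequence[float], Sequence[float]], float]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's r, or NaN when either input has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def summary_level(metric: Scores, human: Scores, corr: CorrFn = pearson) -> float:
    """For each input, correlate scores across systems; then average over inputs."""
    systems = sorted(metric)
    instances = sorted(metric[systems[0]])
    per_input: List[float] = []
    for inst in instances:
        xs = [metric[s][inst] for s in systems]
        ys = [human[s][inst] for s in systems]
        value = corr(xs, ys)
        if not math.isnan(value):  # skip inputs where the correlation is undefined
            per_input.append(value)
    return mean(per_input)


def system_level(metric: Scores, human: Scores, corr: CorrFn = pearson) -> float:
    """Average each system's scores over inputs, then correlate the averages."""
    systems = sorted(metric)
    xs = [mean(metric[s].values()) for s in systems]
    ys = [mean(human[s].values()) for s in systems]
    return corr(xs, ys)
```

The Spearman (p) and Kendall (k) columns correspond to swapping in a different `corr`, for example `lambda x, y: scipy.stats.spearmanr(x, y)[0]` or `lambda x, y: scipy.stats.kendalltau(x, y)[0]`.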

Summary-level, peers only:

	TAC2008			TAC2009			TAC2010			TAC2011
	r	p	k	r	p	k	r	p	k	r	p	k
dep-dobj	0.23	0.23	0.19	0.39	0.35	0.28	0.31	0.31	0.26	0.27	0.27	0.22
dep-nsubj	0.15	0.15	0.12	0.34	0.28	0.22	0.29	0.27	0.22	0.24	0.20	0.16
dep-root	0.11	0.11	0.10	0.25	0.18	0.15	0.21	0.21	0.18	0.24	0.23	0.19
dep-verb+dobj	0.23	0.23	0.21	0.38	0.32	0.28	0.28	0.29	0.26	0.29	0.29	0.25
dep-verb+nsubj	0.22	0.22	0.20	0.35	0.30	0.26	0.18	0.18	0.16	0.23	0.24	0.21
dep-verb+nsubj+dobj	0.15	0.16	0.15	0.28	0.21	0.19	0.12	0.13	0.12	0.15	0.16	0.14
ner	0.32	0.30	0.24	0.40	0.35	0.27	0.42	0.38	0.31	0.40	0.32	0.26
np-chunks	0.45	0.44	0.35	0.54	0.48	0.38	0.65	0.62	0.50	0.57	0.46	0.37
pos-adj	0.26	0.24	0.20	0.30	0.26	0.21	0.42	0.42	0.34	0.35	0.31	0.25
pos-adv	0.06	0.07	0.06	0.16	0.14	0.12	0.13	0.12	0.11	0.14	0.16	0.14
pos-noun	0.37	0.35	0.28	0.48	0.42	0.33	0.60	0.56	0.45	0.53	0.44	0.35
pos-num	0.23	0.23	0.20	0.24	0.20	0.17	0.25	0.27	0.23	0.29	0.29	0.24
pos-propn	0.34	0.34	0.27	0.45	0.37	0.29	0.47	0.43	0.35	0.42	0.34	0.28
pos-verb	0.29	0.29	0.23	0.42	0.36	0.28	0.46	0.44	0.36	0.45	0.40	0.32
rouge-1	0.49	0.48	0.39	0.54	0.47	0.38	0.66	0.65	0.53	0.59	0.52	0.42
stopwords	0.24	0.23	0.18	0.36	0.28	0.21	0.46	0.38	0.30	0.48	0.33	0.26

Summary-level, peers + references:

	TAC2008			TAC2009			TAC2010			TAC2011
	r	p	k	r	p	k	r	p	k	r	p	k
dep-dobj	0.29	0.27	0.22	0.36	0.35	0.28	0.36	0.35	0.28	0.28	0.29	0.23
dep-nsubj	0.25	0.21	0.17	0.37	0.32	0.25	0.36	0.33	0.26	0.25	0.22	0.18
dep-root	0.26	0.20	0.17	0.31	0.25	0.20	0.33	0.29	0.25	0.32	0.29	0.24
dep-verb+dobj	0.27	0.26	0.22	0.32	0.33	0.27	0.30	0.31	0.27	0.26	0.28	0.23
dep-verb+nsubj	0.28	0.26	0.23	0.29	0.31	0.26	0.26	0.24	0.22	0.21	0.23	0.20
dep-verb+nsubj+dobj	0.19	0.19	0.18	0.20	0.20	0.18	0.17	0.16	0.15	0.12	0.14	0.12
ner	0.33	0.32	0.25	0.39	0.35	0.28	0.43	0.40	0.32	0.36	0.30	0.24
np-chunks	0.51	0.48	0.39	0.53	0.51	0.41	0.66	0.65	0.53	0.54	0.47	0.37
pos-adj	0.30	0.27	0.22	0.29	0.27	0.21	0.40	0.41	0.33	0.30	0.28	0.22
pos-adv	0.08	0.08	0.08	0.09	0.10	0.08	0.14	0.14	0.12	0.12	0.13	0.11
pos-noun	0.46	0.41	0.32	0.46	0.44	0.34	0.64	0.60	0.49	0.53	0.45	0.36
pos-num	0.33	0.29	0.24	0.33	0.28	0.23	0.38	0.36	0.30	0.33	0.32	0.26
pos-propn	0.38	0.37	0.29	0.46	0.40	0.31	0.49	0.45	0.37	0.42	0.35	0.28
pos-verb	0.36	0.34	0.27	0.43	0.40	0.31	0.52	0.49	0.40	0.44	0.41	0.33
rouge-1	0.56	0.54	0.44	0.55	0.53	0.42	0.69	0.70	0.58	0.58	0.55	0.45
stopwords	0.25	0.25	0.20	0.35	0.31	0.24	0.44	0.39	0.31	0.46	0.35	0.28

System-level, peers only:

	TAC2008			TAC2009			TAC2010			TAC2011
	r	p	k	r	p	k	r	p	k	r	p	k
dep-dobj	0.69	0.72	0.53	0.81	0.82	0.64	0.83	0.77	0.63	0.70	0.54	0.38
dep-nsubj	0.50	0.52	0.37	0.69	0.61	0.43	0.85	0.76	0.59	0.64	0.33	0.24
dep-root	0.42	0.49	0.35	0.56	0.49	0.35	0.42	0.56	0.41	0.63	0.47	0.33
dep-verb+dobj	0.78	0.79	0.60	0.72	0.79	0.63	0.79	0.77	0.62	0.86	0.78	0.59
dep-verb+nsubj	0.75	0.72	0.52	0.65	0.79	0.61	0.69	0.64	0.47	0.75	0.67	0.52
dep-verb+nsubj+dobj	0.49	0.42	0.30	0.55	0.74	0.57	0.42	0.46	0.33	0.69	0.66	0.51
ner	0.80	0.81	0.61	0.83	0.75	0.59	0.92	0.86	0.69	0.91	0.71	0.55
np-chunks	0.78	0.79	0.60	0.85	0.79	0.61	0.92	0.89	0.78	0.90	0.71	0.54
pos-adj	0.79	0.75	0.57	0.72	0.57	0.43	0.93	0.82	0.67	0.92	0.77	0.59
pos-adv	0.46	0.46	0.32	0.65	0.52	0.36	0.83	0.85	0.67	0.68	0.46	0.33
pos-noun	0.73	0.72	0.52	0.83	0.71	0.51	0.90	0.86	0.74	0.88	0.66	0.49
pos-num	0.73	0.69	0.51	0.75	0.76	0.59	0.62	0.64	0.50	0.78	0.50	0.36
pos-propn	0.76	0.78	0.58	0.83	0.74	0.58	0.89	0.83	0.67	0.89	0.68	0.52
pos-verb	0.80	0.75	0.56	0.79	0.72	0.56	0.87	0.83	0.67	0.85	0.68	0.48
rouge-1	0.80	0.80	0.60	0.83	0.78	0.60	0.90	0.95	0.84	0.91	0.79	0.59
stopwords	0.48	0.54	0.37	0.71	0.58	0.43	0.74	0.72	0.52	0.85	0.50	0.37

System-level, peers + references:

	TAC2008			TAC2009			TAC2010			TAC2011
	r	p	k	r	p	k	r	p	k	r	p	k
dep-dobj	0.79	0.79	0.61	0.66	0.86	0.68	0.87	0.85	0.71	0.68	0.62	0.45
dep-nsubj	0.80	0.66	0.50	0.74	0.72	0.54	0.84	0.82	0.66	0.60	0.49	0.36
dep-root	0.87	0.66	0.49	0.68	0.65	0.49	0.81	0.72	0.56	0.80	0.64	0.48
dep-verb+dobj	0.73	0.81	0.63	0.52	0.80	0.65	0.86	0.84	0.68	0.65	0.74	0.56
dep-verb+nsubj	0.85	0.81	0.62	0.46	0.83	0.65	0.88	0.78	0.61	0.49	0.59	0.44
dep-verb+nsubj+dobj	0.63	0.47	0.36	0.31	0.75	0.57	0.71	0.62	0.46	0.22	0.40	0.32
ner	0.71	0.83	0.64	0.64	0.78	0.61	0.82	0.89	0.73	0.59	0.62	0.47
np-chunks	0.79	0.85	0.68	0.69	0.85	0.67	0.84	0.93	0.82	0.66	0.76	0.59
pos-adj	0.76	0.82	0.64	0.59	0.65	0.50	0.82	0.85	0.70	0.61	0.68	0.50
pos-adv	0.52	0.49	0.35	0.11	0.29	0.21	0.71	0.76	0.62	0.31	0.31	0.22
pos-noun	0.82	0.80	0.62	0.69	0.79	0.60	0.87	0.91	0.79	0.72	0.74	0.57
pos-num	0.80	0.77	0.59	0.85	0.83	0.67	0.86	0.78	0.62	0.82	0.66	0.50
pos-propn	0.75	0.83	0.63	0.72	0.81	0.64	0.84	0.87	0.72	0.69	0.68	0.53
pos-verb	0.88	0.83	0.65	0.70	0.80	0.63	0.90	0.89	0.75	0.76	0.76	0.57
rouge-1	0.86	0.86	0.69	0.72	0.85	0.68	0.85	0.97	0.87	0.71	0.87	0.69
stopwords	0.48	0.56	0.39	0.58	0.70	0.51	0.59	0.72	0.52	0.61	0.61	0.47

Contributions

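The table below shows how much each category contributes to the overall ROUGE score. Each number is the percentage of ROUGE token matches that can be explained by the corresponding category. A token can belong to several categories, such as a noun inside a noun-phrase chunk, so the percentages in a column do not sum to 100. We believe these numbers differ slightly from the results in the paper because they were computed with spaCy's en_core_web_sm model version 2.2.5, whereas the paper used version 2.1.0.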
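A contribution percentage of this kind can be sketched as follows. The sketch assumes the ROUGE token matches have already been computed and pooled over all summaries in a dataset. It also assumes that each match carries the set of categories its token belongs to, for example from spaCy annotations. The function `category_contributions` is hypothetical and not part of sacrerouge. It only illustrates the "percent of matches explained by a category" calculation.

```python
from collections import Counter
from typing import Dict, List, Set


def category_contributions(matches: List[Set[str]]) -> Dict[str, float]:
    """Percent of all ROUGE token matches that fall into each category.

    `matches` holds one entry per matched token: the set of categories
    (e.g. "pos-noun", "np-chunks", "stopwords") that the token belongs to.
    A token can be in several categories, so the percentages can sum to
    more than 100.
    """
    if not matches:
        return {}
    counts = Counter(category for categories in matches for category in categories)
    total = len(matches)
    return {category: 100.0 * n / total for category, n in sorted(counts.items())}
```

With four matches tagged `{"pos-noun", "np-chunks"}`, `{"stopwords", "np-chunks"}`, `{"pos-verb"}`, and `{"stopwords"}`, np-chunks and stopwords each explain 50% of the matches. The total of 150% reflects the same overlap that appears in the table.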

	TAC2008	TAC2009	TAC2010	TAC2011
dep-dobj	1.99	2.14	1.50	2.32
dep-nsubj	3.92	3.91	3.01	3.32
dep-root	1.13	1.25	1.40	1.46
dep-verb+dobj	1.22	1.73	0.82	1.69
dep-verb+nsubj	0.83	1.14	0.53	1.22
dep-verb+nsubj+dobj	0.26	0.43	0.15	0.52
ner	13.40	12.49	9.17	8.26
np-chunks	58.98	57.67	54.02	54.66
pos-adj	3.86	3.53	3.51	3.63
pos-adv	0.36	0.43	0.50	0.55
pos-noun	17.61	15.61	17.30	22.15
pos-num	1.50	1.27	1.78	2.50
pos-propn	15.38	14.74	11.13	9.81
pos-verb	4.77	5.66	4.73	5.70
stopwords	54.68	57.35	58.69	50.07

References

[1] Daniel Deutsch and Dan Roth. Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries. 2020.
