Bug fix CNN/DM and XSum initialization #23

alejorba · 2024-12-06T16:03:21Z

This PR builds on the changes made in PR #16, where a bug related to the CNNDM variable in summac/benchmark.py was identified.

Specifically, when the CNNDM variable is undefined, lines 44 and 54 can raise a NameError: name 'CNNDM' is not defined:

summac/summac/benchmark.py

Lines 41 to 61 in 9e4f357

    
    def get_cnndm_document(self, aid):
        global CNNDM
        if self.cnndm is None:
            if CNNDM is None:
                CNNDM = load_dataset("cnn_dailymail", "3.0.0")
            self.cnndm = CNNDM
            self.cnndm_id2article = {}
            for cut in ["test", "validation"]:
                self.cnndm_id2article.update({d["id"]: d["article"] for d in self.cnndm[cut]})
        return self.cnndm_id2article[aid]

    def get_cnndm_reference(self, aid):
        global CNNDM
        if CNNDM is None:
            CNNDM = load_dataset("cnn_dailymail", "3.0.0")
            self.cnndm = CNNDM
        if self.cnndm_id2reference is None:
            self.cnndm_id2reference = {}
            for cut in ["test", "validation"]:
                self.cnndm_id2reference.update({d["id"]: d["highlights"] for d in self.cnndm[cut]})
        return self.cnndm_id2reference[aid]
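
The failure mode in the snippet above can be reproduced in isolation. The following is a minimal sketch, not code from the repository: `CACHE`, `read_cache`, and `describe_call` are hypothetical names used only for illustration. It shows that a `global` statement declares scope but does not create the name, so reading it before any module-level assignment raises a NameError.

```python
def read_cache():
    # `global` only says "use the module-level name"; it does not define it.
    global CACHE  # hypothetical name standing in for CNNDM
    if CACHE is None:  # NameError here if CACHE was never assigned
        CACHE = "loaded"
    return CACHE


def describe_call():
    # Report either the cached value or the error text.
    try:
        return read_cache()
    except NameError as exc:
        return f"NameError: {exc}"


before = describe_call()  # CACHE has not been assigned at module level yet
CACHE = None              # defining the sentinel up front avoids the error
after = describe_call()   # now the lazy-load branch runs normally
```

With the module-level `CACHE = None` in place, the lazy-loading check behaves as intended; without it, the first comparison fails.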

The fix proposed by @forrestbao effectively resolves this issue, but during testing I noticed some performance concerns.

To address this, I propose an alternative fix that leverages class variables. Through this new approach:

The CNN/DM and XSum datasets are only loaded when an instance of the SummaCBenchmark class is created.
The datasets are loaded only once and reused across all instances of the class.
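
The idea of a class-level cache can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `BenchmarkSketch`, `_datasets`, and `fake_loader` are hypothetical names, and the injected `loader` stands in for `load_dataset` so that the example runs without downloading anything.

```python
class BenchmarkSketch:
    """Illustrative stand-in for SummaCBenchmark (hypothetical, simplified)."""

    # Class variable: a single cache shared by every instance.
    _datasets = {}

    def __init__(self, loader):
        # `loader` is an assumption of this sketch; it plays the role of load_dataset.
        cls = type(self)
        if "cnndm" not in cls._datasets:
            # Loaded on first instantiation only, then reused.
            cls._datasets["cnndm"] = loader("cnn_dailymail", "3.0.0")
        self.cnndm = cls._datasets["cnndm"]
        self.cnndm_id2article = None

    def get_cnndm_document(self, aid):
        # Build the id -> article index lazily, per instance.
        if self.cnndm_id2article is None:
            self.cnndm_id2article = {}
            for cut in ["test", "validation"]:
                self.cnndm_id2article.update(
                    {d["id"]: d["article"] for d in self.cnndm[cut]}
                )
        return self.cnndm_id2article[aid]


calls = []  # records how many times the (fake) dataset load happens


def fake_loader(name, version):
    # Tiny in-memory substitute for the real dataset.
    calls.append((name, version))
    return {
        "test": [{"id": "a1", "article": "doc one", "highlights": "ref one"}],
        "validation": [{"id": "a2", "article": "doc two", "highlights": "ref two"}],
    }


first = BenchmarkSketch(fake_loader)
second = BenchmarkSketch(fake_loader)
```

Because the cache lives on the class, no module-level global has to exist before a method runs, and creating a second instance does not trigger a second load.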

In addition, the GDrive link for the SummEval dataset provided in lines 271-276 (apparently from the 4/19/2020 update on the README.md file of the original repo https://github.com/Yale-LILY/SummEval) is broken.

summac/summac/benchmark.py

Lines 271 to 276 in 9e4f357

    
    if not os.path.exists(dataset_folder):
        print("==== SummEval dataset not found, downloading from scratch")
        os.makedirs(dataset_folder)
        # From the 4/19/2020 update on the README: https://github.com/Yale-LILY/SummEval
        download_file_from_google_drive("1d2Iaz3jNraURP1i7CfTqPIj8REZMJ3tS", fn)

I replaced it with a valid GCS bucket link, which can be found in the README.md file of the same repo under the "Human annotations" header (https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl).
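
Fetching a file from a plain HTTPS URL such as the GCS link needs no Google Drive helper. The sketch below is one possible way to do it with the standard library and is not necessarily how this PR implements the download: `download_if_missing` and `SUMMEVAL_URL` are hypothetical names introduced here, and only the URL itself comes from the description above.

```python
import os
import shutil
import urllib.request

# URL quoted in the PR description; the constant name is an assumption of this sketch.
SUMMEVAL_URL = "https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl"


def download_if_missing(url, dest_path):
    """Download `url` to `dest_path` unless it already exists.

    Returns True if a download happened, False if the file was already present.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # does not leave a truncated file at the final path.
    tmp_path = dest_path + ".part"
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    os.replace(tmp_path, dest_path)
    return True
```

A call such as `download_if_missing(SUMMEVAL_URL, fn)` would then take the place of the Google Drive download, with `fn` being the target path already used in the snippet above.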

Changes

Fixed the NameError by modifying dataset loading in the SummaCBenchmark class.
Replaced the broken GDrive link for the SummEval dataset with a working GCS bucket link.

alejorba added 2 commits December 6, 2024 11:50

Fixed bug in initializing CNN/DM and XSum

c0ccad3

Fixed issue with loading SummEval dataset, gdrive file no longer available

02f2139
