Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix CNN/DM and XSum initialization #23

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

alejorba
Copy link

@alejorba alejorba commented Dec 6, 2024

This PR builds on the changes made in PR #16, where a bug related to the CNNDM variable in summac/benchmark.py was identified.

Specifically, when the CNNDM variable is undefined, lines 44 and 54 can raise a NameError: name 'CNNDM' is not defined:

def get_cnndm_document(self, aid):
global CNNDM
if self.cnndm is None:
if CNNDM is None:
CNNDM = load_dataset("cnn_dailymail", "3.0.0")
self.cnndm = CNNDM
self.cnndm_id2article = {}
for cut in ["test", "validation"]:
self.cnndm_id2article.update({d["id"]: d["article"] for d in self.cnndm[cut]})
return self.cnndm_id2article[aid]
def get_cnndm_reference(self, aid):
global CNNDM
if CNNDM is None:
CNNDM = load_dataset("cnn_dailymail", "3.0.0")
self.cnndm = CNNDM
if self.cnndm_id2reference is None:
self.cnndm_id2reference = {}
for cut in ["test", "validation"]:
self.cnndm_id2reference.update({d["id"]: d["highlights"] for d in self.cnndm[cut]})
return self.cnndm_id2reference[aid]

The fix proposed by @forrestbao effectively resolve this issu, but during testing I noticed some performance concerns.>

To address this, I propose an alternative fix that leverages class variables. Through this new approach:

  • The CNN/DM and XSum datasets are only loaded when an instance of the SummaCBenchmark class is created.
  • The datasets are loaded only once and reused across all instances of the class.

In addition, the GDrive link for the SummEval dataset provided in lines 271-276 (apparently from the 4/19/2020 update on the README.md file of the original repo https://github.com/Yale-LILY/SummEval) is broken.

summac/summac/benchmark.py

Lines 271 to 276 in 9e4f357

if not os.path.exists(dataset_folder):
print("==== SummEval dataset not found, downloading from scratch")
os.makedirs(dataset_folder)
# From the 4/19/2020 update on the README: https://github.com/Yale-LILY/SummEval
download_file_from_google_drive("1d2Iaz3jNraURP1i7CfTqPIj8REZMJ3tS", fn)

I replaced it with a valid GCS bucket link that can be found on the README.md file of the same repo under the "Human annotations" header (https://storage.googleapis.com/sfr-summarization-repo-research/model_annotations.aligned.jsonl)

Changes

  • Fixed NameError by modified dataset loading in the SummaCBenchmark class.
  • Replaced broken GDrive link with a working GCS bucket link, for the SummEval dataset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant