
Incorrect self-BLEU Computation #69

Open
zzbuzzard opened this issue Dec 20, 2023 · 0 comments

@zzbuzzard
Hi, I believe the self-BLEU computation in this paper is incorrect. Self-BLEU is computed in the following way:

import numpy as np

# get_bleu and distinct_n_gram_inter_sent are helpers defined
# elsewhere in the repository.
def diversityOfSet(sentences):
    selfBleu = []
    for i, sentence in enumerate(sentences):
        # Note: only ordered pairs (i, j) with j > i are scored.
        for j in range(i + 1, len(sentences)):
            score = get_bleu(sentence, sentences[j])
            selfBleu.append(score)
    if len(selfBleu) == 0:
        selfBleu.append(0)
    div4 = distinct_n_gram_inter_sent(sentences, 4)
    return np.mean(selfBleu), div4

That is, you report the mean value of

sum_{i=1}^{n} sum_{j=i+1}^{n} sentence_bleu(sent[i], sent[j])

Two issues here:

  1. BLEU is not symmetric, so in general sentence_bleu(sent[i], sent[j]) != sentence_bleu(sent[j], sent[i]). The inner loop should therefore run over all j != i rather than starting at i+1, so that both (i, j) and (j, i) are scored. For example, your code never computes sentence_bleu(sent[2], sent[1]), although it should.
  2. BLEU is a corpus-level, not a sentence-level, metric. You should therefore use corpus_bleu, taking each sentence in turn as the hypothesis and all other sentences as the references, rather than averaging pairwise sentence_bleu scores. The two produce extremely different results, especially with the smoothing you use.
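For concreteness, the asymmetry in point 1 can be checked directly with nltk; the toy sentences and the method1 smoothing below are my own illustrative choices, not taken from the paper's code:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
a = "the cat sat on the mat".split()
b = "the cat sat".split()

# Swapping hypothesis and reference changes the score, partly
# because BLEU's brevity penalty only applies to the hypothesis.
forward = sentence_bleu([a], b, smoothing_function=smooth)
backward = sentence_bleu([b], a, smoothing_function=smooth)
print(forward, backward)  # two different values
```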

Using my own implementation with nltk's corpus_bleu on your 10 provided outputs (thank you for releasing these, it is very helpful for evaluation!), I get a self-BLEU score of 0.5654, more than twice the value reported in the paper!
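A minimal sketch of the corpus-level computation described in point 2, assuming nltk's corpus_bleu with method1 smoothing; the function name self_bleu and the toy outputs are illustrative, not the repository's code:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def self_bleu(sentences):
    """sentences: list of token lists (one list per generated output)."""
    smooth = SmoothingFunction().method1
    hypotheses = sentences
    # For sentence i, the references are all other sentences j != i.
    references = [
        [s for j, s in enumerate(sentences) if j != i]
        for i in range(len(sentences))
    ]
    # One corpus_bleu call aggregates n-gram counts across all
    # hypotheses, instead of averaging per-pair sentence scores.
    return corpus_bleu(references, hypotheses, smoothing_function=smooth)

outputs = [
    "the cat sat on the mat".split(),
    "a cat sat on a mat".split(),
    "dogs run in the park".split(),
]
score = self_bleu(outputs)
print(score)
```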
