
Incorrect self-BLEU Computation #69

Open
zzbuzzard opened this issue Dec 20, 2023 · 0 comments

@zzbuzzard
Hi, I believe the self-BLEU computation in this paper is incorrect. Self-BLEU is computed in the following way:

import numpy as np

# get_bleu and distinct_n_gram_inter_sent are helpers defined
# elsewhere in the repository.
def diversityOfSet(sentences):
    selfBleu = []
    for i, sentence in enumerate(sentences):
        # Note: only ordered pairs (i, j) with j > i are scored.
        for j in range(i + 1, len(sentences)):
            score = get_bleu(sentence, sentences[j])
            selfBleu.append(score)
    if len(selfBleu) == 0:
        selfBleu.append(0)
    div4 = distinct_n_gram_inter_sent(sentences, 4)
    return np.mean(selfBleu), div4

That is, you report the mean value of

sum_{i=1}^{n} sum_{j=i+1}^{n} sentence_bleu(sent[i], sent[j])

Two issues here:

  1. BLEU is not symmetric, so in general sentence_bleu(sent[i], sent[j]) != sentence_bleu(sent[j], sent[i]). The inner loop should therefore run over all j != i rather than starting at i+1, so that both (i, j) and (j, i) are scored. For example, your code never computes sentence_bleu(sent[2], sent[1]), although it should.
  2. BLEU is a corpus-level, not a sentence-level, metric. You should therefore use corpus_bleu, taking each sentence in turn as the hypothesis and all other sentences as the references, rather than averaging pairwise sentence_bleu scores. The two produce extremely different results, especially with the smoothing you use.
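For concreteness, the asymmetry in point 1 can be checked directly with nltk; the toy sentences and the method1 smoothing below are my own illustrative choices, not taken from the paper's code:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
a = "the cat sat on the mat".split()
b = "the cat sat".split()

# Swapping hypothesis and reference changes the score, partly
# because BLEU's brevity penalty only applies to the hypothesis.
forward = sentence_bleu([a], b, smoothing_function=smooth)
backward = sentence_bleu([b], a, smoothing_function=smooth)
print(forward, backward)  # two different values
```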

Using my own implementation with nltk's corpus_bleu on your 10 provided outputs (thank you for releasing these, it is very helpful for evaluation!), I get a self-BLEU score of 0.5654, more than twice the value reported in the paper!
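A minimal sketch of the corpus-level computation described in point 2, assuming nltk's corpus_bleu with method1 smoothing; the function name self_bleu and the toy outputs are illustrative, not the repository's code:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def self_bleu(sentences):
    """sentences: list of token lists (one list per generated output)."""
    smooth = SmoothingFunction().method1
    hypotheses = sentences
    # For sentence i, the references are all other sentences j != i.
    references = [
        [s for j, s in enumerate(sentences) if j != i]
        for i in range(len(sentences))
    ]
    # One corpus_bleu call aggregates n-gram counts across all
    # hypotheses, instead of averaging per-pair sentence scores.
    return corpus_bleu(references, hypotheses, smoothing_function=smooth)

outputs = [
    "the cat sat on the mat".split(),
    "a cat sat on a mat".split(),
    "dogs run in the park".split(),
]
score = self_bleu(outputs)
print(score)
```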
