Hi, I believe the self-BLEU computation in this paper is incorrect. Self-BLEU is computed in the following way:

    sum i=1->n
      sum j=i+1->n
        sentence_bleu(sent[i], sent[j])

That is, you report the mean value of `sentence_bleu(sent[i], sent[j])` over all pairs with i < j.

Two issues here:

1. BLEU is not symmetric, so `sentence_bleu(sent[i], sent[j]) != sentence_bleu(sent[j], sent[i])`. Therefore the second loop should start from 1, not i+1 (skipping j = i), as we need to compute for both (i, j) and (j, i). For example, your code never computes `sentence_bleu(sent[2], sent[1])`, although it should.
2. BLEU is supposed to be a corpus-level metric, not a sentence-level one. You should therefore use `corpus_bleu`, with the hypothesis equal to one sentence and the references equal to all other sentences, rather than averaging the `sentence_bleu` scores. This produces extremely different results, especially when using smoothing as you do.

Using my own implementation on your provided 10 outputs (thank you for releasing these, it is very helpful for evaluation!), with `nltk.corpus_bleu`, I get a self-BLEU score of 0.5654, over twice as high as the one reported in the paper!
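For concreteness, here is a minimal sketch of the corpus-level computation proposed above. It is not the implementation behind the 0.5654 figure and will not reproduce that number: `corpus_bleu_sketch` and `self_bleu` are hypothetical names, the BLEU here is a simplified, unsmoothed, pure-Python version with uniform weights, and pooling all n hypotheses into one corpus-level call is only one reading of the proposal. The argument order mirrors NLTK's `corpus_bleu(list_of_references, hypotheses)`, so that function could be swapped in.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(list_of_references, hypotheses, max_n=4):
    """Simplified, unsmoothed corpus-level BLEU with uniform weights.

    list_of_references[k] holds the reference token lists for hypotheses[k].
    """
    matched = [0] * max_n   # clipped n-gram matches, summed over the corpus
    total = [0] * max_n     # hypothesis n-gram counts, summed over the corpus
    hyp_len = ref_len = 0
    for refs, hyp in zip(list_of_references, hypotheses):
        hyp_len += len(hyp)
        # Brevity penalty uses the reference length closest to the hypothesis.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)   # element-wise max across references
            matched[n - 1] += sum((hyp_counts & max_ref).values())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0   # no smoothing: any n-gram order without matches gives zero
    log_prec = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)


def self_bleu(sents, max_n=4):
    """Score every sentence against all the others, pooled at corpus level."""
    refs = [sents[:i] + sents[i + 1:] for i in range(len(sents))]
    return corpus_bleu_sketch(refs, sents, max_n)


if __name__ == "__main__":
    a = "the cat sat on the mat".split()
    b = "the cat sat on the mat today".split()
    # Swapping hypothesis and reference changes the score (asymmetry).
    print(corpus_bleu_sketch([[b]], [a]), corpus_bleu_sketch([[a]], [b]))
    print(self_bleu([a, b, "a dog sat on the mat today".split()]))
```

Because each sentence appears once as a hypothesis with all the others as references, every ordered pair (i, j) with i ≠ j contributes, which addresses the first issue as well as the second.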