More robust Reader eval by limiting max answers and creating no answer labels #331

Merged 3 commits into master on Aug 26, 2020

Conversation

brandenchan (Contributor) commented on Aug 20, 2020

When evaluating with Reader.eval(), we read data from a file, store it in an ElasticsearchDocumentStore, reconstruct input dictionaries and then evaluate using a farm.Evaluator object.

Two problems arose, both linked to the DocumentStore.

  1. When a duplicate question is asked on the same document (see the SQuAD dev set question "What is the CJEU's duty?"), indexing into and fetching from the DocumentStore causes the labels to be merged. This can result in a list of answers longer than the default maximum number of answers (6). A quick solution is implemented here to reduce the list down to 6.

  2. Only span labels are indexed into the DocumentStore and fetched, meaning that we lose the no_answer samples. This PR ensures that questions which don't have a corresponding label in the DocumentStore are assigned a "no_answer" label.

Neither of these problems affects Reader.eval_on_file(). A sketch of both fixes is given below.
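
A minimal sketch of both fixes, assuming labels are handled as plain dicts grouped by question; the helper names, label fields and the max_answers constant are illustrative, not the exact code in this PR:

MAX_ANSWERS = 6  # default maximum number of answers per question

def limit_merged_answers(answers, max_answers=MAX_ANSWERS):
    # Fix 1: duplicate questions on the same document have their answers merged
    # in the DocumentStore, so truncate any over-long answer list.
    return answers[:max_answers]

def add_no_answer_labels(questions, labels_by_question):
    # Fix 2: only span labels survive the round trip through the DocumentStore,
    # so give every question without a stored label an explicit no_answer label.
    for question in questions:
        if question not in labels_by_question:
            labels_by_question[question] = [{"answer": "", "is_impossible": True}]
    return labels_by_question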

This PR also resolves a performance discrepancy between Reader.eval(), Reader.eval_on_file() and farm.Evaluator. All three now report the following performance:

 model = deepset/roberta-base-squad2
 data = nq_dev_subset_v2.json

 Reader Top-N-Accuracy: 0.7222222222222222
 Reader Exact Match: 0.4074074074074074
 Reader F1-Score: 0.4463589193057792
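
As an illustration of how such numbers can be reproduced, here is a sketch against the Haystack 0.x FARMReader API as of this PR; the import path, the eval_on_file argument names and the local data path are assumptions, so adjust them for your setup:

from haystack.reader.farm import FARMReader

# Same model as in the numbers above
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Evaluate directly on the SQuAD-format file, bypassing the DocumentStore
metrics = reader.eval_on_file(data_dir="data", test_filename="nq_dev_subset_v2.json", device="cpu")
print(metrics)  # top-n accuracy, EM and F1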

brandenchan (Contributor, Author) commented on Aug 21, 2020

The latest commit resolves the performance discrepancies between Reader.eval() with no_answers and Reader.eval_on_file(). Previously, some labels were being lost while indexing into and retrieving from Elasticsearch.

brandenchan (Contributor, Author) commented
Evaluation in FARM uses a default of QuestionAnsweringPredictionHead.n_best = 5.

Haystack's Reader defaults to top_k_per_candidate = 3 and consequently QuestionAnsweringPredictionHead.n_best = 4.

The discrepancy between the two repos is resolved by setting Reader.top_k_per_candidate = 4.
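
For illustration, the alignment can be made explicit when constructing the reader; this is a sketch against the Haystack 0.x FARMReader constructor, whose exact signature is assumed here:

from haystack.reader.farm import FARMReader

# top_k_per_candidate = 4 implies QuestionAnsweringPredictionHead.n_best = 5,
# matching FARM's evaluation default
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                    top_k_per_candidate=4)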

brandenchan changed the title from "WIP: More robust Reader eval by limiting max answers and creating no answer labels" to "More robust Reader eval by limiting max answers and creating no answer labels" on Aug 25, 2020
brandenchan requested a review from tholor on Aug 25, 2020, 12:46
tholor (Member) left a comment:
Only a tiny comment on the warning message. The rest looks great!

if self.top_k_per_candidate != 4:
    logger.warning(f"Performing Evaluation using top_k_per_candidate = {self.top_k_per_candidate} \n"
                   f"and consequently, QuestionAnsweringPredictionHead.n_best = {self.top_k_per_candidate + 1}. \n"
                   f"This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5")
tholor (Member) commented on this snippet:
  • What is the impact of this? As a user, it is unclear to me why this might be a problem. Is it only the top_k_accuracy metric that will be different to eval in FARM? If yes, I would rather make it an info than a warning.

brandenchan merged commit 0ad22d5 into master on Aug 26, 2020
julian-risch deleted the robust_eval branch on Nov 15, 2021