More robust Reader eval by limiting max answers and creating no answer labels #331

Merged 3 commits into master on Aug 26, 2020

Conversation

brandenchan (Contributor) commented on Aug 20, 2020

When evaluating with Reader.eval(), we read data from a file, store it in an ElasticsearchDocumentStore, reconstruct input dictionaries and then evaluate using a farm.Evaluator object.

Two problems arose, both linked to the DocumentStore.

  1. When a duplicate question is asked on the same document (see the SQuAD dev set question "What is the CJEU's duty?"), indexing into and fetching from the DocumentStore causes the labels to be merged. This can result in a list of answers longer than the default maximum number of answers (6). A quick solution is implemented here to reduce the list down to 6.

  2. Only span labels are indexed into the DocumentStore and fetched, meaning that we lose the no_answer samples. This PR ensures that questions which don't have a corresponding label in the DocumentStore are assigned a "no_answer" label.

Neither of these problems affects Reader.eval_on_file(). A sketch of both fixes is given below.
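
A minimal sketch of both fixes, assuming labels are handled as plain dicts grouped by question; the helper names, label fields and the max_answers constant are illustrative, not the exact code in this PR:

MAX_ANSWERS = 6  # default maximum number of answers per question

def limit_merged_answers(answers, max_answers=MAX_ANSWERS):
    # Fix 1: duplicate questions on the same document have their answers merged
    # in the DocumentStore, so truncate any over-long answer list.
    return answers[:max_answers]

def add_no_answer_labels(questions, labels_by_question):
    # Fix 2: only span labels survive the round trip through the DocumentStore,
    # so give every question without a stored label an explicit no_answer label.
    for question in questions:
        if question not in labels_by_question:
            labels_by_question[question] = [{"answer": "", "is_impossible": True}]
    return labels_by_question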

This PR also resolves a performance discrepancy between Reader.eval(), Reader.eval_on_file() and farm.Evaluator. All three now report the following performance:

 model = deepset/roberta-base-squad2
 data = nq_dev_subset_v2.json

 Reader Top-N-Accuracy: 0.7222222222222222
 Reader Exact Match: 0.4074074074074074
 Reader F1-Score: 0.4463589193057792
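
As an illustration of how such numbers can be reproduced, here is a sketch against the Haystack 0.x FARMReader API as of this PR; the import path, the eval_on_file argument names and the local data path are assumptions, so adjust them for your setup:

from haystack.reader.farm import FARMReader

# Same model as in the numbers above
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Evaluate directly on the SQuAD-format file, bypassing the DocumentStore
metrics = reader.eval_on_file(data_dir="data", test_filename="nq_dev_subset_v2.json", device="cpu")
print(metrics)  # top-n accuracy, EM and F1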

brandenchan (Contributor, Author) commented on Aug 21, 2020

The latest commit resolves the performance discrepancies between Reader.eval() with no_answers and Reader.eval_on_file(). Previously, some labels were being lost while indexing into and retrieving from Elasticsearch.

brandenchan (Contributor, Author) commented
Evaluation in FARM uses a default of QuestionAnsweringPredictionHead.n_best = 5.

Haystack's Reader defaults to top_k_per_candidate = 3 and consequently QuestionAnsweringPredictionHead.n_best = 4.

The discrepancy between the two repos is resolved by setting Reader.top_k_per_candidate = 4.
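
For illustration, the alignment can be made explicit when constructing the reader; this is a sketch against the Haystack 0.x FARMReader constructor, whose exact signature is assumed here:

from haystack.reader.farm import FARMReader

# top_k_per_candidate = 4 implies QuestionAnsweringPredictionHead.n_best = 5,
# matching FARM's evaluation default
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                    top_k_per_candidate=4)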

brandenchan changed the title from "WIP: More robust Reader eval by limiting max answers and creating no answer labels" to "More robust Reader eval by limiting max answers and creating no answer labels" on Aug 25, 2020
brandenchan requested a review from tholor on Aug 25, 2020, 12:46
tholor (Member) left a comment:
Only a tiny comment on the warning message. The rest looks great!

if self.top_k_per_candidate != 4:
    logger.warning(f"Performing Evaluation using top_k_per_candidate = {self.top_k_per_candidate} \n"
                   f"and consequently, QuestionAnsweringPredictionHead.n_best = {self.top_k_per_candidate + 1}. \n"
                   f"This deviates from FARM's default where QuestionAnsweringPredictionHead.n_best = 5")
tholor (Member) commented on this snippet:
  • What is the impact of this? As a user, it is unclear to me why this might be a problem. Is it only the top_k_accuracy metric that will be different to eval in FARM? If yes, I would rather make it an info than a warning.

brandenchan merged commit 0ad22d5 into master on Aug 26, 2020
julian-risch deleted the robust_eval branch on Nov 15, 2021