More robust Reader eval by limiting max answers and creating no answer labels #331
When evaluating with Reader.eval(), we read data from file, store it in an ElasticsearchDocumentStore, reconstruct input dictionaries, and then evaluate using a farm.Evaluator object.
Two problems arose, both linked to the DocumentStore.
When the same question is asked twice on the same document (see the SQuAD dev set question "What is the CJEU's duty?"), indexing into and fetching from the DocumentStore causes the labels to be merged. This can produce a list of answers longer than the default maximum number of answers (6). A quick fix is implemented here that truncates the list down to 6.
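The truncation step can be sketched as follows. This is a minimal illustration, not the actual implementation; the function name and the constant name `MAX_NUM_ANSWERS` are assumptions made for the example.

```python
# Hypothetical sketch of the fix: after duplicate questions on the same
# document have had their labels merged, cap the answer list at the
# default maximum number of answers (6). Names are illustrative only.
MAX_NUM_ANSWERS = 6

def truncate_answers(answers, max_answers=MAX_NUM_ANSWERS):
    """Return at most `max_answers` answers for a single question."""
    return answers[:max_answers]

# A merged label list with 7 answers is reduced back to 6.
merged = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
print(truncate_answers(merged))
```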
Only span labels are indexed into the DocumentStore and fetched, meaning that the no_answer samples are lost. This PR ensures that questions which don't have a corresponding label in the DocumentStore are assigned a "no_answer" label.
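The idea behind the no_answer handling can be sketched like this. The function and the label dictionary shape are hypothetical, made up for illustration; they are not Haystack's actual data structures.

```python
# Hypothetical sketch: any question with no span label in the DocumentStore
# gets a synthetic "no_answer" label (empty text, zero offsets) so that
# unanswerable samples are not silently dropped from evaluation.
def fill_no_answer_labels(questions, span_labels):
    """span_labels maps question -> list of span answers fetched from the
    DocumentStore; questions with no entry receive a no_answer label."""
    labels = {}
    for q in questions:
        answers = span_labels.get(q)
        if answers:
            labels[q] = answers
        else:
            labels[q] = [{"text": "", "offset_start": 0,
                          "offset_end": 0, "no_answer": True}]
    return labels

# "q2" has no span label, so it is assigned a no_answer label.
labels = fill_no_answer_labels(
    ["q1", "q2"],
    {"q1": [{"text": "Paris", "offset_start": 10,
             "offset_end": 15, "no_answer": False}]},
)
```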
Neither of these problems affects Reader.eval_on_file().
This PR also resolves a performance discrepancy between Reader.eval(), Reader.eval_on_file(), and farm.Evaluator. All three now report the following performance: