Using PreProcessor functions on eval data #751
Conversation
LGTM
Great @Timoeller! Thank you very much for this work, I was looking into doing something similar on my side! Are you working towards an Optimizer class that would take an evaluation file as input and tweak several parameters (e.g. mainly length of documents, retriever type, values for the retriever and reader k's ...) to adjust the settings based on the customer's data and push something optimized to production? I would love to know more about this if that is the plan, or to help in developing this solution!
Hey @Rob192, thanks a lot. I am impressed by how fast new features are picked up by the community :D About benchmarking: have you seen our https://haystack.deepset.ai/bm/benchmarks/ already? I could envision a small section about passage splitting there as well. @brandenchan do you have thoughts on this? If you want to create a more detailed analysis, I think it would be awesome if you could write a small blog post. There you could compare different preprocessing steps in combination with different Retrievers. (I think Reader performance won't be so useful here.)
Yes, I have seen the benchmark already, it is very useful. Here at the LabIA of Etalab, we think this analysis is not only something you should perform on SQuAD to get general insights, but also something that should be performed with the data of the final user. The performance of the stack really depends on the size of the knowledge base and on the typical questions the user gets from their customers. We are using a grid search to find the best parameters for our dataset. @psorianom has written a report on some of our experiments here: https://etalab-ia.github.io/knowledge-base/piaf/retriever/Visualisation_csv_results.html For the moment, and as you said, we are mainly testing retriever performance; the main outcome is that filtering the knowledge base (based on folders and subfolders that the user can manually choose) has a big impact on performance. Up to now this process is still rather handcrafted, but we are working to standardize it a bit, so that when you start a new project with new data it is easy to generate these metrics and set up an optimized Haystack.
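For reference, here is a rough sketch of the kind of parameter sweep described above, evaluating the retriever against an already-indexed eval set. The import paths, index names, and the metric keys returned by `retriever.eval()` are assumed from the Haystack documentation of that era and may differ in other versions; treat this as an illustration rather than a ready-made optimizer.

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever

# Eval documents and labels are assumed to be already indexed
# (e.g. via document_store.add_eval_data as added in this PR).
document_store = ElasticsearchDocumentStore(index="eval_document", label_index="label")
retriever = ElasticsearchRetriever(document_store=document_store)

# Sweep the retriever's top_k and keep the recall for each setting
recalls = {}
for top_k in [1, 3, 5, 10, 20]:
    metrics = retriever.eval(label_index="label", doc_index="eval_document", top_k=top_k)
    recalls[top_k] = metrics["recall"]

best_k = max(recalls, key=recalls.get)
print(f"Best retriever top_k on this dataset: {best_k} (recall={recalls[best_k]:.3f})")
```

The same loop can be wrapped around different preprocessing settings or retriever types to cover the other parameters mentioned above.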
Hello @Timoeller! I think there is a small bug in the new eval data preprocessing.
Then the variable …
In this specific case, my answer is 'Oui' and the position of my answer is 0. However, the …
Traceback (most recent call last): …
I am not sure what the best solution would be: either make sure that the first document is indeed filled, or declare …
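To make the failure mode concrete, here is a small self-contained illustration (hypothetical splitting and offset-mapping logic, not the actual Haystack code): if the splitter emits an empty first document, an answer that starts at character offset 0 gets mapped to that empty document and can no longer be located in it.

```python
# Hypothetical illustration only -- not the Haystack implementation.
answer, answer_start = "Oui", 0          # answer at character offset 0, as above

# Suppose the splitter emits an empty leading chunk (e.g. because the raw
# document started with its passage delimiter):
passages = ["", "Oui, sous certaines conditions."]

# Naive offset-to-passage mapping: pick the first passage whose span ends
# at or after the answer offset.
offset = 0
for passage in passages:
    if answer_start <= offset + len(passage):
        break
    offset += len(passage)

# The answer is expected inside the selected passage, but that passage is empty:
new_start = passage.index(answer)        # raises ValueError: substring not found
```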
Thanks for pointing us to your evaluation of different retriever settings. You seem to have done quite a detailed analysis already. I will dig into it later on; for example, I do not yet understand why the scores are so low. On which data did you do this analysis? About the bug: this is indeed problematic, thanks for bringing it up. I think it is best dealt with in the preprocessor class, since that is where a "useless" document is created that then results in the mentioned bug. I created issue #763 and will work on it in the coming week.
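For illustration, the kind of guard this could amount to on the preprocessor side (a hypothetical helper, not the actual fix tracked in #763): drop empty splits before documents and labels are created from them.

```python
def drop_empty_splits(splits):
    """Remove splits that contain no text so that no "useless" document
    (and no label pointing at it) ends up in the document store."""
    return [s for s in splits if s.strip()]

# ["", "Oui, sous conditions.", "   "]  ->  ["Oui, sous conditions."]
```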
Hello, the dataset used for testing the retriever was composed of questions that were directly asked by citizens on service-public.fr. In the dataset you can also find the documents indicated by the civil agents as answers. We think this is very harsh testing, as the questions are sometimes unclear or the answer is not explicitly found in the document. Sometimes you need to extrapolate a bit to understand how the document helps you answer the question. It is very different from a SQuAD-type dataset where the questions are tailor-made for the contexts!
This PR adds preprocessing functionality to evaluation data.
That way we can more easily test different preprocessing configurations and their impact on performance.
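A minimal usage sketch, assuming an Elasticsearch backend and a SQuAD-format eval file; the file name is a placeholder, and the exact `PreProcessor` / `add_eval_data` parameter names follow the Haystack version this PR targets and may differ elsewhere:

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.preprocessor import PreProcessor

# Split eval documents into ~200-word passages while indexing them, so the
# evaluation reflects the same preprocessing that will be used in production.
preprocessor = PreProcessor(split_by="word", split_length=200, split_overlap=0)

document_store = ElasticsearchDocumentStore()
document_store.add_eval_data(
    filename="eval_data.json",      # placeholder: any SQuAD-format annotation file
    doc_index="eval_document",
    label_index="label",
    preprocessor=preprocessor,
)
```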
Open Todos:
[x] real data testing (tested on COVIDQA.json and splitting by word and passage)
[x] changed test data (tiny.json) and made sure other tests do not break
[x] additional test case
More improvements that won't be covered in this PR:
Fixes #711