Integrate BEIR #2333
LGTM 👍 It's a nice small feature that can be very helpful. I just left two comments to slightly improve the code: one is about documentation and the other about our custom Errors. Feel free to merge once this is addressed.
haystack/pipelines/base.py
Outdated
    from beir.datasets.data_loader import GenericDataLoader
    from beir.retrieval.evaluation import EvaluateRetrieval
except:
    raise PipelineError("beir is not installed. Please run `pip install beir`...")
Should this maybe be a different kind of error? I mean, there is nothing wrong with the pipeline, so PipelineError might be confusing.
Done: changed to HaystackError.
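The pattern settled on in this thread (guard the optional import and raise a general library error instead of a pipeline-specific one) could look roughly like the sketch below. `HaystackError` is a local stand-in class defined only for this example, and `import_optional` is a hypothetical helper, not code from the PR.

```python
import importlib
from types import ModuleType


class HaystackError(Exception):
    """Stand-in for Haystack's generic error type (assumption: defined locally for this sketch)."""


def import_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or raise HaystackError with an install hint.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Catch only the "module is missing" case rather than using a bare
        # `except:`, so other kinds of exceptions are not silently relabelled.
        raise HaystackError(
            f"{module_name} is not installed. Please run `{install_hint}`."
        ) from exc
```

A call such as `import_optional("beir", "pip install beir")` would then either return the module or fail with a message that points at the missing dependency rather than at the pipeline.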
) -> Tuple[Dict[str, float], Dict[str, float], Dict[str, float], Dict[str, float]]:
    """
    Runs information retrieval evaluation of a pipeline using BEIR on a specified BEIR dataset.
    See https://github.com/beir-cellar/beir for more information.
Maybe it's worth mentioning that an index beir_{dataset} will be created/deleted. At first, I was afraid that users could easily delete their already indexed data when trying out this new feature. Later on I learned that there is no such risk.
I had to change this again, since index creation (and thus ensuring appropriate field mappings) is only done at document store init time. So I got rid of the dedicated index and left it up to the user. If the user provides a non-empty index, a HaystackError will be thrown.
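The safeguard described here, refusing to run against an index that already holds documents, might be sketched as follows. Everything in this block is an assumption for illustration: `InMemoryStore` is a toy stand-in for a document store, `HaystackError` is redefined locally, and `ensure_empty_index` is a hypothetical name, not the check used in the PR.

```python
from typing import Dict, List, Optional


class HaystackError(Exception):
    """Stand-in for Haystack's generic error type (assumption: defined locally for this sketch)."""


class InMemoryStore:
    """Toy document store (hypothetical) that keeps a list of documents per index name."""

    def __init__(self, index: str = "document") -> None:
        self.index = index
        self.indexes: Dict[str, List[dict]] = {index: []}

    def get_document_count(self, index: Optional[str] = None) -> int:
        # Fall back to the store's default index when none is given.
        return len(self.indexes.get(index or self.index, []))


def ensure_empty_index(store: InMemoryStore, index: Optional[str] = None) -> None:
    """Raise HaystackError if the target index already contains documents."""
    count = store.get_document_count(index=index)
    if count > 0:
        raise HaystackError(
            f"Index '{index or store.index}' is not empty ({count} documents). "
            "Please provide an empty index for the evaluation."
        )
```

Checking before writing anything means existing user data is never touched: the evaluation simply does not start on a populated index.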
@julian-risch Would be great if you could quickly scan over them.
LGTM 👍 One remark: We could think about having index = index or self.index in all the delete_index() methods. Many of the other methods in DocumentStore allow the index parameter to be None and in that case use the index attribute of the DocumentStore itself. In my opinion, it would make sense to have index = index or self.index in all the delete_index() methods for consistency reasons. If I can call delete_documents() then why not also delete_index() without specifying the index explicitly?
I thought about that. The reason I chose not to implement it this way is that if you delete the default index, in most cases you cannot use it properly until you reinstantiate the DocumentStore (e.g. ElasticSearch: the mapping will not be set and gets lost completely; or FAISS: the index is gone and you cannot recreate it without reinstantiating FAISSDocumentStore).
I guess we should show a warning in that case anyway...
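For illustration, the reviewer's `index = index or self.index` fallback combined with the warning suggested in the reply could look like the sketch below. This is not what the PR implements (the author deliberately kept the index argument explicit); `ToyDocumentStore` is a hypothetical stand-in, not a Haystack class.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger(__name__)


class ToyDocumentStore:
    """Hypothetical in-memory store used only to illustrate the fallback pattern."""

    def __init__(self, index: str = "document") -> None:
        self.index = index
        self.indexes: Dict[str, List[dict]] = {index: []}

    def delete_index(self, index: Optional[str] = None) -> None:
        # Fall back to the store's default index, as other methods do.
        index = index or self.index
        if index == self.index:
            # Deleting the default index can leave the store unusable
            # until it is reinstantiated, so make that visible.
            logger.warning(
                "Deleting default index '%s'. The document store may need to be "
                "reinstantiated before it can be used again.",
                index,
            )
        self.indexes.pop(index, None)
```

The trade-off raised in the thread still applies: the fallback is consistent with the other methods, but it makes it easier to drop the default index by accident, which is why the warning matters.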
Proposed changes:
Pipeline.eval_beir()
beir as optional dependency
How to:
Status (please check what you already did):