Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate BEIR #2333

Merged
merged 16 commits into from
Mar 21, 2022
Merged

Integrate BEIR #2333

merged 16 commits into from
Mar 21, 2022

Conversation

tstadel
Copy link
Member

@tstadel tstadel commented Mar 18, 2022

Proposed changes:

  • add Pipeline.eval_beir()
  • add beir as optional dependency

How to:

import logging

logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

from haystack.pipelines import DocumentSearchPipeline, Pipeline
from haystack.nodes import TextConverter, ElasticsearchRetriever
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

text_converter = TextConverter()
document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)

index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])

query_pipeline = DocumentSearchPipeline(retriever=retriever)

ndcg, _map, recall, precision = Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
)

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code

@tstadel tstadel requested a review from julian-risch March 18, 2022 16:25
@tstadel tstadel marked this pull request as ready for review March 18, 2022 18:11
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 It's a nice small feature that can be very helpful. I just left two comments to slightly improve the code: one is about documentation and the other about our custom Errors. Feel free to merge once this is addressed.

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
except:
raise PipelineError("beir is not installed. Please run `pip install beir`...")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this maybe be a different kind of error? I mean, there is nothing wrong with the pipeline so PipelineError might be confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: Changed to HaystackError

) -> Tuple[Dict[str, float], Dict[str, float], Dict[str, float], Dict[str, float]]:
"""
Runs information retrieval evaluation of a pipeline using BEIR on a specified BEIR dataset.
See https://github.com/beir-cellar/beir for more information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth mentioning that an index beir_{dataset} will be created/deleted. At first, I was afraid that users will delete their already indexed data easily when trying this new feature out. Later on I learned that there is no such risk.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change this again since index creation and thus ensuring appropriate field mappings is only done at document store init time. So I got rid of the dedicated index and leave it up to the user. If the user provides a non-empty index a HaystackError will be thrown.

@tstadel
Copy link
Member Author

tstadel commented Mar 21, 2022

@julian-risch
I made some additional changes:

  • get rid of fixed index name: now user is in full control. If index is non-empty, an exception will be thrown.
  • streamline DocumentStore.delete_index throughout all document stores
  • introduce keep_index param to keep the index beir has been evaluated on (for futher analysis)

Would be great if you can quickly scan over them.

@tstadel tstadel requested a review from julian-risch March 21, 2022 15:15
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 One remark: We could think about having index = index or self.index in all the delete_index() methods. Many of the other methods in DocumentStore allow the index parameter to be None and in that case use the index attribute of the DocumentStore itself. In my opinion, it would make sense to have index = index or self.index in all the delete_index() methods for consistency reasons. If I can call delete_documents() then why not also delete_index() without specifying the index explicitly? D

@tstadel
Copy link
Member Author

tstadel commented Mar 21, 2022

LGTM +1 One remark: We could think about having index = index or self.index in all the delete_index() methods. Many of the other methods in DocumentStore allow the index parameter to be None and in that case use the index attribute of the DocumentStore itself. In my opinion, it would make sense to have index = index or self.index in all the delete_index() methods for consistency reasons. If I can call delete_documents() then why not also delete_index() without specifying the index explicitly? D

I thought about that: reason why I chose to not implement it this way is that if you deleted the default index, in most cases you cannot use it properly until you reinstantiated the DocumentStore: (e.g. ElasticSearch: the mapping will not be set and gets lost completely, or FAISS: index is gone and you cannot recreate it unless reinstantiating FAISSDocumentStore).
With delete_documents we do not face such a problem. You can work on the default index perfectly fine as before.

@tstadel
Copy link
Member Author

tstadel commented Mar 21, 2022

LGTM +1 One remark: We could think about having index = index or self.index in all the delete_index() methods. Many of the other methods in DocumentStore allow the index parameter to be None and in that case use the index attribute of the DocumentStore itself. In my opinion, it would make sense to have index = index or self.index in all the delete_index() methods for consistency reasons. If I can call delete_documents() then why not also delete_index() without specifying the index explicitly? D

I thought about that: reason why I chose to not implement it this way is that if you deleted the default index, in most cases you cannot use it properly until you reinstantiated the DocumentStore: (e.g. ElasticSearch: the mapping will not be set and gets lost completely, or FAISS: index is gone and you cannot recreate it unless reinstantiating FAISSDocumentStore). With delete_documents we do not face such a problem. You can work on the default index perfectly fine as before.

I guess we should show a warning in that case anyway...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:eval type:feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants