Integrate BEIR #2333

tstadel · 2022-03-18T16:25:00Z

Proposed changes:

add Pipeline.eval_beir()
add beir as optional dependency

How to:

import logging

logging.basicConfig(format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO)

from haystack.pipelines import DocumentSearchPipeline, Pipeline
from haystack.nodes import TextConverter, ElasticsearchRetriever
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore

text_converter = TextConverter()
document_store = ElasticsearchDocumentStore(search_fields=["content", "name"], index="scifact_beir")
retriever = ElasticsearchRetriever(document_store=document_store, top_k=1000)

index_pipeline = Pipeline()
index_pipeline.add_node(text_converter, name="TextConverter", inputs=["File"])
index_pipeline.add_node(document_store, name="DocumentStore", inputs=["TextConverter"])

query_pipeline = DocumentSearchPipeline(retriever=retriever)

ndcg, _map, recall, precision = Pipeline.eval_beir(
    index_pipeline=index_pipeline, query_pipeline=query_pipeline, dataset="scifact"
)
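To read the results, a small helper can pull out all four metrics for one cutoff. This is only a sketch: it assumes the four returned dictionaries use BEIR-style keys such as `NDCG@10`, `MAP@10`, `Recall@10` and `P@10`, which this PR does not state explicitly. The helper name `metrics_at_k` is hypothetical and the numbers are placeholders, not measured results.

```python
from typing import Dict


def metrics_at_k(
    ndcg: Dict[str, float],
    _map: Dict[str, float],
    recall: Dict[str, float],
    precision: Dict[str, float],
    k: int,
) -> Dict[str, float]:
    """Collect the four metrics for a single cutoff k (hypothetical helper)."""
    # Assumed key format: "<METRIC>@<k>", as used by BEIR's evaluation output.
    return {
        "ndcg": ndcg[f"NDCG@{k}"],
        "map": _map[f"MAP@{k}"],
        "recall": recall[f"Recall@{k}"],
        "precision": precision[f"P@{k}"],
    }


# Placeholder values for illustration only.
example = metrics_at_k(
    {"NDCG@10": 0.5}, {"MAP@10": 0.4}, {"Recall@10": 0.7}, {"P@10": 0.1}, k=10
)
print(example)
```

If the keys turn out to differ, only the four f-strings need to change.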

Status (please check what you already did):

First draft (up for discussions & feedback)
Final code


julian-risch

LGTM 👍 It's a nice small feature that can be very helpful. I just left two comments to slightly improve the code: one is about documentation and the other about our custom Errors. Feel free to merge once this is addressed.

julian-risch · 2022-03-21T10:18:10Z

haystack/pipelines/base.py

+            from beir.datasets.data_loader import GenericDataLoader
+            from beir.retrieval.evaluation import EvaluateRetrieval
+        except:
+            raise PipelineError("beir is not installed. Please run `pip install beir`...")


Should this maybe be a different kind of error? I mean, there is nothing wrong with the pipeline so PipelineError might be confusing.

done: Changed to HaystackError
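The pattern discussed here, guarding an optional import and raising a generic error rather than a pipeline-specific one, could look roughly like the sketch below. `HaystackError` is redefined locally as a stand-in so the snippet runs on its own, and `import_optional` is a hypothetical helper, not a function from the PR. It also catches `ImportError` specifically instead of using a bare `except`.

```python
import importlib
from types import ModuleType


class HaystackError(Exception):
    """Stand-in for Haystack's generic error type (defined here only so the sketch runs)."""


def import_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or raise a generic, non-pipeline error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Nothing is wrong with the pipeline itself, so a generic error is raised.
        raise HaystackError(
            f"{module_name} is not installed. Please run `{install_hint}`."
        ) from err


# A standard-library module is used here so the example always succeeds.
json_module = import_optional("json", "n/a")
```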

julian-risch · 2022-03-21T10:27:01Z

haystack/pipelines/base.py

+    ) -> Tuple[Dict[str, float], Dict[str, float], Dict[str, float], Dict[str, float]]:
+        """
+        Runs information retrieval evaluation of a pipeline using BEIR on a specified BEIR dataset.
+        See https://github.com/beir-cellar/beir for more information.


Maybe it's worth mentioning that an index beir_{dataset} will be created/deleted. At first, I was afraid that users could easily delete their already indexed data when trying out this new feature. Later on I learned that there is no such risk.

I had to change this again since index creation and thus ensuring appropriate field mappings is only done at document store init time. So I got rid of the dedicated index and leave it up to the user. If the user provides a non-empty index a HaystackError will be thrown.
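A minimal sketch of such a guard, assuming the document store exposes a `get_document_count()` method: `ToyDocumentStore` and `ensure_empty_index` are hypothetical names used only for illustration, and `HaystackError` is a local stand-in. The check in the merged code may look different.

```python
from typing import Dict, List, Optional


class HaystackError(Exception):
    """Stand-in for Haystack's generic error type (defined here only so the sketch runs)."""


class ToyDocumentStore:
    """Hypothetical in-memory store, just enough to demonstrate the guard."""

    def __init__(self, index: str = "document") -> None:
        self.index = index
        self.indexes: Dict[str, List[str]] = {index: []}

    def get_document_count(self, index: Optional[str] = None) -> int:
        return len(self.indexes.get(index or self.index, []))


def ensure_empty_index(store: ToyDocumentStore) -> None:
    """Refuse to evaluate on an index that already holds documents."""
    count = store.get_document_count()
    if count > 0:
        raise HaystackError(
            f"Index '{store.index}' is not empty ({count} documents). "
            "Please provide an empty index for the evaluation."
        )


store = ToyDocumentStore(index="scifact_beir")
ensure_empty_index(store)  # passes: the index is empty
```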


tstadel · 2022-03-21T15:13:26Z

@julian-risch
I made some additional changes:

get rid of fixed index name: now user is in full control. If index is non-empty, an exception will be thrown.
streamline DocumentStore.delete_index throughout all document stores
introduce keep_index param to keep the index beir has been evaluated on (for further analysis)
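The keep_index behaviour in the list above can be sketched as follows. This is an illustration under assumptions, not the merged implementation: `ToyDocumentStore` and `evaluate_on_index` are hypothetical, and the "evaluation" is a placeholder that just counts documents.

```python
from typing import Dict, List


class ToyDocumentStore:
    """Hypothetical in-memory store with the two methods the sketch needs."""

    def __init__(self, index: str) -> None:
        self.index = index
        self.indexes: Dict[str, List[str]] = {index: []}

    def write_documents(self, docs: List[str]) -> None:
        self.indexes.setdefault(self.index, []).extend(docs)

    def delete_index(self, index: str) -> None:
        self.indexes.pop(index, None)


def evaluate_on_index(store: ToyDocumentStore, docs: List[str], keep_index: bool = False) -> int:
    """Index the corpus, run a placeholder evaluation, then clean up unless keep_index is set."""
    try:
        store.write_documents(docs)
        return len(store.indexes[store.index])  # placeholder for the real evaluation
    finally:
        # Clean up even if the evaluation fails, unless the user wants to inspect the index.
        if not keep_index:
            store.delete_index(store.index)


kept = ToyDocumentStore(index="scifact_beir")
evaluate_on_index(kept, ["doc a", "doc b"], keep_index=True)
```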

Would be great if you can quickly scan over them.

julian-risch

LGTM 👍 One remark: We could think about having index = index or self.index in all the delete_index() methods. Many of the other methods in DocumentStore allow the index parameter to be None and in that case use the index attribute of the DocumentStore itself. In my opinion, it would make sense to have index = index or self.index in all the delete_index() methods for consistency reasons. If I can call delete_documents() then why not also delete_index() without specifying the index explicitly?

tstadel · 2022-03-21T15:32:35Z

I thought about that. The reason I chose not to implement it this way is that once you delete the default index, in most cases you cannot use it properly until you reinstantiate the DocumentStore. For example, with Elasticsearch the mapping is not set again and is lost completely; with FAISS the index is gone and you cannot recreate it without reinstantiating FAISSDocumentStore.
With delete_documents we do not face this problem: you can keep working on the default index exactly as before.

tstadel · 2022-03-21T15:35:58Z

I guess we should show a warning in that case anyway...
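One way this could look, as a rough sketch: delete_index() keeps an explicit index argument, following the reasoning above, and logs a warning when the default index is the one being deleted. `ToyDocumentStore` is a hypothetical stand-in and the warning text is made up for illustration; the merged code may differ.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


class ToyDocumentStore:
    """Hypothetical in-memory store used only to illustrate the warning."""

    def __init__(self, index: str = "document") -> None:
        self.index = index
        self.indexes: Dict[str, List[str]] = {index: []}

    def delete_index(self, index: str) -> None:
        # The index must be named explicitly; there is no fallback to self.index.
        if index == self.index:
            logger.warning(
                "Deleting the default index '%s'. The store may not be usable "
                "until it is reinstantiated.",
                index,
            )
        self.indexes.pop(index, None)
```

Calling `store.delete_index(store.index)` then still works, but it is a visible, deliberate choice rather than a silent default.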

tstadel added 2 commits March 18, 2022 17:20

introduce eval_beir() to Pipeline

5c3053f

add beir dependency

7d25b2b

tstadel requested a review from julian-risch March 18, 2022 16:25

github-actions bot and others added 5 commits March 18, 2022 16:28

Update Documentation & Code Style

dee8a7e

top_k_values added + refactoring

6f1bd54

Update Documentation & Code Style

f3976f2

enable titles during beir eval

7b6c43a

Merge branch 'integrate_beir' of github.com:deepset-ai/haystack into …

4086cba

…integrate_beir

tstadel marked this pull request as ready for review March 18, 2022 18:11

Update Documentation & Code Style

b90aa1f

tstadel added type:feature, topic:eval, journey:intermediate labels Mar 21, 2022

julian-risch approved these changes Mar 21, 2022

tstadel added 3 commits March 21, 2022 14:48

raise HaystackError instead of PipelineError

a502377

Merge branch 'integrate_beir' of github.com:deepset-ai/haystack into …

a3379b6

…integrate_beir

get rid of forced dedicated index

7d803f0

Merge branch 'master' into integrate_beir

ba8473e

tstadel requested a review from julian-risch March 21, 2022 15:15

minor docstring and comment fixes

a298b21

julian-risch approved these changes Mar 21, 2022

tstadel and others added 3 commits March 21, 2022 16:52

show warning on default index deletion

c5f7029

Update Documentation & Code Style

be88979

add delete_index to MockDocumentStore

7aefb14

tstadel merged commit ca86cc8 into master Mar 21, 2022

tstadel deleted the integrate_beir branch March 21, 2022 18:04

tstadel mentioned this pull request Mar 21, 2022

Support for custom retrievers: common base type for EvaluateRetrieval's retriever param beir-cellar/beir#84

Open

bogdankostic mentioned this pull request Mar 28, 2022

Implementing Generative Pseudo Labeling (GPL) #1908

Closed
