
Allow for batch querying when using Pipelines #1239

Closed
brandenchan opened this issue Jun 29, 2021 · 1 comment · Fixed by #2481
Labels
topic:pipeline type:feature New feature or request


brandenchan commented Jun 29, 2021

Same larger goal as #1168 but for querying instead of indexing.

As discussed with @oryx1729 and @tholor, we'd like to design the node and Pipeline APIs so that batches of queries can be processed together. This will not only make for a better user experience but also allow for future batch optimizations that speed up querying. Note that we're focusing here first on designing the interfaces; the optimizations can be handled separately for each node in dedicated issues.

One design choice we decided upon is to have an explicit separation of the single and batch functions. This makes for a clear user experience and avoids any ambiguity for developers looking into the code. More specifically, we want to avoid any uncertainty about the input and output formats of nodes.

The following methods should be implemented:

Pipeline.run(query)
Pipeline.run_batch(queries)

Node.run()
Node.run_batch()

If we call Pipeline.run_batch(), every node should be executed via its Node.run_batch() method, not Node.run().
In particular, @bogdankostic and @julian-risch agreed during the refinement of this issue that Pipeline.run_batch() should be implemented with the following signature and should return a list of the elements usually returned by run():

def run_batch(
        self,
        queries: Optional[Union[str, List[str]]] = None,
        file_paths: Optional[List[str]] = None,
        labels: Optional[Union[MultiLabel, List[MultiLabel]]] = None,
        documents: Optional[Union[List[Document], List[List[Document]]]] = None,
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
        params: Optional[dict] = None,
        debug: Optional[bool] = None,
    ):
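
A hypothetical usage sketch of that signature (the YAML file, node names, and result keys below are placeholders, not part of this proposal):

from haystack import Pipeline

# Hypothetical query pipeline definition; file name and node names are placeholders.
pipeline = Pipeline.load_from_yaml("query_pipeline.yaml")

# The same params apply to every query in the batch (see the note below).
results = pipeline.run_batch(
    queries=["Who wrote Faust?", "Where was Goethe born?"],
    params={"Retriever": {"top_k": 5}},
)

# One element per query, each shaped like the dict usually returned by run().
for result in results:
    print(result["query"], len(result["documents"]))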

There might be use cases where the same query is executed separately on different sets of documents; therefore, queries is still allowed to be a single query.
Note that every query in the batch needs to use the same params for now; otherwise, optimization would hardly be possible. We could lift that limitation in the future if necessary.

file_paths and meta are so far only used in indexing pipelines. If they are set, we should call run() instead of run_batch(), or raise an error if that is impossible.
Note that we should allow for flexibility in the format of the metadata passed into Pipeline.run_batch(): if it is a single meta dictionary, it should be applied to all queries; if it is a list, it should contain one meta dict per query, as sketched below.
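
A minimal sketch of that broadcasting rule (the helper name normalize_meta is hypothetical):

from typing import Any, Dict, List, Optional, Union

def normalize_meta(
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]],
    num_queries: int,
) -> Optional[List[Dict[str, Any]]]:
    if meta is None:
        return None
    if isinstance(meta, dict):
        # A single dict: apply the same metadata to every query in the batch.
        return [meta] * num_queries
    if len(meta) != num_queries:
        raise ValueError(f"Got {len(meta)} meta dicts for {num_queries} queries.")
    # A list: one meta dict per query, in order.
    return meta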

Further, every node's predict_batch() should have a batch_size param that can be passed via params.

For now, Node.run_batch() can be implemented in a naive, non-optimized way by simply calling Node.run() multiple times in a loop, splitting queries, labels, documents, and meta if necessary so that run() is called with a single query/list of documents. The individual results then need to be collected in a list of results.
As an alternative, BaseComponent could implement a default run_batch() that calls the node's run() method, as sketched below.
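
A rough sketch of such a default (simplified: the real run() signature and the pipeline's routing logic differ):

from typing import Any, Dict, List, Optional

class BaseComponent:
    def run(self, **kwargs) -> Dict[str, Any]:
        raise NotImplementedError

    def run_batch(
        self,
        queries: Optional[List[str]] = None,
        labels: Optional[List[Any]] = None,
        documents: Optional[List[List[Any]]] = None,
        meta: Optional[List[Dict[str, Any]]] = None,
    ) -> List[Dict[str, Any]]:
        # Naive fallback: split the batch and delegate to run() once per query.
        results = []
        for i, query in enumerate(queries or []):
            results.append(
                self.run(
                    query=query,
                    labels=labels[i] if labels else None,
                    documents=documents[i] if documents else None,
                    meta=meta[i] if meta else None,
                )
            )
        return results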

The FARMReader already has a predict_batch() method that can be used:

def predict_batch(self, query_doc_list: List[dict], top_k: int = None, batch_size: int = None):
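
A hedged usage sketch (import paths and Document field names vary across Haystack versions, and the exact keys expected in query_doc_list are an assumption here):

from haystack.nodes import FARMReader  # import path for Haystack ~1.x
from haystack.schema import Document

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Assumed shape: one dict per query with its retrieved documents.
# The keys "queries" and "docs" are an assumption, not verified here.
query_doc_list = [
    {"queries": "Who wrote Faust?",
     "docs": [Document(content="Faust was written by Johann Wolfgang von Goethe.")]},
    {"queries": "Where was Goethe born?",
     "docs": [Document(content="Goethe was born in Frankfurt am Main in 1749.")]},
]
predictions = reader.predict_batch(query_doc_list, top_k=3, batch_size=16)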

For the transformer-based models, we could make use of transformers' pipeline batching mechanism: https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
However, the docs say:

All pipelines (except zero-shot-classification and question-answering currently) can use batching.
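
For reference, a minimal sketch of that mechanism in transformers (the model choice here is arbitrary):

from transformers import pipeline

# Passing batch_size at call time makes the pipeline forward inputs in batches.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
texts = ["I love this.", "This is terrible.", "Absolutely fantastic!"]
results = classifier(texts, batch_size=8)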

@kathy-lee

Hi, may I ask whether 'run_batch' is still available in the new version of Haystack (now 2.8)? Thank you!
