Refactor Finder to allow more flexibility (multiple retrievers, pure document search ...) #544

tholor · 2020-11-04T08:46:42Z

Is your feature request related to a problem? Please describe.

The finder was initially designed to wrap a single retriever and a single reader to do extractive QA.
As Haystack is growing and covering more use cases we need to rethink the design to allow:

multiple retrievers (see Combination of multiple retrievers #125)
a generator instead of a reader to do generative QA
pure document search (retriever only, see Refactor "most similar search" to be more generic #420)
reader only
more complex routing patterns in the future (e.g. classify if a request is a question, if yes => reader, if no => pure retriever)
optional: multiple retrievers in series instead of in parallel, inspired by passage re-ranking methods (see this blog's section on two-stage ranking)
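
The routing pattern mentioned in the list above (classify whether a request is a question; if yes, go to the reader, otherwise do pure retrieval) could look roughly like the sketch below. Everything here is a hypothetical stand-in written for illustration: `CORPUS`, `tokenize`, `retrieve`, `read`, `is_question` and `route` are not Haystack APIs, and the "classifier" is just a check for a trailing question mark.

```python
from typing import Dict, List, Set

# All names below are hypothetical stand-ins, not Haystack APIs.
CORPUS = [
    "Berlin is the capital of Germany.",
    "Haystack pipelines and retrievers",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return {word.strip(".?!,") for word in text.lower().split()}


def retrieve(query: str) -> List[str]:
    """Toy retriever: keep documents that share at least one token with the query."""
    terms = tokenize(query)
    return [doc for doc in CORPUS if terms & tokenize(doc)]


def read(query: str, docs: List[str]) -> List[str]:
    """Toy reader: pretend the top document is the extracted answer."""
    return docs[:1]


def is_question(query: str) -> bool:
    """Toy classifier (assumption): a request is a question if it ends with '?'."""
    return query.strip().endswith("?")


def route(query: str) -> Dict[str, object]:
    """Send questions through retriever + reader, everything else to the retriever only."""
    docs = retrieve(query)
    if is_question(query):
        return {"type": "answers", "results": read(query, docs)}
    return {"type": "documents", "results": docs}


question_result = route("What is the capital of Germany?")
keyword_result = route("haystack retrievers")
```

In this sketch both branches return the same envelope (`type` plus `results`), which is one possible way to keep a single endpoint for answers and documents.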

To consider:

Should be easy to exchange reader with generator (e.g. same API endpoint?)
Midterm: Import/Export config that describes whole setup

Some thoughts to get started...

1) Single vs. Multiple Finder classes?

Option B) Single Finder
get_answers(): faq, generative, extractive; can we optimize the return format to cover all cases nicely?
get_documents()

Option A) Multiple Finders
a) splitting into faq, generative, extractive
b) DocFinder QAFinder => we don't gain much except clearer init
2) One get_answers() vs. splitting it into generate_answer, faq_answer ...?
3) Passing List of retrievers vs. Finder.add(retriever) + Finder.add(retriever) + Finder.add(reader)
4) Redesign API endpoints (e.g. doc-qa still right naming?)

I'd lean towards...
=> Single Finder
=> get_answers() get_documents()
=> API: documents object (ids+meta)
=> remove get_answers_via_similar...()

Open question: FAQ via get_documents() or via get_answers()

guillim · 2020-11-04T09:16:24Z

We have a use case for this at @etalab with @psorianom. We have already discussed it in #125, but here is a summary:

Our need is to combine at least 2 retrievers:

sparse (BM25)
dense (ex: SBERT or DPR...)

Why: Mostly because BM25 is very efficient but does not retrieve documents that use synonyms of the query terms, which is something the dense retrievers could add.
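
To illustrate what combining a sparse and a dense retriever could mean in practice, here is a minimal sketch that merges two ranked result lists with reciprocal rank fusion (RRF). RRF is a technique chosen here purely for illustration; the thread does not prescribe any merge strategy. The function name, the document ids and the constant `k = 60` (the value commonly used with RRF) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are sorted by the summed score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a sparse (BM25) and a dense retriever for one query.
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]

merged = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF only looks at ranks, it sidesteps the fact that BM25 scores and dense similarity scores are not on the same scale. The same function would also accept more than two lists, for example several differently tuned BM25 retrievers.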

Note:
It could also be interesting to combine several BM25 retrievers at some point, each tuned in its own way, so I like the idea of a modular finder.

=> We are very much looking forward to having this available through the FastAPI Swagger interface.

tholor · 2020-11-04T10:26:01Z

@guillim Yes, this case will definitely be covered!

lalitpagaria · 2020-11-04T22:14:10Z

Midterm: Import/Export config that describes whole setup

For the above, we could take a look at Hydra: https://github.com/facebookresearch/hydra

tholor · 2020-11-11T05:35:23Z

Quick Update:

We are currently thinking bigger here. With Haystack, we already have many nice "lego building blocks". However, we are missing a flexible, powerful way of sticking them together. Instead of rather rigid Finder classes, we are therefore thinking of introducing a highly flexible Pipeline class that uses a Directed Acyclic Graph under the hood (a bit like Apache Airflow).

You could add "tasks" as nodes (Retriever, Reader, Generator ...) and route your query via edges. This could cover not only all of the above use cases, but would also allow many other, more complex search pipelines that we have in mind for the future.
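
As a rough illustration of the idea (not the design that was eventually implemented), a DAG-based pipeline could be sketched as below. The `Pipeline` class, its `add_node` / `run` methods, the reserved `"Query"` input name and the toy components are all hypothetical. The only library call used is `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+), which orders the nodes and raises `CycleError` if the graph is not acyclic.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List


class Pipeline:
    """Hypothetical minimal DAG pipeline: nodes are callables, edges are input names."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[..., Any]] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add_node(self, name: str, component: Callable[..., Any], inputs: List[str]) -> None:
        """Register a component and the names of the nodes whose outputs it consumes."""
        self.nodes[name] = component
        self.inputs[name] = list(inputs)

    def run(self, query: str) -> Dict[str, Any]:
        """Execute every node once, in topological order, and return all outputs."""
        outputs: Dict[str, Any] = {"Query": query}
        # "Query" is the pipeline input, not a node, so it is left out of the graph.
        graph = {name: set(deps) - {"Query"} for name, deps in self.inputs.items()}
        for name in TopologicalSorter(graph).static_order():
            args = [outputs[dep] for dep in self.inputs[name]]
            outputs[name] = self.nodes[name](*args)
        return outputs


# Toy components standing in for real retrievers (assumptions for illustration).
def sparse_retriever(query: str) -> List[str]:
    return ["d1", "d2"]


def dense_retriever(query: str) -> List[str]:
    return ["d2", "d3"]


def join_documents(first: List[str], second: List[str]) -> List[str]:
    """Concatenate two result lists, dropping duplicates while keeping order."""
    return list(dict.fromkeys(first + second))


pipeline = Pipeline()
pipeline.add_node("Sparse", sparse_retriever, inputs=["Query"])
pipeline.add_node("Dense", dense_retriever, inputs=["Query"])
pipeline.add_node("Join", join_documents, inputs=["Sparse", "Dense"])

result = pipeline.run("what is a pipeline?")
```

Here `result["Join"]` holds the merged documents from both retrievers. Appending a reader or generator would just be another `add_node` call that lists `"Join"` (and, if needed, `"Query"`) as its inputs.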

Happy to hear your feedback on this direction!

guillim · 2020-11-12T08:44:40Z

Sounds great to me. It would indeed fit many more complex situations, and be helpful for testing combos

lalitpagaria · 2020-11-12T13:13:48Z

Nice idea @tholor

Not related to this, but to make it more extensible: how about adding remote API call and callback support? I am mainly thinking along the lines of the Jina framework, especially to make Haystack highly distributed by adding cloud-native support. The Generator, Retriever, Docs Cleaner, APIs, etc. would each run in their own env/container/machine and communicate via RPC (gRPC or HTTP).

tholor · 2020-11-12T14:20:58Z

Yes, absolutely. Our idea is to start with a "local" Pipeline and get the API / usage straight, and then later on enable a "distributed" pipeline that executes single nodes on different containers/machines.

lalitpagaria · 2020-11-12T17:53:15Z

Awesome, let me know if I can contribute to it.

tholor · 2020-11-25T11:15:37Z

We implemented a first basic draft with #596.
In the next weeks / months, we will extend it to:

support more types of nodes
have more utility functions (e.g. yml import / export)
make the underlying execution fully parallel and distributed

We will tackle those steps in individual PRs & Issues ...

tholor added the type:feature New feature or request label Nov 4, 2020

tholor assigned tholor and tanaysoni Nov 4, 2020

tholor added this to the #4 milestone Nov 4, 2020

tholor mentioned this issue Nov 11, 2020

[Discussion] bm25 semantic search / QueryExpander #573

Closed

tholor mentioned this issue Nov 16, 2020

Make visible an interface to the Inverse Cloze Task (ORQA Fine-tuning) #475

Closed

tanaysoni mentioned this issue Nov 17, 2020

Add support for building custom Search Pipelines #596

Merged

tholor closed this as completed Nov 25, 2020

ScottishFold007 mentioned this issue Jan 25, 2021

RequestError when use the trained DPR model #766

Closed
