Refactor Finder to allow more flexibility (multiple retrievers, pure document search ...) #544

Closed
tholor opened this issue Nov 4, 2020 · 9 comments

@tholor
Member

tholor commented Nov 4, 2020

Is your feature request related to a problem? Please describe.

The finder was initially designed to wrap a single retriever and a single reader to do extractive QA.
As Haystack is growing and covering more use cases, we need to rethink the design to allow more flexibility (e.g. multiple retrievers, pure document search):

To consider:

  • Should be easy to exchange reader with generator (e.g. same API endpoint?)
  • Midterm: Import/Export config that describes whole setup

Some thoughts to get started...

1) Single vs. multiple Finder classes?

Option B) Single Finder
get_answers()

  • faq, generative, extractive
  • can we optimize the return format to cover all cases nicely?

get_documents()

Option A) Multiple Finders
a) splitting into faq, generative, extractive
b) DocFinder, QAFinder => we don't gain much except a clearer init
2) One vs. splitting get_answers()? generate_answer, faq_answer ...
3) Passing List of retrievers vs. Finder.add(retriever) + Finder.add(retriever) + Finder.add(reader)
4) Redesign API endpoints (e.g. doc-qa still right naming?)

I'd lean towards...
=> Single Finder
=> get_answers() get_documents()
=> API: documents object (ids+meta)
=> remove get_answers_via_similar...()

Open question: FAQ via get_documents() or via get_answers()
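
To make the "Single Finder" direction concrete, here is a minimal sketch of what such a class could look like. Everything in it is an assumption for illustration: `Document`, the `Retriever` / `Reader` callable signatures, and the toy retriever and reader are hypothetical and are not the Haystack API. It only shows one Finder exposing `get_documents()` and `get_answers()` over a list of retrievers, with the reader being swappable for a generator that has the same signature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


# Hypothetical types for illustration only; not the Haystack API.
@dataclass
class Document:
    id: str
    text: str
    meta: Dict[str, str] = field(default_factory=dict)


Retriever = Callable[[str, int], List[Document]]      # (query, top_k) -> documents
Reader = Callable[[str, List[Document]], List[dict]]  # (query, documents) -> answers


class Finder:
    """Single Finder wrapping any number of retrievers and an optional reader/generator."""

    def __init__(self, retrievers: List[Retriever], reader: Optional[Reader] = None):
        self.retrievers = retrievers
        self.reader = reader

    def get_documents(self, query: str, top_k: int = 10) -> List[Document]:
        # Query every retriever and keep the first occurrence of each document id.
        seen: Dict[str, Document] = {}
        for retrieve in self.retrievers:
            for doc in retrieve(query, top_k):
                seen.setdefault(doc.id, doc)
        return list(seen.values())[:top_k]

    def get_answers(self, query: str, top_k: int = 10) -> List[dict]:
        if self.reader is None:
            raise ValueError("get_answers() needs a reader or generator")
        return self.reader(query, self.get_documents(query, top_k))


# Toy data and components, made up for this example.
docs = [
    Document("1", "Berlin is the capital of Germany."),
    Document("2", "Paris is the capital of France."),
]


def keyword_retriever(query: str, top_k: int) -> List[Document]:
    words = set(query.lower().split())
    hits = [d for d in docs if words & set(d.text.lower().rstrip(".").split())]
    return hits[:top_k]


def echo_reader(query: str, documents: List[Document]) -> List[dict]:
    # Returns each document text as an "answer" together with ids + meta.
    return [{"answer": d.text, "document_id": d.id, "meta": d.meta} for d in documents]


finder = Finder([keyword_retriever], reader=echo_reader)
```

With this shape, a pure document search is just a `Finder` without a reader, and the FAQ question above reduces to whether the FAQ match is returned from `get_documents()` or wrapped as an answer by the reader.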

@tholor tholor added the type:feature New feature or request label Nov 4, 2020
@guillim
Contributor

guillim commented Nov 4, 2020

We have a use case here at @etalab with @psorianom. We already discussed it in #125, but here is a summary:

We need to combine at least two retrievers:

  • sparse (BM25)
  • dense (e.g. SBERT or DPR...)

Why: mostly because BM25 is very efficient but misses synonyms, which the dense retrievers could add.

Note:
It could also be interesting to combine several BM25 retrievers at some point, each tuned in its own way, so I like the idea of a modular finder.
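
As a rough illustration of how results from a sparse and a dense retriever could be merged, here is a small sketch using reciprocal rank fusion (RRF). RRF is not mentioned in this thread; it is only one possible merging strategy, swapped in here because it needs nothing but the rank positions. The document ids and rankings below are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document ids; each input list is ordered best-first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to keep the output deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # made-up sparse (BM25) results
dense_ranking = ["doc_d", "doc_a", "doc_b"]  # made-up dense results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

The same function accepts any number of rankings, so several differently tuned BM25 retrievers could be fused in the same way.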

=> We are really looking forward to having this available in the FastAPI Swagger

@tholor
Member Author

tholor commented Nov 4, 2020

@guillim Yes, this case will definitely be covered!

@tholor tholor added this to the #4 milestone Nov 4, 2020
@lalitpagaria
Contributor

lalitpagaria commented Nov 4, 2020

Midterm: Import/Export config that describes whole setup

For the above, we could check out https://github.com/facebookresearch/hydra
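
Independent of any particular config library, the import/export idea can be sketched with the standard library alone. The component classes, the `REGISTRY` lookup, and the config layout below are all hypothetical, and JSON is used only as a stand-in for whatever format would eventually be chosen.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List


@dataclass
class KeywordRetriever:  # hypothetical component
    top_k: int = 10


@dataclass
class ExtractiveReader:  # hypothetical component
    model_name: str = "some-model"
    top_k: int = 3


# Maps the type name stored in the config back to a class.
REGISTRY = {cls.__name__: cls for cls in (KeywordRetriever, ExtractiveReader)}


def export_config(components: List[Any]) -> str:
    """Serialize the whole setup as a list of {type, params} entries."""
    entries = [{"type": type(c).__name__, "params": asdict(c)} for c in components]
    return json.dumps({"version": 1, "components": entries}, indent=2)


def import_config(text: str) -> List[Any]:
    """Rebuild the components from an exported config string."""
    config = json.loads(text)
    return [REGISTRY[e["type"]](**e["params"]) for e in config["components"]]


setup = [KeywordRetriever(top_k=5), ExtractiveReader(model_name="my-reader")]
restored = import_config(export_config(setup))
```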

@tholor
Member Author

tholor commented Nov 11, 2020

Quick Update:

We are currently thinking bigger here. With Haystack, we already have many nice "lego building blocks". However, we are missing a flexible, powerful way of sticking them together. Instead of having rather rigid Finder classes, we are therefore thinking of introducing a highly flexible Pipeline class that uses a Directed Acyclic Graph (DAG) under the hood (a bit like Apache Airflow).

You could add "tasks" as nodes (Retriever, Reader, Generator ...) and route your query via edges. This could cover not only all of the above use cases, but would also allow many other, more complex search pipelines that we have in mind for the future.

[image]
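
A toy sketch of the idea, to make the nodes-and-edges description concrete. This is not the planned implementation: the class layout, the `add_node(name, component, inputs)` signature, and the special `"Query"` root node are assumptions made for this example. Ordering is done with `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List


class Pipeline:
    """Toy DAG pipeline: each node is a callable fed by the outputs of its input nodes."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[..., Any]] = {}
        # Maps node name -> names of the nodes it reads from. "Query" is the root.
        self.inputs: Dict[str, List[str]] = {"Query": []}

    def add_node(self, name: str, component: Callable[..., Any], inputs: List[str]) -> None:
        if name in self.inputs:
            raise ValueError(f"node {name!r} already exists")
        unknown = [i for i in inputs if i not in self.inputs]
        if unknown:
            raise ValueError(f"unknown input nodes: {unknown}")
        # Inputs must already exist, so the graph stays acyclic by construction.
        self.nodes[name] = component
        self.inputs[name] = list(inputs)

    def run(self, query: str) -> Dict[str, Any]:
        results: Dict[str, Any] = {"Query": query}
        # static_order() yields every node after all of its predecessors.
        for name in TopologicalSorter(self.inputs).static_order():
            if name == "Query":
                continue
            args = [results[i] for i in self.inputs[name]]
            results[name] = self.nodes[name](*args)
        return results


# Two stand-in retrievers whose outputs are joined by a third node.
pipe = Pipeline()
pipe.add_node("Sparse", lambda q: [f"bm25 hit for '{q}'"], inputs=["Query"])
pipe.add_node("Dense", lambda q: [f"dense hit for '{q}'"], inputs=["Query"])
pipe.add_node("Join", lambda a, b: a + b, inputs=["Sparse", "Dense"])
out = pipe.run("who wrote Faust?")
```

Here `out["Join"]` holds the concatenated results of both branches; a Reader or Generator would simply be another node taking `"Join"` as its input.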

Happy to hear your feedback on this direction!

@guillim
Contributor

guillim commented Nov 12, 2020

Sounds great to me. It would indeed fit many more complex situations, and be helpful for testing combos

@lalitpagaria
Contributor

Nice idea @tholor

Not directly related, but to make it more extensible: how about adding support for remote API calls and callbacks? I am mainly thinking along the lines of the Jina framework, especially to make Haystack highly distributed by adding cloud-native support. The Generator, Retriever, Docs Cleaner, APIs etc. would each run in their own env/container/machine and communicate via RPC (gRPC or HTTP).

@tholor
Member Author

tholor commented Nov 12, 2020

Yes, absolutely. Our idea is to start with a "local" Pipeline and get the API / usage straight, and then later on enable a "distributed" pipeline that executes single nodes on different containers/machines.

@lalitpagaria
Contributor

Awesome, let me know if I can contribute to it.

@tholor
Member Author

tholor commented Nov 25, 2020

We implemented a first basic draft with #596.
In the next weeks / months, we will extend it to:

  • support more node types
  • have more utility functions (e.g. yml import / export)
  • make the underlying execution fully parallel and distributed

We will tackle those steps in individual PRs & Issues ...
