Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: DocumentStores and Retrievers #4370

Merged
merged 10 commits into from
Mar 28, 2023
Merged

proposal: DocumentStores and Retrievers #4370

merged 10 commits into from
Mar 28, 2023

Conversation

ZanSara
Copy link
Contributor

@ZanSara ZanSara commented Mar 9, 2023

Fixes #1897


Note for reviewers

Same policy as for the Pipelines proposal 🙂

Before commenting, please reach out to me first, especially if your feedback is large and varied. I'm happy to discuss and clarify all details 1:1. In any case, consider using a single comment instead of a review.

The proposal will be set as Ready to Merge once the high-level concept have been decided upon and we can move on to smaller refinements (details about the naming, the wording, some smaller concepts like which exact dataclasses to implement, etc.)

Current open questions:

None yet

@ZanSara ZanSara mentioned this pull request Mar 13, 2023
16 tasks
@ZanSara ZanSara changed the title proposal: Stores and Data proposal: DocumentStores and Retrievers Mar 14, 2023
@ZanSara ZanSara marked this pull request as ready for review March 16, 2023 09:15
@ZanSara ZanSara requested review from a team as code owners March 16, 2023 09:15
@ZanSara ZanSara requested review from masci, TuanaCelik, silvanocerza and mayankjobanputra and removed request for a team March 16, 2023 09:15
Copy link
Contributor

@silvanocerza silvanocerza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! 💯

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@tholor
Copy link
Member

tholor commented Mar 17, 2023

  • I see the basic motivation and get the split from retriever into "retriever + embedder" so that we can use an embedder more clearly in an indexing pipeline and a retriever in a query pipeline 💯
  • I don't see the user value of making Retrievers "document-store-specific". From a user perspective, I'd find it confusing to have so many different retrievers and it puts the emphasis on the type of documentstore in a retriever and not on the used method (embedding or BM25). A huge value prop of haystack is that you can easily switch between documentstores without changing the rest of your pipelines. With this design we move actively away from it and it becomes very unclear if you can swap a MemoryRetriever with a FAISSRetriever.
  • What's the advantage of a separate "WriteDocument" Node? Wouldn't the design be more similar between indexing and query pipelines (and therefore intuitive) if we had, similar to the Retriever, something like indexing_pipe.add_node("embedder", DocumentEmbedder(store="document_store", model_name="deepset/model-name"))
  • Overall, I am a bit concerned with adding too many "different little nodes" that do "very little jobs" and I have to remember all of them as user. For example, "StringEmbedder" vs "DocumentEmbedder". Why not just one Embedder that can deal with different types? Why not creating the "query embedding" as part of the retriever (as it is right now)?
  • If the complexity of current maintenance is too big for us, I'd rather deprecate documentstores or move them to a community-maintained package

@ZanSara
Copy link
Contributor Author

ZanSara commented Mar 17, 2023

Hey @tholor thank you for the review! Let's discuss these points in person. In the meantime, I'll add a hint of what my replies to your concerns will look like.

I don't see the user value of making Retrievers "document-store-specific".

Right now we're asking users to check out a matrix of docstore/retriever supported pairs in the documentation. The matrix changes continuously and even we have trouble keeping up with the developments of new retrieval methods in vector stores. The aim is to remove this hurdle. No point wondering every time "Does InMemoryDocumentStore support DensePassageRetriever?", "Does this version of WeaviateDocumentStore support BM25?". Just pair MemoryDocumentStore with MemoryRetriever and it will work. Haystack developers will make sure it does.

About switching, I expect that to stay simple because most retriever will have very similar parameters, if not identical. It will be on us to make sure they're as easy to swap as possible.

What's the advantage of a separate "WriteDocument" Node?

We want nodes to perform a single task very well. We will always have the ability to make bigger nodes that perform the task of two or three nodes by using them under the hood. At this stage, the smaller they are, the better. It also makes easier to adapt them if/when we iterate on the underlying pipeline design and reduces the size of their signature. No one likes objects whose init method take 20+ parameters 😄

In addition, what if a user wants to create embeddings for documents and then do something else with them (for example, on-the-fly embeddings for retrieval)? Why forcing them to write to the store? Let's stay flexible.

Overall, I am a bit concerned with adding too many "different little nodes" that do "very little jobs" and I have to remember all of them as user. For example, "StringEmbedder" vs "DocumentEmbedder". Why not just one Embedder that can deal with different types? Why not creating the "query embedding" as part of the retriever (as it is right now)?

Let's keep in mind that these are just examples: in practice they might never be added. But the underlying concept holds: we will favor small nodes that do one task very well over large nodes that do many things barely. We'd like to avoid creating another PreProcessor 😅

If the complexity of current maintenance is too big for us, I'd rather deprecate documentstores or move them to a community-maintained package

Will definitely be done 👍

@ZanSara ZanSara merged commit 651be37 into main Mar 28, 2023
@ZanSara ZanSara deleted the proposal-stores branch March 28, 2023 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rethink Document Stores
5 participants