proposal: `DocumentStores` and `Retrievers` #4370

ZanSara · 2023-03-09T16:15:46Z

Note for reviewers

Same policy as for the Pipelines proposal 🙂

Before commenting, please reach out to me first, especially if your feedback is large and varied. I'm happy to discuss and clarify all details 1:1. In any case, consider using a single comment instead of a review.

The proposal will be set as Ready to Merge once the high-level concepts have been decided upon and we can move on to smaller refinements (details about the naming, the wording, some smaller concepts like which exact dataclasses to implement, etc.).

Current open questions:

None yet

silvanocerza

Looks great to me! 💯

masci

🚀

tholor · 2023-03-17T07:04:38Z

I see the basic motivation and get the split from retriever into "retriever + embedder" so that we can use an embedder more clearly in an indexing pipeline and a retriever in a query pipeline 💯
I don't see the user value of making Retrievers "document-store-specific". From a user perspective, I'd find it confusing to have so many different retrievers, and it puts the emphasis on the type of document store behind a retriever rather than on the method used (embedding or BM25). A huge value prop of Haystack is that you can easily switch between document stores without changing the rest of your pipelines. With this design we actively move away from that, and it becomes very unclear whether you can swap a MemoryRetriever with a FAISSRetriever.
What's the advantage of a separate "WriteDocument" Node? Wouldn't the design be more similar between indexing and query pipelines (and therefore more intuitive) if we had, similar to the Retriever, something like `indexing_pipe.add_node("embedder", DocumentEmbedder(store="document_store", model_name="deepset/model-name"))`?
Overall, I am a bit concerned with adding too many "different little nodes" that do "very little jobs" and I have to remember all of them as a user. For example, "StringEmbedder" vs "DocumentEmbedder". Why not just one Embedder that can deal with different types? Why not create the "query embedding" as part of the retriever (as it is right now)?
If the complexity of current maintenance is too big for us, I'd rather deprecate document stores or move them to a community-maintained package.

ZanSara · 2023-03-17T09:27:11Z

Hey @tholor thank you for the review! Let's discuss these points in person. In the meantime, I'll add a hint of what my replies to your concerns will look like.

> I don't see the user value of making Retrievers "document-store-specific".

Right now we're asking users to check a matrix of supported docstore/retriever pairs in the documentation. The matrix changes continuously, and even we have trouble keeping up with the development of new retrieval methods in vector stores. The aim is to remove this hurdle. There's no point wondering every time "Does InMemoryDocumentStore support DensePassageRetriever?" or "Does this version of WeaviateDocumentStore support BM25?". Just pair MemoryDocumentStore with MemoryRetriever and it will work. Haystack developers will make sure it does.

About switching, I expect that to stay simple because most retrievers will have very similar parameters, if not identical. It will be on us to make sure they're as easy to swap as possible.
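To make the pairing idea concrete, here is a minimal, self-contained sketch. It does not use the Haystack API: every class, method, and function name below (`keyword_search`, `Retriever`, `top_hit_ids`, and the toy scoring) is an assumption made up for illustration. It only shows how a store-specific retriever can sit behind a shared `run` signature, so that code calling the retriever would not need to change when the store and its retriever are swapped together.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass
class Document:
    id: str
    content: str


class MemoryDocumentStore:
    """Toy in-memory store (hypothetical, not the real Haystack class)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}

    def write_documents(self, documents: List[Document]) -> None:
        for doc in documents:
            self._docs[doc.id] = doc

    def keyword_search(self, query: str, top_k: int) -> List[Document]:
        # Naive scoring: count how many query terms occur in the document text.
        terms = query.lower().split()
        scored = [
            (sum(term in doc.content.lower() for term in terms), doc)
            for doc in self._docs.values()
        ]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in matches[:top_k]]


class Retriever(Protocol):
    """Shared signature that every store-specific retriever would follow."""

    def run(self, query: str, top_k: int = 10) -> List[Document]:
        ...


class MemoryRetriever:
    """Knows only how to query a MemoryDocumentStore."""

    def __init__(self, store: MemoryDocumentStore) -> None:
        self.store = store

    def run(self, query: str, top_k: int = 10) -> List[Document]:
        return self.store.keyword_search(query, top_k)


def top_hit_ids(retriever: Retriever, query: str, top_k: int = 2) -> List[str]:
    # Depends only on the shared signature, not on the concrete store.
    return [doc.id for doc in retriever.run(query, top_k=top_k)]
```

In this sketch, a retriever for another store would be a second class with the same `run` signature, and `top_hit_ids` would accept it unchanged. Whether the real retrievers can keep their parameters that closely aligned is exactly the open point raised above.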

> What's the advantage of a separate "WriteDocument" Node?

We want nodes to perform a single task very well. We will always have the ability to make bigger nodes that perform the task of two or three nodes by using them under the hood. At this stage, the smaller they are, the better. It also makes it easier to adapt them if/when we iterate on the underlying pipeline design, and it reduces the size of their signature. No one likes objects whose init method takes 20+ parameters 😄

In addition, what if a user wants to create embeddings for documents and then do something else with them (for example, on-the-fly embeddings for retrieval)? Why force them to write to the store? Let's stay flexible.
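As a rough illustration of "small nodes, composed when needed", the sketch below splits embedding and writing into two tiny nodes and then builds a larger convenience node from them. None of this is the proposed Haystack API: the class names are borrowed from the discussion, while the `run` signatures, the dict used as a store, and the placeholder "embedding" (text length and vowel count) are assumptions chosen only to keep the example runnable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Document:
    id: str
    content: str
    embedding: Optional[List[float]] = None


class DocumentEmbedder:
    """Does one thing: attach an embedding to each document."""

    def run(self, documents: List[Document]) -> List[Document]:
        for doc in documents:
            text = doc.content.lower()
            # Placeholder for a real model: [text length, number of vowels].
            doc.embedding = [float(len(text)), float(sum(c in "aeiou" for c in text))]
        return documents


class WriteDocuments:
    """Does one thing: persist documents into a store (here, a plain dict)."""

    def __init__(self, store: Dict[str, Document]) -> None:
        self.store = store

    def run(self, documents: List[Document]) -> List[Document]:
        for doc in documents:
            self.store[doc.id] = doc
        return documents


class EmbedAndWrite:
    """Larger convenience node that uses the two small nodes under the hood."""

    def __init__(self, store: Dict[str, Document]) -> None:
        self._embedder = DocumentEmbedder()
        self._writer = WriteDocuments(store)

    def run(self, documents: List[Document]) -> List[Document]:
        return self._writer.run(self._embedder.run(documents))
```

With this split, a user who only wants on-the-fly embeddings calls `DocumentEmbedder().run(docs)` and never touches a store, while `EmbedAndWrite(store).run(docs)` covers the common indexing case in one step.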

> Overall, I am a bit concerned with adding too many "different little nodes" that do "very little jobs" and I have to remember all of them as a user. For example, "StringEmbedder" vs "DocumentEmbedder". Why not just one Embedder that can deal with different types? Why not create the "query embedding" as part of the retriever (as it is right now)?

Let's keep in mind that these are just examples: in practice they might never be added. But the underlying concept holds: we will favor small nodes that do one task very well over large nodes that do many things poorly. We'd like to avoid creating another PreProcessor 😅

> If the complexity of current maintenance is too big for us, I'd rather deprecate document stores or move them to a community-maintained package.

Will definitely be done 👍

ZanSara added 2 commits March 9, 2023 17:10

add proposal

8b566c4

add proposal

e7117ab

ZanSara added the proposal label Mar 9, 2023

ZanSara added 2 commits March 9, 2023 17:23

pr number

0673bd2

pr number

a3a9c10

ZanSara mentioned this pull request Mar 13, 2023

Migration to new Pipeline #4390

Closed

16 tasks

ZanSara added 3 commits March 14, 2023 14:56

start second draft

70da859

Merge branch 'main' into proposal-stores

1c17549

second draft

dd3ffcb

ZanSara changed the title ~~proposal: Stores and Data~~ proposal: DocumentStores and Retrievers Mar 14, 2023

ZanSara added 2 commits March 14, 2023 15:31

node examples

95ad27a

phrasing

396ebd8

ZanSara mentioned this pull request Mar 15, 2023

Separate concepts of "Retriever" and "Embedder" #2403

Closed

silvanocerza mentioned this pull request Mar 15, 2023

Audio is a supported content type but never used in the core codebase #4424

Closed

get_documents -> filter_documents

59e9683

ZanSara marked this pull request as ready for review March 16, 2023 09:15

ZanSara requested review from a team as code owners March 16, 2023 09:15

ZanSara requested review from masci, TuanaCelik, silvanocerza and mayankjobanputra and removed request for a team March 16, 2023 09:15

silvanocerza approved these changes Mar 16, 2023

masci approved these changes Mar 16, 2023

TuanaCelik approved these changes Mar 16, 2023

This was referenced Mar 16, 2023

Add MemoryDocumentStore for new Pipelines #4446

Closed

feat: initial implementation of MemoryDocumentStore for new Pipelines #4447

Merged

ZanSara merged commit 651be37 into main Mar 28, 2023

ZanSara deleted the proposal-stores branch March 28, 2023 14:31

julian-risch mentioned this pull request Jul 4, 2023

Migrate Components to Pipeline v2 #5265

Closed
