proposal: DocumentStores and Retrievers #4370
Looks great to me! 💯
🚀
Hey @tholor thank you for the review! Let's discuss these points in person. In the meantime, I'll add a hint of what my replies to your concerns will look like.
Right now we're asking users to check a matrix of supported docstore/retriever pairs in the documentation. The matrix changes continuously, and even we have trouble keeping up with the development of new retrieval methods in vector stores. The aim is to remove this hurdle. No point wondering every time "Does InMemoryDocumentStore support DensePassageRetriever?", "Does this version of WeaviateDocumentStore support BM25?". Just pair
About switching, I expect that to stay simple because most retrievers will have very similar parameters, if not identical. It will be on us to make sure they're as easy to swap as possible.
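To make the swapping argument concrete, here is a minimal sketch of store-specific retrievers that share one call signature. Every name in it (`MemoryStore`, `MemoryKeywordRetriever`, `MemorySubstringRetriever`, the `run(query, top_k)` signature) is a hypothetical illustration, not the API this proposal settles on.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Document:
    id: str
    content: str


@dataclass
class MemoryStore:
    """Hypothetical store: it only holds documents and has no retrieval logic."""
    docs: Dict[str, Document] = field(default_factory=dict)

    def write(self, documents: List[Document]) -> None:
        for doc in documents:
            self.docs[doc.id] = doc


class Retriever(Protocol):
    """Assumed shared signature, so swapping retrievers is a one-line change."""
    def run(self, query: str, top_k: int = 3) -> List[Document]: ...


class MemoryKeywordRetriever:
    """Hypothetical retriever written specifically for MemoryStore."""
    def __init__(self, store: MemoryStore) -> None:
        self.store = store

    def run(self, query: str, top_k: int = 3) -> List[Document]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(doc.content.lower().split())), doc)
            for doc in self.store.docs.values()
        ]
        # Keep only documents sharing at least one term with the query.
        hits = [(score, doc) for score, doc in scored if score > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in hits[:top_k]]


class MemorySubstringRetriever:
    """A second hypothetical retriever for the same store, same signature."""
    def __init__(self, store: MemoryStore) -> None:
        self.store = store

    def run(self, query: str, top_k: int = 3) -> List[Document]:
        needle = query.lower()
        hits = [d for d in self.store.docs.values() if needle in d.content.lower()]
        return hits[:top_k]


def search(retriever: Retriever, query: str) -> List[str]:
    """Caller code depends only on the shared signature, not on the class."""
    return [doc.id for doc in retriever.run(query, top_k=2)]
```

Because each retriever class is written for one store, the "is this pair supported?" question is answered by the class name, and `search` does not change when one retriever is replaced by the other.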
We want nodes to perform a single task very well. We will always have the ability to make bigger nodes that perform the task of two or three nodes by using them under the hood. At this stage, the smaller they are, the better. It also makes it easier to adapt them if/when we iterate on the underlying pipeline design, and it reduces the size of their signature. No one likes objects whose init method takes 20+ parameters 😄 In addition, what if a user wants to create embeddings for documents and then do something else with them (for example, on-the-fly embeddings for retrieval)? Why force them to write to the store? Let's stay flexible.
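As a rough illustration of the "small nodes" point, the sketch below keeps embedding and writing in two separate nodes, so embeddings can be used on the fly without touching a store. The names `DocumentEmbedder`, `DocumentWriter` and `rank_on_the_fly` are hypothetical, and `embed` is a toy stand-in for a real embedding model.

```python
import math
from typing import Dict, List


def embed(text: str, dims: int = 8) -> List[float]:
    """Toy stand-in for an embedding model: a normalized character-bucket count."""
    vec = [0.0] * dims
    for ch in text.lower():
        vec[ord(ch) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class DocumentEmbedder:
    """Node 1 (hypothetical): computes embeddings and nothing else."""
    def run(self, texts: List[str]) -> List[List[float]]:
        return [embed(t) for t in texts]


class DocumentWriter:
    """Node 2 (hypothetical): persists texts with their embeddings and nothing else."""
    def __init__(self, store: Dict[str, List[float]]) -> None:
        self.store = store

    def run(self, texts: List[str], embeddings: List[List[float]]) -> int:
        for text, emb in zip(texts, embeddings):
            self.store[text] = emb
        return len(self.store)


def rank_on_the_fly(query: str, texts: List[str], embedder: DocumentEmbedder) -> List[str]:
    """Uses the embedder alone: no store is involved at any point."""
    query_vec = embedder.run([query])[0]
    doc_vecs = embedder.run(texts)
    # Vectors are unit-length, so the dot product is the cosine similarity.
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    order = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    return [texts[i] for i in order]
```

A bigger convenience node could still chain `DocumentEmbedder` and `DocumentWriter` under the hood, while each small node keeps a short signature.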
Let's keep in mind that these are just examples: in practice they might never be added. But the underlying concept holds: we will favor small nodes that do one task very well over large nodes that do many things poorly. We'd like to avoid creating another
Will definitely be done 👍
Fixes #1897
Note for reviewers
Same policy as for the Pipelines proposal 🙂
Before commenting, please reach out to me first, especially if your feedback is large and varied. I'm happy to discuss and clarify all details 1:1. In any case, consider using a single comment instead of a review.
The proposal will be set as Ready to Merge once the high-level concepts have been decided upon and we can move on to smaller refinements (details about the naming, the wording, some smaller concepts like which exact dataclasses to implement, etc.)
Current open questions:
None yet