[Proposal] EmbeddingStore in Haystack #1004

Closed
lalitpagaria opened this issue Apr 27, 2021 · 3 comments

@lalitpagaria
Contributor

Currently, Haystack has the notion of a DocumentStore (SQL, FAISS, Milvus, in-memory, and ES).

These document stores fall into three categories:

  • Pure document stores (SQL and in-memory): store text and metadata. Querying is based on full-text search.
  • Pure embedding stores (FAISS and Milvus): store text, metadata, and embeddings. Querying is based on vector search.
  • Hybrid store (ES): stores text, metadata, and embeddings (if used as an embedding store). Querying is based on vector search as well as full-text search or BM25.

Except for the ES store, all other stores keep embeddings separately from document storage. So the idea is to introduce the notion of an EmbeddingStore, which will:

  • Be a subclass of BaseDocumentStore
  • Take document_store as a required parameter
  • Handle embedding generation (the idea is to keep embedding generation logic close to the embedding store). Users can customize the generation logic to their needs.
  • On write: generate embeddings, store them in its own store, and write the documents to document_store
  • On search: generate an embedding for the given query, query its own store to find vector_ids, and then query document_store
  • Route all other operations, such as delete and update, through its own storage first and then to document_store
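
To make the flow above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the class and method names (`InMemoryDocumentStore`, `EmbeddingStore`, `get_documents_by_id`, `toy_embed`) are hypothetical and are not Haystack's actual API, the "embedding" is just a letter-count vector, and the vector search is brute-force cosine similarity rather than FAISS or Milvus. For simplicity the vector id is the same as the document id.

```python
import math
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str) -> List[float]:
    """Toy embedding (assumption): counts of the letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


class BaseDocumentStore(ABC):
    """Hypothetical minimal base class, not Haystack's real one."""

    @abstractmethod
    def write_documents(self, docs: List[dict]) -> None: ...

    @abstractmethod
    def delete_documents(self, ids: List[str]) -> None: ...


class InMemoryDocumentStore(BaseDocumentStore):
    """Pure document store: keeps only text and metadata."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def write_documents(self, docs: List[dict]) -> None:
        for doc in docs:
            self.docs[doc["id"]] = doc

    def get_documents_by_id(self, ids: List[str]) -> List[dict]:
        return [self.docs[i] for i in ids if i in self.docs]

    def delete_documents(self, ids: List[str]) -> None:
        for i in ids:
            self.docs.pop(i, None)


class EmbeddingStore(BaseDocumentStore):
    """Owns the vectors and delegates text/metadata to document_store."""

    def __init__(self, document_store: InMemoryDocumentStore,
                 embed_fn: Callable[[str], List[float]]) -> None:
        self.document_store = document_store
        self.embed_fn = embed_fn  # user-customizable generation logic
        self.vectors: Dict[str, List[float]] = {}

    def write_documents(self, docs: List[dict]) -> None:
        # Generate and store embeddings, then write to the document store.
        for doc in docs:
            self.vectors[doc["id"]] = self.embed_fn(doc["text"])
        self.document_store.write_documents(docs)

    def query(self, text: str, top_k: int = 1) -> List[dict]:
        # Embed the query, find the closest vector ids, then fetch the docs.
        q = self.embed_fn(text)
        ranked = sorted(self.vectors,
                        key=lambda i: cosine(q, self.vectors[i]),
                        reverse=True)
        return self.document_store.get_documents_by_id(ranked[:top_k])

    def delete_documents(self, ids: List[str]) -> None:
        # Delete from own storage first, then from the document store.
        for i in ids:
            self.vectors.pop(i, None)
        self.document_store.delete_documents(ids)
```

With this shape, swapping the vector backend or the document backend would only mean replacing one of the two objects, which is the separation the proposal is after.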

There will be three DocumentStore subclasses based on the current implementation:

  • SQL
  • In-Memory
  • ES

There will be three EmbeddingStore subclasses based on the currently available features:

  • FAISS (can support ES, SQL, and in-memory document stores)
  • Milvus (can support ES, SQL, and in-memory document stores)
  • Sparse (only supports ES)
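
The supported pairings listed above could be encoded as a small lookup table that is checked when an embedding store is constructed. This is only a sketch: the string keys and the `check_combination` helper are hypothetical names chosen for illustration, not anything that exists in Haystack.

```python
from typing import Dict, Set

# Embedding store -> document stores it can be paired with (from the list above).
SUPPORTED_DOCUMENT_STORES: Dict[str, Set[str]] = {
    "faiss": {"es", "sql", "in_memory"},
    "milvus": {"es", "sql", "in_memory"},
    "sparse": {"es"},
}


def check_combination(embedding_store: str, document_store: str) -> None:
    """Raise ValueError if the pairing is unknown or unsupported."""
    if embedding_store not in SUPPORTED_DOCUMENT_STORES:
        raise ValueError(f"Unknown embedding store: {embedding_store!r}")
    if document_store not in SUPPORTED_DOCUMENT_STORES[embedding_store]:
        raise ValueError(
            f"{embedding_store!r} cannot be combined with {document_store!r}"
        )
```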

In the future, if Haystack supports Weaviate, it will be an EmbeddingStore subclass.

I can think of the following pros and cons of the above proposal.

Pros -

  • Separation of responsibilities between the doc store and the embedding store
  • Embedding generation logic sits closer to the storage
  • Users can customise search logic easily
  • Users can use the same document_store in one pipeline and the embedding store in another
  • Easy to integrate other embedding stores
  • ES can also be used with FAISS or Milvus

Cons -

  • Direct updates/deletes on document_store will not be reflected in the embedding store. This is already possible today by interacting directly with the underlying store, such as SQL. So we would need to add logic to the doc store so that if the document data contains a vector_id, it prints an error or warning
  • Confusion among users about the purpose and use
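
The guard mentioned in the first con could look roughly like the sketch below. It assumes documents are plain dicts with an optional `vector_id` key and uses Python's standard `warnings` module; the class name `GuardedDocumentStore` is hypothetical and this is not how Haystack's stores are implemented.

```python
import warnings
from typing import Dict, List


class GuardedDocumentStore:
    """Hypothetical doc store that warns on direct deletes of embedded docs."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def write_documents(self, docs: List[dict]) -> None:
        for doc in docs:
            self.docs[doc["id"]] = doc

    def delete_documents(self, ids: List[str]) -> None:
        for doc_id in ids:
            doc = self.docs.pop(doc_id, None)
            # A vector_id means an embedding store still holds a vector
            # for this document, so a direct delete leaves it orphaned.
            if doc is not None and doc.get("vector_id") is not None:
                warnings.warn(
                    f"Document {doc_id!r} has vector_id {doc['vector_id']!r}; "
                    "delete it through the embedding store to keep both in sync."
                )
```

Whether this should be a warning or a hard error is a design choice; a warning keeps existing direct-access workflows running while still flagging the inconsistency.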

I have had these thoughts for a long time and just wanted to share them. I'm open to discussion about this proposal :)

@lalitpagaria lalitpagaria changed the title [Proposal] Introducing EmbeddingStore in Haystack [Proposal] EmbeddingStore in Haystack Apr 29, 2021
@tholor
Member

tholor commented Jun 7, 2021

Hey @lalitpagaria ,

This is a very thoughtful suggestion and a topic that also came up in some discussions in our team in recent months. I thought about it quite a lot over the last few days and discussed it with @oryx1729 .

Let me try to summarize our key thoughts here:

Pros:

  • Cleaner separation in the code
  • Easier combination of new docstores and embeddingstores (e.g. allowing mongoDB + milvus)

Cons:

  • Significantly bigger test surface if we want to support all combinations
  • If we don't support all combinations, it will be hard to document / explain what is possible and what not
  • Optimization and benchmarking become trickier - e.g. when we need to optimize FAISS for usage with both SQL and Elasticsearch
  • Harder for the user to pick the "right" combination
  • We believe the future lies more in hybrid stores (e.g. Elasticsearch with an ANN plugin as in Open Distro, Weaviate, ...), and for these hybrid stores we think the separation into docstores and embedding stores is more confusing than helpful
  • We haven't seen demand from the community for custom combinations like FAISS+elasticsearch yet. If this changes, and we see important use cases enabled by this direction, this would be an important game changer for this discussion.

All in all, we believe the value today is not really worth the costs (cons from above + implementation work). That's why we would not like to go down this road right now. Does this make sense to you?

@tholor tholor added topic:document_store type:feature New feature or request labels Jun 7, 2021
@lalitpagaria
Contributor Author

@tholor Yes, I agree with you, as it is a big-ticket item that also involves educating the community about the new naming.
So let's close it for now; we can revisit it in the future based on need. WDYT?

@tholor
Member

tholor commented Jun 7, 2021

Yes, let's close it for now, and if we see more evidence of need from the community we can reconsider it :)
