Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink Document Stores #1897

Closed
ZanSara opened this issue Dec 15, 2021 · 7 comments · Fixed by #4370
Closed

Rethink Document Stores #1897

ZanSara opened this issue Dec 15, 2021 · 7 comments · Fixed by #4370
Labels
topic:document_store type:refactor Not necessarily visible to the users

Comments

@ZanSara
Copy link
Contributor

ZanSara commented Dec 15, 2021

The current implementation of Haystack's document stores seems to be unsuitable to account for the great differences that exist between every document store's implementations.

This can be observed in several places:

Thanks to #1861 I came to the conclusion that a better architecture for doc stores is possible and would solve basically all of the above. In addition, it could be made backwards compatible until Haystack 2.0 (or any later major version) is released.

The suggested architecture is the following (the code is entirely runnable):

Classes

from typing import List, Optional, Dict, Union
from abc import abstractmethod


# ----- base.py -----

class BaseDocStoreBackend:
    """
    BaseDocStoreBackend would be the superclass of most current document stores.

    BaseDocStoreBackend contains (almost) only abstract methods with the very minimum 
    amount of arguments that make sense for every document store's implementation.
    So for `get_documents`, those could be `ids` and `index`, but little more.
    For example, no custom doc store's params like `custom_headers` 
    (for doc stores that make network calls)
    """
    @abstractmethod
    def get_documents_by_id(self, ids: List[str], index: Optional[str] = None):
        raise NotImplementedError("BaseDocStoreBackend is an abstract class. Use one of its implementations.")


# ----- [implementations] -----

class ElasticsearchDocStoreBackend(BaseDocStoreBackend):
    """
    This is a reincarnation of ElasticsearchDocumentStore.
    """
    def get_documents_by_id(self, ids: List[str], index: Optional[str] = None, custom_headers: Dict[str, str] = None):
        """
        Example implementation that adds an extra argument to the base signature from BaseDocStoreBackend.
        This code is fine with mypy.
        """
        print(" --- Getting documents from Elasticsearch --- ")
        return []


class SQLDatabase:
    """
    This class is a reincarnation of SQLDocumentStore.
    Note that it has no base class anymore.
    """
    def __init__(self, url: str):
        self.url = url

    def query(self, query: str):
        print(" --- Querying the SQL Database  --- ")
        return []


class FAISSDocStoreBackend(BaseDocStoreBackend):

    def __init__(self, sql_url: str):
        """
        In the case of FAISS, which used to be a subclass of SQLDocumentStore,
        we can also use a similar "composite" approach by defining an SQLDatabase class.
        This makes it really easy to mock during tests (see below) and keeps FAISSDocStoreBackend
        identical to any other BaseDocStoreBackend subclass (no other superclasses involved).
        """
        self.sql = SQLDatabase(url=sql_url)

    def get_documents_by_id(self, ids: List[str], index: Optional[str] = None):
        print(" --- Getting documents from FAISS --- ")
        return self.sql.query("query")


# ----- document_store.py ----- 

BACKEND_NAMES = {
    "elasticsearch": ElasticsearchDocStoreBackend,
    "faiss": FAISSDocStoreBackend
}


class UnknownBackendError(Exception):
    pass


class DocumentStore:
    """
    This class exposes all the public API of the Document Stores,
    and owns a BaseDocStoreBackend subclass that does most of the work.

    The BaseDocStoreBackend subclass could be either named ("elasticsearch", 
    "faiss", etc...) or passed directly. This second way makes providing
    your own doc store implementation really straightforward.

    Given that the backends will be allowed to have different parameters
    (see below), a lot of methods will accept *args and **kwargs here.
    Proactive validation will be possible but not necessary: error
    messages are readable anyway.
    """
    def __init__(self, backend: Union[str, BaseDocStoreBackend], *args, **kwargs):

        self._backend: BaseDocStoreBackend

        if isinstance(backend, BaseDocStoreBackend):
            self._backend = backend

        elif isinstance(backend, str) and backend in BACKEND_NAMES.keys():
            self._backend = BACKEND_NAMES[backend](*args, **kwargs)

        else:
            raise UnknownBackendError(f"Unknown document store backend ({backend}). "
                                       "Please provide a `BaseDocStoreBackend` subclass, "
                                       "or set `backend` to one of the following values: "
                                       ", ".join(BACKEND_NAMES.keys()))

    def get_documents_by_id(self, *args, **kwargs):
        """
        Example of a simple "forwarded" call to the backend.
        In here, we can perform proactive validation by inspecting the
        signature of the present backend, but it's not strictly necessary
        as the error message will be readable.
        """
        return self._backend.get_documents_by_id(self, *args, **kwargs)



# ----- API from the user's perspective ----- 

def my_function(doc_store: DocumentStore):
    """
    As expected, this function will work if the DocumentStore is 
    using an ElasticsearchDocStoreBackend, and will fail if the
    backend is an FAISSDocStoreBackend.

    This code is also fine with mypy.
    """
    return doc_store.get_documents_by_id(["uuid2"], custom_headers={"custom":"header"})


my_function(DocumentStore(backend="elasticsearch"))

# Example of error message
#my_function(DocumentStore(backend="faiss"))

# Traceback (most recent call last):
#   File "/home/sara/work/haystack/example.py", line 210, in <module>
#     my_function(DocumentStore(backend="faiss"))
#   File "/home/sara/work/haystack/example.py", line 206, in my_function
#     return doc_store.get_documents_by_id(["uuid2"], custom_headers={"custom":"header"})
#   File "/home/sara/work/haystack/example.py", line 192, in get_documents_by_id
#     self._backend.get_documents_by_id(self, *args, **kwargs)
# TypeError: get_documents_by_id() got an unexpected keyword argument 'custom_headers'

Testing & Mocking:

# ----- test_document_store.py ----- 

import pytest

DOCUMENTS = [{"meta": {"name": "name_1"}, "content": "text_1"}]


class MockBackend(BaseDocStoreBackend):
    def get_documents_by_id(self, ids: List[str], index: Optional[str] = None, *args, **kwargs):
        return DOCUMENTS


@pytest.fixture
def document_store():
    yield DocumentStore(backend=MockBackend())  # Also an example of how "3rd party" doc store could be set up


def test_document_store(document_store):
    assert document_store.get_documents_by_id(["uuid2"]) == DOCUMENTS


# ----- test_faiss.py ----- 

import pytest

DOCUMENTS = [{"meta": {"name": "name_1"}, "content": "text_1"}]


class MockSQL(BaseDocStoreBackend):
    def query(self, query: str, *args, **kwargs):
        return DOCUMENTS


@pytest.fixture
def faiss():
    docstore = DocumentStore(backend="faiss")
    docstore.sql = MockSQL()
    yield docstore


def test_faiss(document_store):
    assert document_store.get_documents_by_id(["uuid2"]) == DOCUMENTS

Does it work?

> python example.py 
 --- Getting documents from Elasticsearch --- 

> pytest example.py 
========== test session starts =============
platform linux -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
plugins: random-order-1.0.4, anyio-3.3.3
collected 2 items                                                                                                                                                                                                                            

example.py ..                                                                                        [100%]
==========  2 passed in 0.03s ==============

> mypy example.py 
Success: no issues found in 1 source file

Make it backwards compatible

class ElasticsearchDocumentStore(DocumentStore):
    def __init__(self, *args, **kwargs):
        super().__init__(self, backend="elasticsearch", *args, **kwargs)

    
class FAISSDocumentStore(DocumentStore):
    def __init__(self, *args, **kwargs):
        super().__init__(self, backend="faiss", *args, **kwargs)

These small classes are the full implementation. I expect them to be sufficient. Later, once we release a new major version, we can consider removing them, as they add no other value than backward-compatibility.

Time estimate

I want to be optimistic and believe that this could take me a few days (i.e. 2 to 5) to put in place, once we agree on its final form. I won't tackle the tests simplification into the same PR. It can be done later in a separate one if the resulting arch is really backwards compatible (that is, if all existing tests pass on the new code without touching them).


I'm very open to better suggestion regarding the naming, and in general please highlight all the issues you see in this implementation, so we can iterate. I see it mostly as a cleanup operation: the functionality will be moved in the backends and not a lot of new code will need to be written.

@ZanSara ZanSara added topic:document_store type:refactor Not necessarily visible to the users labels Dec 15, 2021
@lalitpagaria
Copy link
Contributor

lalitpagaria commented Dec 19, 2021

Nice idea.
You may refer #1004 and #1548 to check embedding storage related discussion.

@ZanSara
Copy link
Contributor Author

ZanSara commented Jan 12, 2022

Related: #1990

@ZanSara
Copy link
Contributor Author

ZanSara commented Jan 13, 2022

Another idea: during the dependencies refactoring (#1994), it became rather evident that not many users would ever want to have more than one or two document stores backends. For the time being, having different extras_require groups could be sufficient, but like in the case of Milvus1/2 and FAISS CPU/GPU, in some cases separate groups are not really sufficient.

An even cleaner solution would be to have each document store as a separate package. The base Haystack package could come with InMemoryDocumentStore as a fallback, while other doc stores will need to be installed separately. Note that we could still have a pip install haystack[all_docstores], which will simply pull the document store packages ad dependencies.

An external package would have full control over its dependencies, so a single haystack-milvus package could have 1x and 2x as dependency groups as long as needed, and haystack-faiss could have cpu and gpu, without polluting the extra_requires of haystack. Separate packages will also mean separate test suites, which is a benefit in my opinion. Integration tests in Haystack, however, should still test against all doc stores that we officially support.

Another advantage of these external packages approach is that it will be easier to share the maintenance burden with external contributors, and allow third parties to easily extend/fork/develop their own document store implementations without having to wait for us.

@lalitpagaria
Copy link
Contributor

Nice, I like the idea.

How we going to solve dependencies conflict. In Obsei, I am using pip-compile to resolve conflict and always pinning dependencies.

I think most of the time people going to use Haystack as a service (instead of SDK or sub dependency in another package). It is fine to pin dependencies.

@ZanSara
Copy link
Contributor Author

ZanSara commented Jan 17, 2022

That's a good question indeed. I still have to collect some feedback from the rest of the team about dependency pinning, I've heard different opinions on the topic. I personally lean for "weak" pinning, so to fix only a minimum version or a range of versions to allow for some flexibility; but that's still up to debate right now

@ZanSara
Copy link
Contributor Author

ZanSara commented Feb 21, 2022

Related: #2224

@masci
Copy link
Contributor

masci commented Nov 25, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:document_store type:refactor Not necessarily visible to the users
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants