Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Augment SelfQueryRetriever to support Vespa #4008

Closed
filippo82 opened this issue May 2, 2023 · 13 comments
Closed

[Feature] Augment SelfQueryRetriever to support Vespa #4008

filippo82 opened this issue May 2, 2023 · 13 comments

Comments

@filippo82
Copy link

filippo82 commented May 2, 2023

As of 0.0.155, the SelfQueryRetriever class supports Pinecone only.

I wanted to extend it myself to support Vespa too but, after reviewing the current implementation, I discovered that SelfQueryRetriever "wraps around a vector store" instead of a retriever.

Currently, there is no VectorStore implementation for Vespa. Hence, I believe it is not possible to augment SelfQueryRetriever to support Vespa.

After thinking about it for a couple of days, I think that a valid solution to this problem would require to rethink the implementation of the SelfQueryRetriever by making it wrap a retriever instead of a vector store. This sounds reasonable because each vector sore can act as a retriever thanks to the as_retriever method.

The search method added as part of the #3607 is a wrapper around either the similarity_search or max_marginal_relevance_search method which are also wrapped by the get_relevant_documents method of a retriever. Hence, refactoring SelfQueryRetriever to wrap a retriever instead of a vector store seems reasonable to me.

Obviously, mine is a fairly narrow point of view given that I mainly looked at the code related to the implementation of the SelfQueryRetriever class and I might miss key implications of such a proposed change.

I would be happy to have a go a the SelfQueryRetriever refactoring but first it would be great of someone from the code developers team could comment on this.

Tagging here @dev2049 because you were the author of #3607

@dev2049
Copy link
Contributor

dev2049 commented May 2, 2023

How were you imagining you'd use the self-querying retriever with another retriever? Would it be rephrasing the original query string?

I ask because the main method exposed by the Retriever abstraction is

def get_relevant_documents(self, query: str) -> List[Document]:

This only takes a single string query as input. the motivation for the SelfQueryRetriever is to be able to create complex, structured queries against a data store. Since Retrievers don't expose such a method, it's not particularly interesting to wrap the self-querying functionality around a Retriever. VectorStores on the other hand support arbitrary kwargs:

def search(self, query: str, search_type: str, **kwargs: Any) -> List[Document]:

Many vector stores for example support metadata filtering, and the self-querying retriever can auto-generate those filtering conditions.

My intuition is if we just want a retriever that takes a query, passes it to an LLM to rephrase it, and then applies the rephrased query to a base retriever it's wrapped around, it's probably cleaner/easier to just introduce a new Retriever class. What do you think?

cc @hwchase17

@filippo82
Copy link
Author

To answer your first question: yes, that is what I thought. My idea would be to rephrase the original query string which is then passed to a downstream retriever.

You raise all valid points of course.

Here some thoughts (in no specific order):

  • modifying def get_relevant_documents(self, query: str) to accept an additional **kwargs argument would most likely solve this issue.
  • (this is a very opinionated point) referring to vector stores, I think that the various search (retrieval) capabilities should be exposed through their "retriever interface" only. This would work as some kind of "separation of concerns" mechanism. Everything which has to do with creation/maintenance/storing of indices happens within the VectorStore subclass of the specific vector store. However, retrieving documents should happen using methods exposed by the as_retriever interface of the vector store. All of this is to support my idea that the SelfQueryRetriever class should only be aware of the retriever capabilities of an object.
  • Ideally, one would create a SelfQueryRetriever object from an LLM and retriever SelfQueryRetriever.from_llm(llm, retriever, ...) and not from a vector store SelfQueryRetriever.from_llm(llm, vectorstore, ...). Then the get_relevant_documents of SelfQueryRetriever would just call the get_relevant_documents of the specific retriever with kwargs (filters) (this would require the modification described in the first bullet point obviously).
  • If the general idea for a retriever in LangChain is to have an interface to simply perform text based queries, then yes one might think about introduce an RetrieverWithFilters class (or something like that). This class could expose a get_relevant_documents_with_filters method for example. However I still think that adding a **kwargs to get_relevant_documents might be a leaner solution (maybe even without breaking anything 🤞🏻) which does not require creating a new class.

P.S. These are comments from my 2 AM tired brain. Hopefully they still make some kind of sense. Happy to rephrase if needed.

@hwchase17
Copy link
Contributor

making this work with vespa should be a priority

i think what makes sense is to add a method to vespa like "retrieve with metadata filter" or something (better named). this can be a new method that takes in a query + a filter

we can add that method in its own pr, that should be pr #1

after that, we can figure out the right abstraction to allow to use that in self-query

@srivatsanv1
Copy link

I am using redis and would love it if both redis hybrid search as well as SelfQueryRetriever would be available for redis.

More importantly, SelfQueryRetriever could be made independent of the vectorstore which is used though. I mean the LLM which returns the filter and the value could be made a separate entity which can then be used in any vectorstore.

@dev2049
Copy link
Contributor

dev2049 commented May 3, 2023

@srivatsanv1 the SelfQueryRetriever is generic and should work with any vector store that you build a translation layer for. the way the retriever works is the structured query is first created using an internal query language, and then that is translated to whatever format a specific vector store accepts. Pinecone is just the first (and currently only) built-in translation layer, but we should add more. But you could technically already build a translation module yourself and pass it in.

@filippo82
Copy link
Author

More thoughts:


Unfortunately, adding a new method to the Vespa retriever might not do the trick. Looking at the implementation of the Vespa retriever, the query structure needs to be provided when creating the VespaRetriever object, as shown in the docs:

from langchain.retrievers.vespa_retriever import VespaRetriever

vespa_query_body = {
    "yql": "select content from paragraph where userQuery()",  # <- THIS PART
    "hits": 5,
    "ranking": "documentation",
    "locale": "en-us"
}
vespa_content_field = "content"
retriever = VespaRetriever(vespa_app, vespa_query_body, vespa_content_field)

When using the SelfQueryRetriever upstream, then this would require a modification of the value of the yql key. For example, with the query What's a movie about toys after 1990, the yql value would become:

vespa_query_body = {
    "yql": "select content from paragraph where userQuery() AND year > 1990",  # <- THIS PART
    "hits": 5,
    "ranking": "documentation",
    "locale": "en-us",
    "query": "What's a movie about toys"
}

It should be possible to implement a clever logic to handle this modification but it can quickly become a daunting task because the definition of the yql entry is entirely left to the user. I think that it should be wise to add constraints to which kind of query one can build for Vespa (if its retriever must work with the SelfQueryRetriever).

I believe it is necessary to review how a VespaRetriever object is created.


@dev2049 I think that the recently introduced search method could be removed (this is true if it is used by the SelfQueryRetriever only though).

For example, in the retrievers/self_query/base.py, one could replace this:

        search_kwargs = {**self.search_kwargs, **new_kwargs}
        docs = self.vectorstore.search(query, self.search_type, **search_kwargs)

with this:

        kwargs = {
            "search_kwargs": {**self.search_kwargs, **new_kwargs},
            "search_type": self.search_type,
        }
        retriever = self.vectorstore.as_retriever(**kwargs)
        docs = retriever.get_relevant_documents(query)

I will have a go at a draft PR to (try to) tackle the issues discussed here.

@jobergum
Copy link

jobergum commented May 4, 2023

Hey, I'm with the Vespa team.

The Vespa query API request parameter accepts queries expressed with Vespa simple query syntax, which is a typical search syntax you expect from web search engines (Well, except when Google launched G+ and they changed the way to search for an exact term).

What's a +movie about +toys +year:>1990 

So there is still some flexibility with that, but I do think that we should a find a way to override the request body further as there are a lot of parameters that could be used, for example, native Vespa embedders

@filippo82
Copy link
Author

Hi @jobergum 👋 thanks for joining the discussion.

Question: with your comment "we should a find a way to override the request body further", are you specifically referring to the SelfQueryRetriever? Or is it a more general comment for the VespaRetriever?

@jobergum
Copy link

jobergum commented May 4, 2023

I'm talking about the VespaRetriever, it should at least support passing the query string to embed functionality or multiple embedders. Currently, it will only replace the query request parameter.

@filippo82
Copy link
Author

Hi @jobergum, would you consider this a meaningful query?

        {
            "yql": "select content from paragraph where userQuery() or {targetHits: 100}nearestNeighbor(embedding, first_query)",
            "query": "Where is Waldo?",
            "input.query(first_query)": "embed(bertModel, Where is Waldo?)",
            "type": "weakAnd",
            "hits": 5,
            "ranking": "documentation",
            "locale": "en-us"
        }

Or would you push for allowing a variable number of input.query() entries?

@jobergum
Copy link

Hi @jobergum, would you consider this a meaningful query?

Yes, hybrid retrieval like this is meaningful, and the query is valid. Not that many cases have multiple query embedding models. What we hope to add soon is support for referencing native request parameters in the embed function.

{
            "yql": "select content from paragraph where userQuery() or {targetHits: 100}nearestNeighbor(embedding, first_query)",
            "query": "Where is Waldo?",
            "last_messages": ".... ",
            "input.query(first_query)": "embed(@query)",
            "input.query(conversion_embedding)": embed(@last_messages)"
        }

In this case, both the question is encoded and the last_messages for additional context, which could be used for ranking, or retrieval (with another nearestNeighbor query operator).

@filippo82
Copy link
Author

What we hope to add soon is support for referencing native request parameters in the embed function.

That would be indeed a nice improvement!

I am moving this conversation to the discussion of the PR #4546. I like what Davis has done but I will now suggest modifications to be able to leverage the input.query.

@dosubot
Copy link

dosubot bot commented Sep 15, 2023

Hi, @filippo82! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about extending the SelfQueryRetriever class to support Vespa. There has been a discussion about the best approach, with suggestions to modify the get_relevant_documents method or introduce a new RetrieverWithFilters class. The Vespa team has also joined the discussion and suggested adding a method to Vespa for retrieving with metadata filters.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 15, 2023
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 22, 2023
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Sep 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants