
index() API does not respect batch_size on vector_store.add_documents() #19415

Closed
5 tasks done
znwilkins opened this issue Mar 21, 2024 · 0 comments
Assignees
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: qdrant Primarily related to Qdrant vector store integration Ɑ: vector store Related to vector store module

Comments

@znwilkins
Contributor

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import logging
import os

from langchain.indexes import SQLRecordManager, index
from langchain.vectorstores.qdrant import Qdrant
from langchain_community.embeddings import CohereEmbeddings
from langchain_core.documents import Document
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

DOCUMENT_COUNT = 100
COLLECTION_NAME = "test_index"
COHERE_EMBED_MODEL = os.getenv("COHERE_EMBED_MODEL")
COHERE_API_KEY = os.getenv("COHERE_API_KEY")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

# Setup embeddings and vector store
embeddings = CohereEmbeddings(model=COHERE_EMBED_MODEL, cohere_api_key=COHERE_API_KEY)
vectorstore = Qdrant(
    client=QdrantClient(url="http://localhost:6333", api_key=QDRANT_API_KEY),
    collection_name=COLLECTION_NAME,
    embeddings=embeddings,
)
# Init Qdrant collection for vectors
vectorstore.client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
# Init the record manager using SQLite
namespace = f"qdrant/{COLLECTION_NAME}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()
# Init 100 example documents
documents = [Document(page_content=f"example{i}", metadata={"source": f"example{i}.txt"}) for i in range(DOCUMENT_COUNT)]

# Log at the INFO level so we can see output from httpx
logging.basicConfig(level=logging.INFO)

# Index 100 documents with a batch size of 100.
# EXPECTED: 1 call to Qdrant with 100 documents per call
# ACTUAL  : 2 calls to Qdrant with 64 and 36 documents per call, respectively
result = index(
    documents,
    record_manager,
    vectorstore,
    batch_size=100,
    cleanup="incremental",
    source_id_key="source",
)
print(result)

Error Message and Stack Trace (if applicable)

No response

Description

  • I'm trying to index documents into a vector store (Qdrant) using the index() API with a record manager. On my index() call, I specify a batch_size that is larger than the vector store's default batch_size.
  • I expect my calls to Qdrant to respect that batch_size.
  • Instead, LangChain indexes using the vector store implementation's default batch_size (64 for Qdrant).
Running the example code with DOCUMENT_COUNT set to 100, you would see two PUTs to Qdrant:

INFO:httpx:HTTP Request: PUT http://localhost:6333/collections/test_index/points?wait=true "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: PUT http://localhost:6333/collections/test_index/points?wait=true "HTTP/1.1 200 OK"
{'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Running the example code with DOCUMENT_COUNT set to 64, you would see one PUT to Qdrant:

INFO:httpx:HTTP Request: PUT http://localhost:6333/collections/test_index/points?wait=true "HTTP/1.1 200 OK"
{'num_added': 64, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

This is because the batch_size is not passed on calls to vector_store.add_documents(), which itself calls add_texts():

        if docs_to_index:
            vector_store.add_documents(docs_to_index, ids=uids)

(link)

As a result, the vector store implementation's default batch_size parameter is used instead:

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        ids: Optional[Sequence[str]] = None,
        batch_size: int = 64,  # Here's the parameter

(link)

Suggested Fix

Update the vector_store.add_documents() call in index() to include batch_size=batch_size:
https://github.com/langchain-ai/langchain/blob/v0.1.13/libs/langchain/langchain/indexes/_api.py#L333

        if docs_to_index:
            vector_store.add_documents(docs_to_index, ids=uids, batch_size=batch_size)

In doing so, the parameter is passed onward through kwargs to the final add_texts calls.
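To illustrate the kwargs pass-through, here is a minimal sketch with simplified stand-ins (these are not the actual LangChain classes; the base-class add_documents() forwards keyword arguments to add_texts()):

```python
# Simplified stand-in (not the real LangChain/Qdrant classes) showing how
# batch_size, once passed to add_documents(), reaches add_texts() via **kwargs.
class FakeQdrantStore:
    def add_documents(self, documents, **kwargs):
        # Base-class behavior: forward all keyword arguments onward.
        return self.add_texts(documents, **kwargs)

    def add_texts(self, texts, ids=None, batch_size=64, **kwargs):
        # The integration's default of 64 only applies when the caller
        # did not supply batch_size explicitly.
        return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

store = FakeQdrantStore()
docs = [f"doc{i}" for i in range(100)]
print(len(store.add_documents(docs)))                  # -> 2 (default 64: two batches)
print(len(store.add_documents(docs, batch_size=100)))  # -> 1 (fix applied: one batch)
```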

If you folks are good with this as a fix, I'm happy to open a PR (since this is my first issue on LangChain, I wanted to make sure I'm not barking up the wrong tree).

System Info

System Information
------------------
> OS:  Linux
> OS Version:  #1 SMP Wed Mar 2 00:30:59 UTC 2022
> Python Version:  3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0]

Package Information
-------------------
> langchain_core: 0.1.30
> langchain: 0.1.11
> langchain_community: 0.0.27
> langsmith: 0.1.23
> langchain_openai: 0.0.8
> langchain_text_splitters: 0.0.1

Packages not installed (Not Necessarily a Problem)
--------------------------------------------------
The following packages were not found:

> langgraph
> langserve
@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🔌: qdrant Primarily related to Qdrant vector store integration 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 21, 2024
@ccurme ccurme self-assigned this Mar 25, 2024
ccurme pushed a commit that referenced this issue Mar 25, 2024
**Description:** This change passes through `batch_size` to
`add_documents()`/`aadd_documents()` on calls to `index()` and
`aindex()` such that the documents are processed in the expected batch
size.
**Issue:** #19415
**Dependencies:** N/A
**Twitter handle:** N/A
gkorland pushed a commit to FalkorDB/langchain that referenced this issue Mar 30, 2024
chrispy-snps pushed a commit to chrispy-snps/langchain that referenced this issue Mar 30, 2024
hinthornw pushed a commit that referenced this issue Apr 26, 2024
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jun 24, 2024
@dosubot dosubot bot closed this as not planned Jul 1, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 1, 2024