
index() API does not respect batch_size on vector_store.add_documents() #19415

Closed
5 tasks done
znwilkins opened this issue Mar 21, 2024 · 0 comments
Assignees
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: qdrant Primarily related to Qdrant vector store integration Ɑ: vector store Related to vector store module

Comments

@znwilkins
Contributor

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

import logging
import os

from langchain.indexes import SQLRecordManager, index
from langchain.vectorstores.qdrant import Qdrant
from langchain_community.embeddings import CohereEmbeddings
from langchain_core.documents import Document
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

DOCUMENT_COUNT = 100
COLLECTION_NAME = "test_index"
COHERE_EMBED_MODEL = os.getenv("COHERE_EMBED_MODEL")
COHERE_API_KEY = os.getenv("COHERE_API_KEY")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")

# Setup embeddings and vector store
embeddings = CohereEmbeddings(model=COHERE_EMBED_MODEL, cohere_api_key=COHERE_API_KEY)
vectorstore = Qdrant(
    client=QdrantClient(url="http://localhost:6333", api_key=QDRANT_API_KEY),
    collection_name=COLLECTION_NAME,
    embeddings=embeddings,
)
# Init Qdrant collection for vectors
vectorstore.client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
# Init the record manager using SQLite
namespace = f"qdrant/{COLLECTION_NAME}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()
# Init 100 example documents
documents = [Document(page_content=f"example{i}", metadata={"source": f"example{i}.txt"}) for i in range(DOCUMENT_COUNT)]

# Log at the INFO level so we can see output from httpx
logging.basicConfig(level=logging.INFO)

# Index 100 documents with a batch size of 100.
# EXPECTED: 1 call to Qdrant with 100 documents per call
# ACTUAL  : 2 calls to Qdrant with 64 and 36 documents per call, respectively
result = index(
    documents,
    record_manager,
    vectorstore,
    batch_size=100,
    cleanup="incremental",
    source_id_key="source",
)
print(result)

Error Message and Stack Trace (if applicable)

No response

Description

  • I'm trying to index documents into a vector store (Qdrant) using the index() API with a record manager. On my index() call, I specify a batch_size that is larger than the vector store's default batch_size.
  • I expect my calls to Qdrant to respect that batch_size.
  • Instead, LangChain indexes using the vector store implementation's default batch_size (64 for Qdrant).
Running the example code with DOCUMENT_COUNT set to 100, you would see two PUTs to Qdrant:

INFO:httpx:HTTP Request: PUT http://localhost:6333/collections/test_index/points?wait=true "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: PUT http://localhost:6333/collections/test_index/points?wait=true "HTTP/1.1 200 OK"
{'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

Running the example code with DOCUMENT_COUNT set to 64, you would see one PUT to Qdrant:

INFO:httpx:HTTP Request: PUT http://localhost:6333/collections/test_index/points?wait=true "HTTP/1.1 200 OK"
{'num_added': 64, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}

This is because the batch_size is not passed on calls to vector_store.add_documents(), which itself calls add_texts():

        if docs_to_index:
            vector_store.add_documents(docs_to_index, ids=uids)

(link)

As a result, the vector store implementation's default batch_size parameter is used instead:

    def add_texts(
        self,
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        ids: Optional[Sequence[str]] = None,
        batch_size: int = 64,  # Here's the parameter

(link)

Suggested Fix

Update the vector_store.add_documents() call in index() to include batch_size=batch_size:
https://github.com/langchain-ai/langchain/blob/v0.1.13/libs/langchain/langchain/indexes/_api.py#L333

        if docs_to_index:
            vector_store.add_documents(docs_to_index, ids=uids, batch_size=batch_size)

In doing so, the parameter is passed onward through kwargs to the final add_texts calls.
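To illustrate the kwargs pass-through, here is a minimal sketch with simplified stand-ins (these are not the actual LangChain classes; the base-class add_documents() forwards keyword arguments to add_texts()):

```python
# Simplified stand-in (not the real LangChain/Qdrant classes) showing how
# batch_size, once passed to add_documents(), reaches add_texts() via **kwargs.
class FakeQdrantStore:
    def add_documents(self, documents, **kwargs):
        # Base-class behavior: forward all keyword arguments onward.
        return self.add_texts(documents, **kwargs)

    def add_texts(self, texts, ids=None, batch_size=64, **kwargs):
        # The integration's default of 64 only applies when the caller
        # did not supply batch_size explicitly.
        return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

store = FakeQdrantStore()
docs = [f"doc{i}" for i in range(100)]
print(len(store.add_documents(docs)))                  # -> 2 (default 64: two batches)
print(len(store.add_documents(docs, batch_size=100)))  # -> 1 (fix applied: one batch)
```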

If you folks are good with this as a fix, I'm happy to open a PR (since this is my first issue on LangChain, I wanted to make sure I'm not barking up the wrong tree).

System Info

System Information
------------------
> OS:  Linux
> OS Version:  #1 SMP Wed Mar 2 00:30:59 UTC 2022
> Python Version:  3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0]

Package Information
-------------------
> langchain_core: 0.1.30
> langchain: 0.1.11
> langchain_community: 0.0.27
> langsmith: 0.1.23
> langchain_openai: 0.0.8
> langchain_text_splitters: 0.0.1

Packages not installed (Not Necessarily a Problem)
--------------------------------------------------
The following packages were not found:

> langgraph
> langserve
@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🔌: qdrant Primarily related to Qdrant vector store integration 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 21, 2024
@ccurme ccurme self-assigned this Mar 25, 2024
ccurme pushed a commit that referenced this issue Mar 25, 2024
**Description:** This change passes through `batch_size` to
`add_documents()`/`aadd_documents()` on calls to `index()` and
`aindex()` such that the documents are processed in the expected batch
size.
**Issue:** #19415
**Dependencies:** N/A
**Twitter handle:** N/A
gkorland pushed a commit to FalkorDB/langchain that referenced this issue Mar 30, 2024
chrispy-snps pushed a commit to chrispy-snps/langchain that referenced this issue Mar 30, 2024
hinthornw pushed a commit that referenced this issue Apr 26, 2024
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jun 24, 2024
@dosubot dosubot bot closed this as not planned Jul 1, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 1, 2024