community[patch]: llama cpp embeddings reset default n_batch (#17594) · langchain-ai/langchain@12843f2

When testing Nomic embeddings:
```
from langchain_community.embeddings import LlamaCppEmbeddings

embd_model_path = "/Users/rlm/Desktop/Code/llama.cpp/models/nomic-embd/nomic-embed-text-v1.Q4_K_S.gguf"
embd_lc = LlamaCppEmbeddings(model_path=embd_model_path)
# Placeholder: `query` was not defined in the original snippet.
query = "some long text to embed"
embedding_lc = embd_lc.embed_query(query)
```

We saw this error for strings longer than a certain size:
```
File ~/miniforge3/envs/llama2/lib/python3.9/site-packages/llama_cpp/llama.py:827, in Llama.embed(self, input, normalize, truncate, return_count)
    824     s_sizes = []
    826 # add to batch
--> 827 self._batch.add_sequence(tokens, len(s_sizes), False)
    828 t_batch += n_tokens
    829 s_sizes.append(n_tokens)

File ~/miniforge3/envs/llama2/lib/python3.9/site-packages/llama_cpp/_internals.py:542, in _LlamaBatch.add_sequence(self, batch, seq_id, logits_all)
    540 self.batch.token[j] = batch[i]
    541 self.batch.pos[j] = i
--> 542 self.batch.seq_id[j][0] = seq_id
    543 self.batch.n_seq_id[j] = 1
    544 self.batch.logits[j] = logits_all

ValueError: NULL pointer access
```

The default `n_batch` of llama-cpp-python's Llama is `512` but we were
explicitly setting it to `8`.

These need to be equal for embedding models:
* The embedding.cpp example has an assertion to make sure these are
always equal.
* Apparently llama-cpp-python does not enforce this properly.

With `n_batch` set to 8, passing more than 8 tokens makes the batch run
out of space, and it crashes.

This also explains why the CPU compute buffer size was small:

raw client with default `n_batch=512`
```
llama_new_context_with_model:        CPU input buffer size   =     3.51 MiB
llama_new_context_with_model:        CPU compute buffer size =    21.00 MiB
```
langchain with `n_batch=8`
```
llama_new_context_with_model:        CPU input buffer size   =     0.04 MiB
llama_new_context_with_model:        CPU compute buffer size =     0.33 MiB
```
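
The overflow described above (more tokens than `n_batch` slots) can be illustrated with a toy model. This is only a sketch: `ToyBatch` and `fits` are hypothetical names and are not llama-cpp-python code. Unlike the real batch, the toy raises a readable error instead of hitting a NULL pointer.

```python
class ToyBatch:
    """Hypothetical stand-in for a fixed-capacity token batch."""

    def __init__(self, n_batch: int) -> None:
        self.n_batch = n_batch          # number of token slots allocated up front
        self.tokens: list[int] = []     # slots filled so far

    def add_sequence(self, tokens: list[int]) -> None:
        # Refuse to write past the allocated capacity.
        free = self.n_batch - len(self.tokens)
        if len(tokens) > free:
            raise ValueError(
                f"{len(tokens)} tokens do not fit in {free} free slots "
                f"(n_batch={self.n_batch})"
            )
        self.tokens.extend(tokens)


def fits(n_tokens: int, n_batch: int) -> bool:
    """Return True if a sequence of n_tokens fits in an empty batch."""
    batch = ToyBatch(n_batch)
    try:
        batch.add_sequence(list(range(n_tokens)))
    except ValueError:
        return False
    return True
```

In this toy model, `fits(9, 8)` is false while `fits(9, 512)` is true, which mirrors the difference between the old and new defaults.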

We can work around this by passing `n_batch=512`, but this will not be
obvious to some users:
```
embedding = LlamaCppEmbeddings(model_path=embd_model_path,
                               n_batch=512)
```
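
For code that has to run against langchain-community versions from both before and after this change, one option is to always pass `n_batch` explicitly rather than rely on the default. A minimal sketch, assuming a hypothetical helper named `llamacpp_embedding_kwargs` (not part of LangChain):

```python
from typing import Any, Dict


def llamacpp_embedding_kwargs(model_path: str, n_batch: int = 512) -> Dict[str, Any]:
    """Hypothetical helper: build constructor kwargs with n_batch set explicitly."""
    if n_batch < 1:
        raise ValueError("n_batch must be at least 1")
    return {"model_path": model_path, "n_batch": n_batch}


# Usage (requires langchain-community and llama-cpp-python to be installed):
# from langchain_community.embeddings import LlamaCppEmbeddings
# embedding = LlamaCppEmbeddings(**llamacpp_embedding_kwargs(embd_model_path))
```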

From discussion w/ @cebtenzzre. Related:

abetlen/llama-cpp-python#1189

Co-authored-by: Bagatur <[email protected]>

rlancemartin and baskaryan authored Mar 29, 2024

1 parent 8e97654 commit 12843f2

libs/community/langchain_community/embeddings/llamacpp.py

@@ -47,7 +47,7 @@ class LlamaCppEmbeddings(BaseModel, Embeddings):
         """Number of threads to use. If None, the number
         of threads is automatically determined."""
-        n_batch: Optional[int] = Field(8, alias="n_batch")
+        n_batch: Optional[int] = Field(512, alias="n_batch")
         """Number of tokens to process in parallel.
         Should be a number between 1 and n_ctx."""