[DOC] Expand HNSW params section
itaismith committed Jan 8, 2025
1 parent f987ba6 commit 7eb0688
Showing 1 changed file with 32 additions and 1 deletion.
@@ -47,4 +47,35 @@ let collection = await client.createCollection({

{% /TabbedCodeBlock %}

You can learn more in our [Embeddings section](../embeddings/embedding-functions).

## Fine-Tuning HNSW Parameters

HNSW (Hierarchical Navigable Small World) is the index we use to perform approximate nearest neighbor (ANN) search for a given embedding. In this context:
* **Accuracy** measures how close the approximate results are to the true nearest neighbors.
* **Recall** refers to how many of the true nearest neighbors were retrieved.

Increasing `search_ef` normally improves accuracy and recall, but makes queries slower. Similarly, increasing `construction_ef` improves accuracy and recall, but increases memory usage and the time needed to build the index.
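As a minimal sketch of where these two parameters might be set, the helper below builds collection metadata and passes it to `create_collection`. It assumes the settings are supplied as `hnsw:`-prefixed metadata keys (`hnsw:construction_ef` and `hnsw:search_ef`) when the collection is created; the helper names `build_hnsw_metadata` and `create_tuned_collection` are hypothetical and not part of Chroma.

```python
from typing import Any, Dict


def build_hnsw_metadata(construction_ef: int, search_ef: int) -> Dict[str, Any]:
    """Build collection metadata carrying the two HNSW settings discussed above.

    Assumption: the settings are passed as "hnsw:"-prefixed metadata keys.
    """
    if construction_ef < 1 or search_ef < 1:
        raise ValueError("construction_ef and search_ef must be positive integers")
    return {
        "hnsw:construction_ef": construction_ef,
        "hnsw:search_ef": search_ef,
    }


def create_tuned_collection(
    client: Any, name: str, construction_ef: int, search_ef: int
) -> Any:
    """Create a collection on an existing client with the given HNSW settings."""
    return client.create_collection(
        name=name,
        metadata=build_hnsw_metadata(construction_ef, search_ef),
    )
```

With a real client this could be called as `create_tuned_collection(chromadb.Client(), "my_collection", construction_ef=200, search_ef=100)`, where the collection name and both values are placeholders rather than recommendations.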

The right values for your HNSW parameters depend on your data, your embedding function, and your requirements for accuracy, recall, and performance. You may need to experiment with different construction and search settings to find a combination that meets those requirements.
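One way to make such experiments measurable is to compare the ids returned by the index against a brute-force ground truth. The sketch below is an illustration rather than part of Chroma: `exact_neighbors` and `recall_at_k` are hypothetical helpers, and squared Euclidean distance is assumed as the metric (swap in your collection's distance function if it differs).

```python
from typing import Dict, Hashable, List, Sequence


def exact_neighbors(
    query: Sequence[float], vectors: Dict[Hashable, Sequence[float]], k: int
) -> List[Hashable]:
    """Brute-force top-k ids by squared Euclidean distance (the ground truth)."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank every id by its distance to the query, then keep the k closest.
    ranked = sorted(vectors, key=lambda item_id: sq_dist(query, vectors[item_id]))
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[Hashable], exact_ids: Sequence[Hashable]) -> float:
    """Fraction of the true nearest neighbors that the approximate search returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)
```

Running the same set of queries against collections built with different settings and averaging `recall_at_k` alongside the query time gives a simple basis for choosing between them. Brute force is slow on large collections, so a sample of queries is usually enough.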

For example, for a dataset with 50,000 embeddings of 2048 dimensions, generated by

```python
import numpy as np
embeddings = np.random.randn(50000, 2048).astype(np.float32).tolist()
```

we set up two Chroma collections:
* The first is configured with `hnsw:search_ef: 10`. When we query with a specific embedding from the set (the one with `id = 1`), the query takes `0.00529` seconds, and we get back embeddings with distances:

```
[3629.019775390625, 3666.576904296875, 3684.57080078125]
```

* The second collection is configured with `hnsw:search_ef: 100`. When we issue the same query, it takes `0.00753` seconds (slightly slower) but returns better results, as measured by their distances:

```
[0.0, 3620.593994140625, 3623.275390625]
```

The first collection failed to return the query embedding (`id = 1`) itself, even though it is in the collection and should have appeared with a distance of `0.0`. The second collection, while slightly slower, found the query embedding (shown by the `0.0` distance) and returned closer neighbors overall, demonstrating better accuracy at the cost of query speed.
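The comparison above can be checked mechanically from the two distance lists. The short sketch below reuses the reported distances; the helper names and the tolerance `tol` are illustrative assumptions.

```python
from typing import List, Sequence

# Distances reported above for the same query against the two collections.
low_ef_distances = [3629.019775390625, 3666.576904296875, 3684.57080078125]
high_ef_distances = [0.0, 3620.593994140625, 3623.275390625]


def found_exact_match(distances: Sequence[float], tol: float = 1e-6) -> bool:
    """True if any result lies at (numerically) zero distance from the query."""
    return any(abs(d) <= tol for d in distances)


def without_exact_match(distances: Sequence[float], tol: float = 1e-6) -> List[float]:
    """Drop zero-distance results so only the other neighbors are compared."""
    return [d for d in distances if abs(d) > tol]


# Only the higher setting retrieved the query embedding itself.
self_found_low = found_exact_match(low_ef_distances)
self_found_high = found_exact_match(high_ef_distances)

# Even ignoring the exact match, every neighbor from the higher setting is
# closer than the best neighbor from the lower setting.
strictly_closer = max(without_exact_match(high_ef_distances)) < min(low_ef_distances)
```

Here `self_found_low` is `False`, while `self_found_high` and `strictly_closer` are both `True`, which matches the reading of the results given above.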
