Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support parameters for HNSW indexes using metadata #245

Merged
merged 9 commits into from
Mar 28, 2023

Conversation

levand
Copy link
Contributor

@levand levand commented Mar 28, 2023

This PR supersedes #228 with a different approach.

Description of changes

  • Change the Hnswlib class from being a singleton to being one instance per index
    • Allows multiple indexes to be loaded at the same time, prevents index thrashing
  • Configure at initialization time based on collection metadata
  • Support for non-uuid embedding IDs (to support future work)
  • Resizes the index by a factor of the current size to prevent resizing on every insert

Test plan

Unit test suite

Documentation Changes

TODO: Document how to set HNSW params by setting collection metadata.

@levand levand requested review from jeffchuber, HammadB and atroyn March 28, 2023 17:42
@levand levand marked this pull request as ready for review March 28, 2023 21:27
@levand
Copy link
Contributor Author

levand commented Mar 28, 2023

Tests are passing locally, ready to merge if they pass in CI.

Added tests to validated basic functionality: tests do not yet exercise every possible HNSW param or every possible permutation of saving/loading indexes. This should not be a problem assuming the params are not modified between save and load.

@atroyn
Copy link
Contributor

atroyn commented Mar 28, 2023

Added an additional test for index params to be persisted correctly.

@atroyn atroyn merged commit 5cef7f9 into main Mar 28, 2023
@atroyn atroyn deleted the lukev/hnsw-index-params branch March 28, 2023 23:28
Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, was mid through reading this and it was merged.... so will just leave my comments so far as is...

coll = self.get_collection_by_id(collection_id)
collection_metadata = coll[2]
index = Hnswlib(collection_id, self._settings, collection_metadata)
self.index_cache[collection_id] = index
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use an LRU cache with some simple heuristics, this seems like an easy way to get into hairy performance issues, but not sure if its premature to do so since most users don't have enough data to saturate RAM.


logger = logging.getLogger(__name__)


valid_params = {
"hnsw:space": r"^(l2|cosine|ip)$",
"hnsw:construction_ef": r"^\d+$",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

can we either use the terminology of the paper our own terms here
ef_search -> neighbors_to_explore_at_search
ef_construction -> neighbors_to_explore_at_construction
M -> max_neighbors_per_node

Reason being is the more human-readable terms are somewhat interpretable. I can reason that if I increase them, my search quality may go up, at the expense of performance. Something like M is strongly derived from the hnsw paper/impl and not really legible.

@atroyn
Copy link
Contributor

atroyn commented Mar 28, 2023

Sorry, was eager to get this in to unblock density (#226). I think we can add these things comments as issues and re-address them in future.

My impression is almost all users only ever use a single collection, and most users will only want to set the space parameter since understanding the others requires understanding HNSW for now.

@HammadB
Copy link
Collaborator

HammadB commented Mar 28, 2023

A good practice if you need something is to branch off of it, or cherry pick what you need and wait for a bit to let a PR settle if you are blocked on it. I am hesitant to give this a closer look over because now its in and others may build / use off it. Alas!

@atroyn atroyn mentioned this pull request Mar 29, 2023
atroyn added a commit that referenced this pull request Mar 29, 2023
* refactor hnswlib class

* update index usage

* udpate index usage in duckdb.py

* clean up unused method

* bug fixes

* fix bugs

* fix validation ordering

* persistence test

* Added params persistence test

---------

Co-authored-by: atroyn <[email protected]>
atroyn added a commit that referenced this pull request Mar 29, 2023
commit 5cef7f9
Author: Luke VanderHart <[email protected]>
Date:   Tue Mar 28 19:28:02 2023 -0400

    Support parameters for HNSW indexes using metadata (#245)

    * refactor hnswlib class

    * update index usage

    * udpate index usage in duckdb.py

    * clean up unused method

    * bug fixes

    * fix bugs

    * fix validation ordering

    * persistence test

    * Added params persistence test

    ---------

    Co-authored-by: atroyn <[email protected]>

commit 90ae684
Merge: e3aa896 6c64337
Author: Jeff Huber <[email protected]>
Date:   Tue Mar 28 14:36:30 2023 -0700

    Merge pull request #246 from chroma-core/0.3.14

    new version

commit 6c64337
Author: Jeffrey Huber <[email protected]>
Date:   Tue Mar 28 12:38:17 2023 -0700

    new version

commit e3aa896
Author: Hammad Bashir <[email protected]>
Date:   Tue Mar 28 11:03:25 2023 -0700

    Update RELEASE_PROCESS.md

commit e347476
Merge: fa847be e8db7aa
Author: Jeff Huber <[email protected]>
Date:   Tue Mar 28 10:40:18 2023 -0700

    Merge pull request #244 from chroma-core/fixDuckDbLogging

    create persist dir if not exist, correct logging of persistence

commit e8db7aa
Author: Jeffrey Huber <[email protected]>
Date:   Tue Mar 28 10:29:43 2023 -0700

    create persist dir if not exist, correct logging of persistence

commit fa847be
Author: Hammad Bashir <[email protected]>
Date:   Mon Mar 27 11:58:28 2023 -0700

    Release 0.3.13 (#240)

    * Add release process and autogenerate release notes

commit 4fef4d8
Author: Hammad Bashir <[email protected]>
Date:   Mon Mar 27 11:13:49 2023 -0700

    upgrade duckdb (#239)

commit d44d2df
Author: Hammad Bashir <[email protected]>
Date:   Sun Mar 26 16:06:24 2023 -0700

    Add telemetry (#233)

    Adds telemetry for chroma

    - Added types for delete return
    - Added telemetry which has a type for TelemetryEvents and a Telemetry base class.
    - Telemetry stores the user id in ~/cache/chroma and also stores the context for events. Context contains settings and if the client is running in the server.
    - Added Posthog which extends Telemetry and implements capture() while respecting anonymized_telemetry
    - Added events for client create, add and delete
    - Added a __ version __ to chroma this must now be set for every update

commit a8adedc
Author: jschachter <[email protected]>
Date:   Sun Mar 26 14:23:52 2023 -0700

    Update Collection.py (#235)

    fix misspelling

commit bbd9783
Author: Ikko Eltociear Ashimine <[email protected]>
Date:   Sat Mar 25 09:48:05 2023 +0900

    fix typo in DEVELOP.md (#231)

    programatically -> programmatically

commit 6da0b76
Merge: 4463d13 e6a7aee
Author: Jeff Huber <[email protected]>
Date:   Fri Mar 24 08:00:27 2023 -0700

    Merge pull request #230 from dbasch/main

    Added function for the Instructor models.

commit e6a7aee
Author: Diego Basch <[email protected]>
Date:   Thu Mar 23 20:07:13 2023 -0700

    added embedding function for the Instructor models.
@HammadB HammadB mentioned this pull request Apr 5, 2023
HammadB added a commit that referenced this pull request Apr 5, 2023
A bug was introduced in #245. We changed from passing uuids into the index to ids, which leads to breakage. This is a bugfix that fixes the functionality by passing in uuids, along with a test.
@mdhvandoorn
Copy link

mdhvandoorn commented Mar 9, 2024

I assume the 'ip' parameter signifies internal product. From what I have gathered, internal product does not necessarily equate dot product. So, just to be clear: does the 'ip' parameter signify the usage of dot product as the similarity metric?

@atroyn
Copy link
Contributor

atroyn commented Mar 11, 2024

See here for detail on the built-in distance functions: https://docs.trychroma.com/usage-guide#changing-the-distance-function

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants