-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support parameters for HNSW indexes using metadata #245
Conversation
Tests are passing locally, ready to merge if they pass in CI. Added tests to validated basic functionality: tests do not yet exercise every possible HNSW param or every possible permutation of saving/loading indexes. This should not be a problem assuming the params are not modified between save and load. |
Added an additional test for index params to be persisted correctly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, was mid through reading this and it was merged.... so will just leave my comments so far as is...
coll = self.get_collection_by_id(collection_id) | ||
collection_metadata = coll[2] | ||
index = Hnswlib(collection_id, self._settings, collection_metadata) | ||
self.index_cache[collection_id] = index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to use an LRU cache with some simple heuristics, this seems like an easy way to get into hairy performance issues, but not sure if its premature to do so since most users don't have enough data to saturate RAM.
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
valid_params = { | ||
"hnsw:space": r"^(l2|cosine|ip)$", | ||
"hnsw:construction_ef": r"^\d+$", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
can we either use the terminology of the paper our own terms here
ef_search -> neighbors_to_explore_at_search
ef_construction -> neighbors_to_explore_at_construction
M -> max_neighbors_per_node
Reason being is the more human-readable terms are somewhat interpretable. I can reason that if I increase them, my search quality may go up, at the expense of performance. Something like M is strongly derived from the hnsw paper/impl and not really legible.
Sorry, was eager to get this in to unblock density (#226). I think we can add these things comments as issues and re-address them in future. My impression is almost all users only ever use a single collection, and most users will only want to set the space parameter since understanding the others requires understanding HNSW for now. |
A good practice if you need something is to branch off of it, or cherry pick what you need and wait for a bit to let a PR settle if you are blocked on it. I am hesitant to give this a closer look over because now its in and others may build / use off it. Alas! |
* refactor hnswlib class * update index usage * udpate index usage in duckdb.py * clean up unused method * bug fixes * fix bugs * fix validation ordering * persistence test * Added params persistence test --------- Co-authored-by: atroyn <[email protected]>
commit 5cef7f9 Author: Luke VanderHart <[email protected]> Date: Tue Mar 28 19:28:02 2023 -0400 Support parameters for HNSW indexes using metadata (#245) * refactor hnswlib class * update index usage * udpate index usage in duckdb.py * clean up unused method * bug fixes * fix bugs * fix validation ordering * persistence test * Added params persistence test --------- Co-authored-by: atroyn <[email protected]> commit 90ae684 Merge: e3aa896 6c64337 Author: Jeff Huber <[email protected]> Date: Tue Mar 28 14:36:30 2023 -0700 Merge pull request #246 from chroma-core/0.3.14 new version commit 6c64337 Author: Jeffrey Huber <[email protected]> Date: Tue Mar 28 12:38:17 2023 -0700 new version commit e3aa896 Author: Hammad Bashir <[email protected]> Date: Tue Mar 28 11:03:25 2023 -0700 Update RELEASE_PROCESS.md commit e347476 Merge: fa847be e8db7aa Author: Jeff Huber <[email protected]> Date: Tue Mar 28 10:40:18 2023 -0700 Merge pull request #244 from chroma-core/fixDuckDbLogging create persist dir if not exist, correct logging of persistence commit e8db7aa Author: Jeffrey Huber <[email protected]> Date: Tue Mar 28 10:29:43 2023 -0700 create persist dir if not exist, correct logging of persistence commit fa847be Author: Hammad Bashir <[email protected]> Date: Mon Mar 27 11:58:28 2023 -0700 Release 0.3.13 (#240) * Add release process and autogenerate release notes commit 4fef4d8 Author: Hammad Bashir <[email protected]> Date: Mon Mar 27 11:13:49 2023 -0700 upgrade duckdb (#239) commit d44d2df Author: Hammad Bashir <[email protected]> Date: Sun Mar 26 16:06:24 2023 -0700 Add telemetry (#233) Adds telemetry for chroma - Added types for delete return - Added telemetry which has a type for TelemetryEvents and a Telemetry base class. - Telemetry stores the user id in ~/cache/chroma and also stores the context for events. Context contains settings and if the client is running in the server. - Added Posthog which extends Telemetry and implements capture() while respecting anonymized_telemetry - Added events for client create, add and delete - Added a __ version __ to chroma this must now be set for every update commit a8adedc Author: jschachter <[email protected]> Date: Sun Mar 26 14:23:52 2023 -0700 Update Collection.py (#235) fix misspelling commit bbd9783 Author: Ikko Eltociear Ashimine <[email protected]> Date: Sat Mar 25 09:48:05 2023 +0900 fix typo in DEVELOP.md (#231) programatically -> programmatically commit 6da0b76 Merge: 4463d13 e6a7aee Author: Jeff Huber <[email protected]> Date: Fri Mar 24 08:00:27 2023 -0700 Merge pull request #230 from dbasch/main Added function for the Instructor models. commit e6a7aee Author: Diego Basch <[email protected]> Date: Thu Mar 23 20:07:13 2023 -0700 added embedding function for the Instructor models.
A bug was introduced in #245. We changed from passing uuids into the index to ids, which leads to breakage. This is a bugfix that fixes the functionality by passing in uuids, along with a test.
I assume the 'ip' parameter signifies internal product. From what I have gathered, internal product does not necessarily equate dot product. So, just to be clear: does the 'ip' parameter signify the usage of dot product as the similarity metric? |
See here for detail on the built-in distance functions: https://docs.trychroma.com/usage-guide#changing-the-distance-function |
This PR supersedes #228 with a different approach.
Description of changes
Test plan
Unit test suite
Documentation Changes
TODO: Document how to set HNSW params by setting collection metadata.