Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set distance function and other index params #228

Closed
wants to merge 5 commits into from
Closed

Conversation

jeffchuber
Copy link
Contributor

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

@HammadB HammadB requested review from levand and HammadB March 24, 2023 01:23
@HammadB
Copy link
Collaborator

HammadB commented Mar 24, 2023

@levand would you mind taking a look at this? Particularly interested in:

  1. API for how we parameterize indices as I think this is hard to walk back and also intertwined with the refactor you are doing
  2. How we handle migrations here

@levand
Copy link
Contributor

levand commented Mar 24, 2023

@jeffchuber This makes sense to me, but it's slightly misleading, because the HNSW params are not an attribute of the collection, they are an attribute of the index (and a collection could conceivably have an index of a different type, or multiple indexes.)

In the future we'll definitely want to introduce a more flexible way for the user to express their intent of how a collection should be indexed rather than making it a top-level param to create_collection.

Under the hood, we'll likely just store these index configurations in a segment's metadata anyway. Would you be ok surfacing that to the user?

api.create_collection("my_collection", metadata={
    'hnsw:space': 'cosine',
    'hnsw:ef': 100,
    'hnsw:m': 20})
 })

The disadvantage here of course is that we lose type checking in the Python client. So if we want to preserve that, for better DX, then I'd suggest the following minor tweak to your code, and make the signature of create_collection to be a little more flexible, and take a dict of param types instead:

IndexType = Literal["hnsw", "other"]
IndexParams = Union[HnswIndexParams, OtherIndexParams]
index_params: Optional[dict[IndexType, IndexParams]] = None,

That way we have types for users, but aren't baking HNSW into the API as a special/only thing. And we'd still just store them as namespaced metadata keys under the hood.

Copy link
Contributor

@levand levand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See question about what happens to index params after modification.

Also, what's up with the massive diff in the tests file and all the commented out tests? Seems wrong?

@@ -118,6 +126,7 @@ def _modify(
current_name: str,
new_name: Optional[str] = None,
new_metadata: Optional[Dict] = None,
new_index_params: Optional[HnswIndexParams] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does modification mean here? Changing index params would mean rebuilding the whole index. Seems like we should either do that, or error if you try to change the params after the index has already been built.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now we just don't allow index modification.

@@ -81,16 +94,31 @@ def get_or_create_collection(
name: str,
metadata: Optional[Dict] = None,
embedding_function: Optional[Callable] = None,
index_params: Optional[HnswIndexParams] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we'll have to deal with alongside persisting embedding functions, but fine for now.

@levand
Copy link
Contributor

levand commented Mar 27, 2023

@atroyn any feedback on the concept of storing index params as metadata rather than adding a new column? Less complexity and risk around schema breakage.

@levand
Copy link
Contributor

levand commented Mar 27, 2023

Also just for situational awareness: #214 contains a more comprehensive approach to migrations (e.g. https://github.com/chroma-core/chroma/blob/lukev/decoupled-architecture/migrations/sysdb/00002-segments.duckdb.sql), so the migration code in this PR will be short-lived regardless (assuming we accept the approach in 214)

@jeffchuber
Copy link
Contributor Author

This functionality landed in #245

@jeffchuber jeffchuber closed this Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants