Connect to existing elastic indexes #210

Draft · wants to merge 1 commit into develop

Conversation


@ehofesmann ehofesmann commented Nov 27, 2024

Goal: Make the Elasticsearch integration more flexible so that it can be connected to existing Elasticsearch indexes

Issues this addresses:

  • Previously, FiftyOne samples/patches were matched to Elasticsearch documents by using the sample/patch id, hardcoded as the Elasticsearch document id.
  • This meant that you 1) couldn't use existing Elasticsearch indexes with embeddings that may have existed before the FiftyOne dataset was created, and 2) couldn't use the same index for two datasets, even if the datasets have the exact same samples (and therefore embeddings)

Summary of changes:

  • Add new kwargs defining the FiftyOne and Elasticsearch field/attribute names that contain the "keys" used to match a FiftyOne sample with a document in an Elasticsearch index (see the sketch after this list)
    • This allows you to map a field on the samples/patches of an existing FiftyOne dataset to an arbitrarily named field in an Elasticsearch index
  • Store sample id information alongside patch ids for embeddings in a patch index

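As a concrete illustration of the proposed kwargs, here is a rough sketch of connecting a dataset to an existing index. This is not the final API: `backend`, `brain_key`, and `index_name` come from the existing Elasticsearch integration, while `key_field` and `elasticsearch_key_field` are hypothetical placeholder names for the new kwargs (see the naming question below), and all index/field names are illustrative only.

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("my-dataset")

# Connect to an Elasticsearch index that already contains embeddings.
# The two key-field kwargs below are hypothetical names for what this PR
# proposes; they define how samples are matched to existing documents
fob.compute_similarity(
    dataset,
    backend="elasticsearch",
    brain_key="existing_index_sim",
    index_name="my-existing-index",
    key_field="filepath",            # FiftyOne field used as the match key (hypothetical kwarg)
    elasticsearch_key_field="path",  # Elasticsearch field that stores the key (hypothetical kwarg)
)

# Query against the pre-existing embeddings as usual
view = dataset.sort_by_similarity(dataset.first().id, k=10, brain_key="existing_index_sim")
```
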
Questions:

  • What do you think about the naming convention for these kwargs? We may want to map these to other backends
  • What is the best way to handle _kneighbors returning samples that are not in the dataset? Potential solutions (the first two are sketched after this list):
    • If a full dataset is being searched, only return the samples from _kneighbors that are in the dataset. If a view is being searched, limit the query to only the samples that actually exist in the view:
      • Pros:
        • Currently implemented
        • Scales to large dataset sizes because no expensive filter is passed to the knn search endpoint; the results of the search are simply filtered to what is in the dataset. (Large views could still cause the issues below)
      • Cons:
        • Large views could cause issues because a filter containing all of their ids is passed into the knn search endpoint
        • Confusing behavior when fewer than k results are returned
    • Always provide an Elasticsearch knn filter for all of the samples in the dataset. The concern is that this will cause issues for large datasets, passing potentially millions of filter ids to the Elasticsearch search endpoint.
    • Rerun the knn search if fewer than k results that exist in the dataset/view were returned
  • What of this can be reused for other backends? Are there parts of the code we should abstract (like converting ids/keys, etc.)?
  • What changes would be needed in the base sort_by_similarity() method to avoid the need to remap sample ids to the key_field values inside of _kneighbors() and the other methods?

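To make the trade-off concrete, here is a rough sketch (not this PR's implementation) of the first two strategies using the 8.x elasticsearch Python client. The connection, the index name, the `vector`/`key` field names, and the way the dataset keys are gathered are all placeholders.

```python
from elasticsearch import Elasticsearch

# Placeholder connection and names; the real integration would read these
# from the brain run config
client = Elasticsearch("http://localhost:9200")
index_name = "my-embeddings-index"

# Keys of the samples currently in the FiftyOne dataset/view, e.g. the
# values of the proposed key field (placeholder values here)
dataset_keys = {"/data/img1.jpg", "/data/img2.jpg"}


def knn_post_filter(query_vector, k):
    # Option 1: run an unfiltered kNN search and drop hits whose key is not
    # in the dataset. Cheap for Elasticsearch, but can return fewer than k hits
    resp = client.search(
        index=index_name,
        knn={
            "field": "vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
        },
    )
    hits = resp["hits"]["hits"]
    return [h for h in hits if h["_source"]["key"] in dataset_keys][:k]


def knn_pre_filter(query_vector, k):
    # Option 2: pass the dataset's keys as a kNN filter. Guarantees in-dataset
    # hits, but the terms filter can grow to millions of values
    resp = client.search(
        index=index_name,
        knn={
            "field": "vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
            "filter": {"terms": {"key": list(dataset_keys)}},
        },
    )
    return resp["hits"]["hits"]
```

Note that Elasticsearch also caps the number of values allowed in a terms query by default (the index.max_terms_count setting), which compounds the large-dataset concern for the second option.
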
TODO:

  • Add the FiftyOne migration that adds backend_patch_key_field = _id to existing Elasticsearch run configs for backwards compatibility
  • Test patch similarity flows more thoroughly
  • Improve the error message when the query sample id doesn't have a corresponding vector in Elasticsearch (a possible approach is sketched after this list). Currently it just returns BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: The query vector has a different number of dimensions [0] than the document vectors [512].')

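A possible direction for that last item, sketched with placeholder index and field names and assuming the 8.x elasticsearch Python client: look up the query sample's stored vector first and raise a descriptive error if it is missing, instead of letting the kNN call fail with the dimension-mismatch BadRequestError.

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
index_name = "my-embeddings-index"  # placeholder


def get_query_vector(sample_key):
    # Hypothetical helper: fetch the stored embedding for the query sample;
    # `key` and `vector` are placeholder field names
    resp = client.search(
        index=index_name,
        query={"term": {"key": sample_key}},
        size=1,
    )
    hits = resp["hits"]["hits"]
    if not hits or not hits[0]["_source"].get("vector"):
        raise ValueError(
            f"Sample key '{sample_key}' has no embedding in Elasticsearch "
            f"index '{index_name}'; did you index a vector for this sample?"
        )
    return hits[0]["_source"]["vector"]
```
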
@ehofesmann ehofesmann self-assigned this Nov 27, 2024
@ehofesmann ehofesmann added the enhancement Code enhancement label Nov 27, 2024