Connect to existing elastic indexes #210

Draft · wants to merge 1 commit into develop

Conversation


@ehofesmann ehofesmann commented Nov 27, 2024

Goal: Make the Elasticsearch integration more flexible so that it can be connected to existing Elasticsearch indexes

Issues this addresses:

  • Previously, FiftyOne samples/patches were matched to Elasticsearch documents by using the sample/patch id, hardcoded as the Elasticsearch document id.
  • This meant that you 1) couldn't use existing Elasticsearch indexes with embeddings that may have existed before the FiftyOne dataset was created, and 2) couldn't use the same index for two datasets, even if the datasets have the exact same samples (and therefore embeddings)

Summary of changes:

  • Add new kwargs defining the FiftyOne and Elasticsearch field/attribute names that contain the "keys" used to match a FiftyOne sample with a document in an Elasticsearch index (see the sketch after this list)
    • This allows you to map a field on the samples/patches of an existing FiftyOne dataset to an arbitrarily named field in an Elasticsearch index
  • Store sample id information alongside patch ids for embeddings in a patch index

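As a concrete illustration of the proposed kwargs, here is a rough sketch of connecting a dataset to an existing index. This is not the final API: `backend`, `brain_key`, and `index_name` come from the existing Elasticsearch integration, while `key_field` and `elasticsearch_key_field` are hypothetical placeholder names for the new kwargs (see the naming question below), and all index/field names are illustrative only.

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("my-dataset")

# Connect to an Elasticsearch index that already contains embeddings.
# The two key-field kwargs below are hypothetical names for what this PR
# proposes; they define how samples are matched to existing documents
fob.compute_similarity(
    dataset,
    backend="elasticsearch",
    brain_key="existing_index_sim",
    index_name="my-existing-index",
    key_field="filepath",            # FiftyOne field used as the match key (hypothetical kwarg)
    elasticsearch_key_field="path",  # Elasticsearch field that stores the key (hypothetical kwarg)
)

# Query against the pre-existing embeddings as usual
view = dataset.sort_by_similarity(dataset.first().id, k=10, brain_key="existing_index_sim")
```
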
Questions:

  • What do you think about the naming convention for these kwargs? We may want to map these to other backends
  • What is the best way to handle _kneighbors returning samples that are not in the dataset? Potential solutions (the first two are sketched after this list):
    • If a full dataset is being searched, only return the samples from _kneighbors that are in the dataset. If a view is being searched, limit the query to only the samples that actually exist in the view:
      • Pros:
        • Currently implemented
        • Scales to large dataset sizes because no expensive filter is passed to the knn search endpoint; the results of the search are simply filtered to what is in the dataset. (Large views could still cause the issues below)
      • Cons:
        • Large views could cause issues because a filter containing all of their ids is passed into the knn search endpoint
        • Confusing behavior when fewer than k results are returned
    • Always provide an Elasticsearch knn filter for all of the samples in the dataset. The concern is that this will cause issues for large datasets, passing potentially millions of filter ids to the Elasticsearch search endpoint.
    • Rerun the knn search if fewer than k results that exist in the dataset/view were returned
  • What of this can be reused for other backends? Are there parts of the code we should abstract (like converting ids/keys, etc.)?
  • What changes would be needed in the base sort_by_similarity() method to avoid the need to remap sample ids to the key_field values inside of _kneighbors() and the other methods?

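To make the trade-off concrete, here is a rough sketch (not this PR's implementation) of the first two strategies using the 8.x elasticsearch Python client. The connection, the index name, the `vector`/`key` field names, and the way the dataset keys are gathered are all placeholders.

```python
from elasticsearch import Elasticsearch

# Placeholder connection and names; the real integration would read these
# from the brain run config
client = Elasticsearch("http://localhost:9200")
index_name = "my-embeddings-index"

# Keys of the samples currently in the FiftyOne dataset/view, e.g. the
# values of the proposed key field (placeholder values here)
dataset_keys = {"/data/img1.jpg", "/data/img2.jpg"}


def knn_post_filter(query_vector, k):
    # Option 1: run an unfiltered kNN search and drop hits whose key is not
    # in the dataset. Cheap for Elasticsearch, but can return fewer than k hits
    resp = client.search(
        index=index_name,
        knn={
            "field": "vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
        },
    )
    hits = resp["hits"]["hits"]
    return [h for h in hits if h["_source"]["key"] in dataset_keys][:k]


def knn_pre_filter(query_vector, k):
    # Option 2: pass the dataset's keys as a kNN filter. Guarantees in-dataset
    # hits, but the terms filter can grow to millions of values
    resp = client.search(
        index=index_name,
        knn={
            "field": "vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,
            "filter": {"terms": {"key": list(dataset_keys)}},
        },
    )
    return resp["hits"]["hits"]
```

Note that Elasticsearch also caps the number of values allowed in a terms query by default (the index.max_terms_count setting), which compounds the large-dataset concern for the second option.
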
TODO:

  • Add the FiftyOne migration that adds backend_patch_key_field = _id to existing Elasticsearch run configs for backwards compatibility
  • Test patch similarity flows more thoroughly
  • Improve the error message when the query sample id doesn't have a corresponding vector in Elasticsearch (a possible approach is sketched after this list). Currently it just returns BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'failed to create query: The query vector has a different number of dimensions [0] than the document vectors [512].')

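A possible direction for that last item, sketched with placeholder index and field names and assuming the 8.x elasticsearch Python client: look up the query sample's stored vector first and raise a descriptive error if it is missing, instead of letting the kNN call fail with the dimension-mismatch BadRequestError.

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")
index_name = "my-embeddings-index"  # placeholder


def get_query_vector(sample_key):
    # Hypothetical helper: fetch the stored embedding for the query sample;
    # `key` and `vector` are placeholder field names
    resp = client.search(
        index=index_name,
        query={"term": {"key": sample_key}},
        size=1,
    )
    hits = resp["hits"]["hits"]
    if not hits or not hits[0]["_source"].get("vector"):
        raise ValueError(
            f"Sample key '{sample_key}' has no embedding in Elasticsearch "
            f"index '{index_name}'; did you index a vector for this sample?"
        )
    return hits[0]["_source"]["vector"]
```
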
@ehofesmann ehofesmann self-assigned this Nov 27, 2024
@ehofesmann ehofesmann added the enhancement Code enhancement label Nov 27, 2024