Use efficient kNN filtering, fix filtering when input value is array of string #16393

Merged on Oct 14, 2024 (9 commits)

tchoedak (Contributor) commented Oct 6, 2024

Description

This PR introduces two changes.

  1. Update the default approximate search to use kNN with efficient filtering, which is available as of OpenSearch 2.9. The current implementation only supports filtering through script scoring or Painless scripting with pre-filtering. This document describes the advantages of efficient filtering over both pre-filtering and post-filtering. Efficient filtering also opens the door to more advanced use cases, such as more efficient pagination.

  2. Fix filter construction for array-to-array membership when the input array is a list of strings. Similar to how equality_postfix ensures that the search targets the correct field when the input value is text, this fix ensures the correct field is also used when filtering on an array of strings.

Here is an example query built without the fix when input is an array of strings:

{'size': 100,
 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'terms': {'metadata.location': ['Nevada',
            'California',
            'Illinois']}}]}}]}},
   'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
    'params': {'field': 'embedding', 'query_value': [0.1, 0.1, 0.1]}}}}}

With the fix, the rebuilt query looks like this:

{'size': 100,
 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'terms': {'metadata.location.keyword': ['Nevada',
            'California',
            'Illinois']}}]}}]}},
   'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
    'params': {'field': 'embedding', 'query_value': [0.1, 0.1, 0.1]}}}}}
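
The difference between the two queries above is only the field name inside the terms clause. A minimal sketch of that field selection follows; the helper names terms_field and build_terms_filter are hypothetical and are not taken from this PR's diff, and the sketch assumes string metadata is indexed as text with a .keyword subfield (OpenSearch's default dynamic mapping).

```python
from typing import Any, Dict, List


def terms_field(key: str, value: Any) -> str:
    """Pick the field name to filter on (hypothetical helper).

    Assumption: string values are mapped as `text` with a `.keyword`
    subfield, so exact matching has to target `<key>.keyword`.
    """
    is_str = isinstance(value, str)
    is_str_list = (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(v, str) for v in value)
    )
    return f"{key}.keyword" if (is_str or is_str_list) else key


def build_terms_filter(key: str, values: List[Any]) -> Dict[str, Any]:
    """Build a `terms` clause that matches any of the given values."""
    return {"terms": {terms_field(key, values): values}}


# A list of strings is routed to the `.keyword` subfield.
location_filter = build_terms_filter(
    "metadata.location", ["Nevada", "California", "Illinois"]
)
# A list of numbers keeps the plain field name.
year_filter = build_terms_filter("metadata.year", [2023, 2024])
```

With this sketch, location_filter targets metadata.location.keyword, matching the fixed query, while the numeric filter is left unchanged.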

Tests were added for efficient filtering and for the array-of-strings fix. The test fixture and all existing tests were updated to run against an index that uses lucene as the engine as well as one that uses the default nmslib engine.
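
For readers unfamiliar with how the engine is chosen, the index body below is a rough sketch of a kNN index definition with a configurable engine. The function name build_knn_index_body, the field name embedding, and the hnsw/l2 settings are illustrative assumptions, not the fixture used in this PR.

```python
from typing import Any, Dict


def build_knn_index_body(
    dimension: int,
    engine: str = "nmslib",
    vector_field: str = "embedding",
) -> Dict[str, Any]:
    """Return an index body with one knn_vector field (illustrative only)."""
    return {
        # kNN search has to be enabled at the index level.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    # The engine is fixed when the index is created.
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": engine,
                    },
                }
            }
        },
    }


# One body per engine, mirroring the idea of testing against both.
lucene_body = build_knn_index_body(dimension=3, engine="lucene")
nmslib_body = build_knn_index_body(dimension=3)
```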

NOTE: By default, OpensearchVectorClient still initializes with engine=nmslib, and when filters are applied it uses either the painless or the script scoring method for kNN search.
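
To make the two code paths concrete, here is a sketch of how a query body might be chosen by engine. It assumes that efficient filtering applies to the lucene and faiss engines and that other engines fall back to the script-scoring shape shown in the example queries above; build_filtered_knn_query and EFFICIENT_FILTER_ENGINES are hypothetical names, not the PR's actual implementation.

```python
from typing import Any, Dict, List

# Assumption: engines for which the filter can live inside the knn clause.
EFFICIENT_FILTER_ENGINES = {"lucene", "faiss"}


def build_filtered_knn_query(
    engine: str,
    query_vector: List[float],
    k: int,
    filter_clause: Dict[str, Any],
    vector_field: str = "embedding",
) -> Dict[str, Any]:
    """Build a filtered kNN search body for the given engine (sketch)."""
    if engine in EFFICIENT_FILTER_ENGINES:
        # Efficient filtering: the filter sits inside the knn clause itself.
        return {
            "size": k,
            "query": {
                "knn": {
                    vector_field: {
                        "vector": query_vector,
                        "k": k,
                        "filter": filter_clause,
                    }
                }
            },
        }
    # Fallback: pre-filter with a bool query, then score with a script.
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": {"bool": {"filter": [filter_clause]}},
                "script": {
                    "source": (
                        "1/(1.0 + l2Squared(params.query_value, "
                        f"doc['{vector_field}']))"
                    ),
                    "params": {"field": vector_field, "query_value": query_vector},
                },
            }
        },
    }


location_filter = {
    "bool": {"must": [{"terms": {"metadata.location.keyword": ["Nevada"]}}]}
}
lucene_query = build_filtered_knn_query("lucene", [0.1, 0.1, 0.1], 100, location_filter)
nmslib_query = build_filtered_knn_query("nmslib", [0.1, 0.1, 0.1], 100, location_filter)
```

In the efficient form the engine decides how to combine the filter with the vector search, so no scoring script is needed.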

There are no dependencies added for this change.

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

Tenzin Choedak added 2 commits October 6, 2024 13:22
…ucene or faiss. fix terms search when input is a list of strings.
…ucene or faiss. fix terms search when input is a list of strings.
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Oct 6, 2024
tchoedak (Contributor, Author) commented Oct 7, 2024

@logan-markewich I'm confused on the coverage reporting:

  • 18% for opensearch/base.py
  • 29% for test_opensearch_client.py

I don't think I've drastically reduced coverage for base.py, and I'm not sure why test coverage on the test module itself is necessary.

logan-markewich (Collaborator) commented

@tchoedak all the tests for OpenSearch are marked as skipif, so there is basically zero coverage day-to-day.

Ideally there would be a properly mocked-out client.

I agree that the coverage check probably shouldn't include the test file (although it would be at 100% if there was a mocked-out client lol).

We recently added this check, so I won't hold it against you for not writing a mocked client.
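
As a rough illustration of the mocked-out client idea, the sketch below uses unittest.mock to stand in for a search client so that a test can run without a live cluster. The function search_ids and the index name are hypothetical; the mocked response follows the usual hits.hits layout of a search response.

```python
from typing import Any, Dict, List
from unittest.mock import MagicMock


def search_ids(client: Any, index: str, body: Dict[str, Any]) -> List[str]:
    """Run a search and return the matching document ids (hypothetical helper)."""
    response = client.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]


# Stand-in for a real client: `search` returns a canned response.
fake_client = MagicMock()
fake_client.search.return_value = {
    "hits": {
        "hits": [
            {"_id": "doc-a", "_score": 0.9},
            {"_id": "doc-b", "_score": 0.5},
        ]
    }
}

query_body = {
    "size": 2,
    "query": {"knn": {"embedding": {"vector": [0.1, 0.1, 0.1], "k": 2}}},
}
ids = search_ids(fake_client, "test-index", query_body)

# The mock records the call, so the request body can be asserted on as well.
fake_client.search.assert_called_once_with(index="test-index", body=query_body)
```

Tests written this way would not need a skipif guard, since nothing connects to OpenSearch.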

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 14, 2024
@logan-markewich logan-markewich merged commit c872b58 into run-llama:main Oct 14, 2024
10 of 11 checks passed