Use efficient kNN filtering, fix filtering when input value is array of string #16393

tchoedak · 2024-10-06T22:11:03Z

Description

This MR introduces two changes.

Update the default approximate search to use kNN with efficient filtering, which is available as of OpenSearch 2.9. The current implementation only supports filtering via script scoring or painless scripting with pre-filtering. This document describes how efficient filtering has advantages over both pre-filtering and post-filtering. Efficient filtering also opens the door to more advanced use cases, such as supporting pagination more efficiently.
Fix filter construction for array-to-array membership when the input array is a list of strings. Similar to how the equality_postfix ensures that the correct field is searched when the input value is text, this fix ensures the correct field is also used when filtering on an array of strings.

Here is an example query built without the fix when input is an array of strings:

{'size': 100,
 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'terms': {'metadata.location': ['Nevada',
            'California',
            'Illinois']}}]}}]}},
   'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
    'params': {'field': 'embedding', 'query_value': [0.1, 0.1, 0.1]}}}}}

With the fix, the rebuilt query looks like this:

{'size': 100,
 'query': {'script_score': {'query': {'bool': {'filter': [{'bool': {'must': [{'terms': {'metadata.location.keyword': ['Nevada',
            'California',
            'Illinois']}}]}}]}},
   'script': {'source': "1/(1.0 + l2Squared(params.query_value, doc['embedding']))",
    'params': {'field': 'embedding', 'query_value': [0.1, 0.1, 0.1]}}}}}
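The difference between the two queries above is only the field name inside the terms clause. A minimal sketch of that selection logic follows. The helper names `terms_field_for` and `build_terms_filter` are hypothetical and are not taken from the client code, and the sketch assumes the index maps string metadata as text with a `.keyword` sub-field (the usual dynamic mapping).

```python
from typing import Any, Dict, List


def terms_field_for(key: str, values: List[Any]) -> str:
    """Pick the field a terms filter should target (hypothetical helper).

    Assumption: string values are indexed as text with a `.keyword`
    sub-field, so exact matching has to go through that sub-field.
    """
    if values and all(isinstance(v, str) for v in values):
        return f"{key}.keyword"
    return key


def build_terms_filter(key: str, values: List[Any]) -> Dict[str, Any]:
    """Build a terms clause against the field chosen above."""
    return {"terms": {terms_field_for(key, values): values}}


string_filter = build_terms_filter(
    "metadata.location", ["Nevada", "California", "Illinois"]
)
numeric_filter = build_terms_filter("metadata.year", [2023, 2024])
```

Here `string_filter` targets `metadata.location.keyword`, while `numeric_filter` keeps the plain `metadata.year` field because its values are not strings.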

Tests were added for efficient filtering and for the array-of-strings fix. The test fixture and all existing tests were updated to use an index with lucene as the engine as well as the default nmslib engine.

NOTE: By default, the OpensearchVectorClient still initializes with engine=nmslib, and either the painless or the script scoring method is used for kNN searching when filters are applied.
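For comparison with the script_score queries shown earlier, here is a rough sketch of what an efficient-filtering query body looks like: the filter sits inside the knn clause instead of wrapping a scoring script. The function name `build_efficient_knn_query` and its parameters are illustrative assumptions, not the client's actual API; only the query shape follows the OpenSearch k-NN query DSL.

```python
from typing import Any, Dict, List


def build_efficient_knn_query(
    query_vector: List[float],
    k: int,
    filter_clause: Dict[str, Any],
    vector_field: str = "embedding",
) -> Dict[str, Any]:
    """Build an approximate kNN query with the filter nested in the knn clause.

    Hypothetical helper for illustration; requires an index whose engine
    supports efficient filtering (e.g. lucene or faiss).
    """
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": query_vector,
                    "k": k,
                    # The filter is applied during the kNN search itself.
                    "filter": filter_clause,
                }
            }
        },
    }


location_filter = {
    "bool": {
        "must": [
            {
                "terms": {
                    "metadata.location.keyword": [
                        "Nevada",
                        "California",
                        "Illinois",
                    ]
                }
            }
        ]
    }
}
query = build_efficient_knn_query(
    [0.1, 0.1, 0.1], k=100, filter_clause=location_filter
)
```

With an nmslib index this query shape would not apply, which is why the default path still falls back to script scoring or painless scripting.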

There are no dependencies added for this change.

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

Yes
No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

Yes
No

Type of Change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

I added new unit tests to cover this change
I believe this change is already covered by existing unit tests

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added Google Colab support for the newly added notebooks.
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods


tchoedak · 2024-10-07T18:30:56Z

@logan-markewich I'm confused on the coverage reporting:

18% for opensearch/base.py
29% for test_opensearch_client.py

I don't think I've drastically reduced coverage for base.py, and I'm not sure why test coverage on the test module itself is necessary.

logan-markewich · 2024-10-07T22:20:47Z

@tchoedak all the tests for OpenSearch are marked as skipif, hence there is basically zero coverage day-to-day.

Ideally there would be a properly mocked out client

I agree that the coverage probably shouldn't check the test file (although it would be at 100% if there was a mocked out client lol)

We recently added this check, so I won't hold it against you to write out a mocked client
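As a rough idea of what a mocked-out client test could look like, here is a minimal sketch using `unittest.mock.MagicMock`. The `run_search` helper is hypothetical and stands in for whatever code path calls the OpenSearch client; it is not the real `OpensearchVectorClient` interface, which may be async and shaped differently.

```python
from typing import Any, Dict, List
from unittest.mock import MagicMock


def run_search(client: Any, index: str, body: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: send a search body and return the hit ids."""
    response = client.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]


def test_run_search_uses_mocked_client() -> None:
    # No running OpenSearch instance is needed, so no skipif marker either.
    client = MagicMock()
    client.search.return_value = {
        "hits": {
            "hits": [
                {"_id": "a", "_score": 0.9},
                {"_id": "b", "_score": 0.5},
            ]
        }
    }
    body = {"size": 2, "query": {"match_all": {}}}

    assert run_search(client, "test-index", body) == ["a", "b"]
    # Verify the exact request the code under test sent to the client.
    client.search.assert_called_once_with(index="test-index", body=body)
```

A test written this way runs in CI without a cluster, so it would count toward coverage of base.py.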

Tenzin Choedak added 2 commits October 6, 2024 13:22

update opensearch client use efficient knn filtering when engine is l…

3c3ecf7

…ucene or faiss. fix terms search when input is a list of strings.

update opensearch client use efficient knn filtering when engine is l…

8da24e9

…ucene or faiss. fix terms search when input is a list of strings.

dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Oct 6, 2024

Tenzin Choedak and others added 3 commits October 6, 2024 15:13

bump version

aa3d42e

rm debug statement

4dc293c

Merge branch 'main' into main

a27afbb

Merge branch 'main' into main

5c6da9a

tchoedak added 3 commits October 8, 2024 09:32

Merge branch 'main' into main

2ad81d6

Merge branch 'main' into main

da2350d

Merge branch 'main' into main

6228436

logan-markewich approved these changes Oct 14, 2024


dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 14, 2024

logan-markewich merged commit c872b58 into run-llama:main Oct 14, 2024
10 of 11 checks passed
