[BUG] Sporadic empty inner hits on nested kNN search #466

Closed
BM25-enthusiast opened this issue Jul 22, 2022 · 10 comments

@BM25-enthusiast

Describe the bug

Hi there,

There seems to be a bug within nested approximate kNN search. The problem is that, occasionally, documents will be returned without any inner hits. How is this possible?

To Reproduce

  1. Create an index:
{'mappings': {'properties': {'nested_object': {'properties': {'cool_vector_field': {'dimension': 5,
                                                                                    'method': {'engine': 'nmslib',
                                                                                               'name': 'hnsw',
                                                                                               'parameters': {'ef_construction': 128,
                                                                                                              'm': 24},
                                                                                               'space_type': 'innerproduct'},
                                                                                    'type': 'knn_vector'},
                                                              'some_text_field': {'type': 'text'}},
                                               'type': 'nested'}}},
 'settings': {'index': {'knn': True,
                        'knn.algo_param.ef_search': 100,
                        'refresh_interval': '30s'},
              'number_of_shards': 1}}
  2. Index over 10k documents.

  3. Perform a nested approximate-kNN query. You may have to run many queries with different vectors (even 500+) before the issue appears:

{'query': {'nested': {'inner_hits': {},
                      'path': 'nested_object',
                      'query': {'knn': {'nested_object.cool_vector_field': {'k': 3,
                                                                            'vector': [-0.53387915, -0.14078664, -0.41952186,  0.11891716, -0.30830444]}}},
                      'score_mode': 'max'}},
 'size': 3}
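
For repeated runs with different vectors, the request body above can be produced by a small helper. This is a minimal sketch, not the code from the reporter's script: `build_nested_knn_query` is a hypothetical name, and the field names are simply copied from the mapping above.

```python
from typing import Any, Dict, List


def build_nested_knn_query(vector: List[float], k: int = 3, size: int = 3) -> Dict[str, Any]:
    """Build the nested approximate-kNN request body shown above.

    Hypothetical helper: the path and field names come from the example mapping.
    """
    return {
        "size": size,
        "query": {
            "nested": {
                "path": "nested_object",
                "score_mode": "max",
                # An empty inner_hits object asks for the matching nested docs.
                "inner_hits": {},
                "query": {
                    "knn": {
                        "nested_object.cool_vector_field": {
                            "vector": list(vector),
                            "k": k,
                        }
                    }
                },
            }
        },
    }
```

Passing different `k` and `size` values makes it easy to probe the `size > k` case mentioned below.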

This is an example search result where the issue is encountered. Notice that the last two hits have empty inner hits:

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'bRkrI4IBuKQL3UqO8DkV',
                    '_index': 'synthetic_data_index',
                    '_score': 2.0406117,
                    '_source': {'nested_object': {'cool_vector_field': [0.2234513587608724, 0.8878394741163076, 0.3087446303001422, 0.5401258921662346, -0.9228053400350715],
                                                  'some_text_field': 'This is nested text for doc number 383'},
                                'non_nested_text': 'This doc is number 383'},
                    'inner_hits': {'nested_object': {'hits': {'hits': [{'_id': 'bRkrI4IBuKQL3UqO8DkV',
                                                                        '_index': 'synthetic_data_index',
                                                                        '_nested': {'field': 'nested_object', 'offset': 0},
                                                                        '_score': 2.0406117,
                                                                        '_source': {'cool_vector_field': [0.2234513587608724,
                                                                                                          0.8878394741163076,
                                                                                                          0.3087446303001422,
                                                                                                          0.5401258921662346,
                                                                                                          -0.9228053400350715],
                                                                                    'some_text_field': 'This is nested text for doc number 383'}}],
                                                              'max_score': 2.0406117,
                                                              'total': {'relation': 'eq', 'value': 1}}}}},
                   {'_id': 'bhkrI4IBuKQL3UqO8DkV',
                    '_index': 'synthetic_data_index',
                    '_score': 2.0406117,
                    '_source': {'nested_object': {'cool_vector_field': [-0.3667193179233421, 0.04664013242577236, -0.4679759075333949, 0.9335512141017783, 0.9847209912260526],
                                                  'some_text_field': 'This is nested text for doc number 384'},
                                'non_nested_text': 'This doc is number 384'},
                    'inner_hits': {'nested_object': {'hits': {'hits': [], 'max_score': None, 'total': {'relation': 'eq', 'value': 0}}}}},
                   {'_id': 'bxkrI4IBuKQL3UqO8DkV',
                    '_index': 'synthetic_data_index',
                    '_score': 2.0406117,
                    '_source': {'nested_object': {'cool_vector_field': [-0.9203098606975535, -0.8629298912981729, -0.4274567965220182, 0.5190442025173878, -0.32420767814040885],
                                                  'some_text_field': 'This is nested text for doc number 385'},
                                'non_nested_text': 'This doc is number 385'},
                    'inner_hits': {'nested_object': {'hits': {'hits': [], 'max_score': None, 'total': {'relation': 'eq', 'value': 0}}}}}],
          'max_score': 2.0406117,
          'total': {'relation': 'gte', 'value': 10000}},
 'status': 200,
 'timed_out': False,
 'took': 3}

The issue is more prevalent with datasets larger than 10,000 docs. It also seems to occur more often:

  • The more docs there are
  • The more shards that are used
  • When size > k
  • When size and k are large

I have a small script which generates lots of random documents, indexes them and then searches. Happy to share it if it would help solve the issue.
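
For anyone who wants to try this before the script is shared, documents of the same shape as the `_source` in the example result can be generated roughly as follows. This is a sketch under assumptions, not the actual script: `generate_docs` is a hypothetical name, the vector components are drawn uniformly from [-1, 1] (a guess based on the values shown above), and using the counter as `_id` is also an assumption.

```python
import random
from typing import Any, Dict, Iterator


def generate_docs(count: int, dimension: int = 5, seed: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield synthetic documents with one nested object each (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    for i in range(count):
        yield {
            "_id": str(i),  # assumption: ID mirrors the indexing order
            "non_nested_text": f"This doc is number {i}",
            "nested_object": {
                "some_text_field": f"This is nested text for doc number {i}",
                # Assumption: components uniform in [-1, 1].
                "cool_vector_field": [rng.uniform(-1.0, 1.0) for _ in range(dimension)],
            },
        }
```

The generated dictionaries still need to be sent to the cluster with whatever client is in use; that part is left out here.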

Expected behavior
No returned hits should have empty inner hits.
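
This expectation can be checked mechanically against a search response. The sketch below works on the response as a plain dictionary, in the shape shown in the example result; `hits_without_inner_hits` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def hits_without_inner_hits(response: Dict[str, Any], path: str = "nested_object") -> List[str]:
    """Return the _id of every top-level hit whose inner hits for `path` are empty."""
    missing = []
    for hit in response.get("hits", {}).get("hits", []):
        inner = (
            hit.get("inner_hits", {})
            .get(path, {})
            .get("hits", {})
            .get("hits", [])
        )
        if not inner:
            missing.append(hit["_id"])
    return missing
```

For the example result above, this would return the IDs of the second and third hits; for a correct response it returns an empty list.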

Host/Environment (please complete the following information):

  • OS: Docker container running on macOS
  • Version: OpenSearch 2.1.0
@BM25-enthusiast added the bug and untriaged labels Jul 22, 2022
@BM25-enthusiast
Author

After playing around with this problem more, it seems like all the documents retrieved without inner hits for a single query are groups of consecutively indexed documents. No idea why. For example:

['3395', '3396', '3397', '3398', '3399', '4250', '4251', '4252', '4253', '4254']

(I set the ID to correspond to the order the document is indexed)
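
The grouping can be made explicit by splitting the affected IDs into runs of consecutive numbers. A minimal sketch, assuming the IDs are numeric strings as in the list above (`consecutive_runs` is a hypothetical helper name):

```python
from typing import Iterable, List


def consecutive_runs(doc_ids: Iterable[str]) -> List[List[int]]:
    """Group numeric document IDs into runs of consecutive integers."""
    runs: List[List[int]] = []
    for value in sorted(int(doc_id) for doc_id in doc_ids):
        if runs and value == runs[-1][-1] + 1:
            runs[-1].append(value)  # extends the current run
        else:
            runs.append([value])  # gap found, start a new run
    return runs


ids = ['3395', '3396', '3397', '3398', '3399', '4250', '4251', '4252', '4253', '4254']
print(consecutive_runs(ids))
# [[3395, 3396, 3397, 3398, 3399], [4250, 4251, 4252, 4253, 4254]]
```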

@andrross transferred this issue from opensearch-project/OpenSearch Jul 26, 2022
@jmazanec15
Member

Hi @BM25-enthusiast, taking a look at this

@jmazanec15 self-assigned this Jul 26, 2022
@dblock
Member

dblock commented Jul 27, 2022

Does this belong in https://github.com/opensearch-project/k-NN?

@BM25-enthusiast
Author

> Hi @BM25-enthusiast, taking a look at this

Thank you! Let me know if I can help!

@jmazanec15
Member

@BM25-enthusiast I tried to write some code to reproduce it locally: https://gist.github.com/jmazanec15/c7bbcc4ecd3a17f58d18b33737e769bf. I got a different error:

=== Standard error of node `node{::integTest-0}` ===
»   ↓ last 40 non error or warning messages from k-NN-1/build/testclusters/integTest-0/logs/opensearch.stderr.log ↓
» fatal error in thread [opensearch[integTest-0][search][T#2]], exiting
»  java.lang.AssertionError: Sub-iterators of ConjunctionDISI are not on the same document!
»       at org.apache.lucene.search.ConjunctionDISI$BitSetConjunctionDISI.nextDoc(ConjunctionDISI.java:268)
»       at org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinScorer.setScoreAndFreq(ToParentBlockJoinQuery.java:358)
»       at org.apache.lucene.search.join.ToParentBlockJoinQuery$BlockJoinScorer.score(ToParentBlockJoinQuery.java:333)
»       at org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect(TopScoreDocCollector.java:73)
»       at org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:274)
»       at org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:254)
»       at org.opensearch.search.internal.CancellableBulkScorer.score(CancellableBulkScorer.java:71)
»       at org.apache.lucene.search.BulkScorer.score(BulkScorer.java:38)
»       at org.opensearch.search.internal.ContextIndexSearcher.searchLeaf(ContextIndexSearcher.java:277)
»       at org.opensearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:250)
»       at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:555)
»       at org.opensearch.search.query.QueryPhase.searchWithCollector(QueryPhase.java:349)
»       at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWithCollector(QueryPhase.java:443)
»       at org.opensearch.search.query.QueryPhase$DefaultQueryPhaseSearcher.searchWith(QueryPhase.java:432)
»       at org.opensearch.search.query.QueryPhase.executeInternal(QueryPhase.java:275)
»       at org.opensearch.search.query.QueryPhase.execute(QueryPhase.java:152)
»       at org.opensearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:450)
»       at org.opensearch.search.SearchService.executeQueryPhase(SearchService.java:514)
»       at org.opensearch.search.SearchService$2.lambda$onResponse$0(SearchService.java:483)
»       at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:73)
»       at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88)
»       at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
»       at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59)
»       at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:798)
»       at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
»       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
»       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
»       at java.base/java.lang.Thread.run(Thread.java:834)

^ This looks like it might also be an issue, so I'm going to investigate it. Would you be able to provide your repro script as well so that I can compare?

@BM25-enthusiast
Author

BM25-enthusiast commented Jul 29, 2022

Hi @jmazanec15,

Here is the code I used:

https://gist.github.com/BM25-enthusiast/56b00d98926942db56f8ad31830e81c8

Let me know if there's anything else I can help with.

@jmazanec15
Member

@BM25-enthusiast The issue I was running into in #466 (comment) seems to be a problem with a Lucene component, BitSetConjunctionDISI. I created an issue here: https://issues.apache.org/jira/browse/LUCENE-10674.

When I ran your script, I got the same error as above. I'm going to try to build a local fix for Lucene and validate whether it resolves the issue.
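
For context, the assertion message in the stack trace refers to an invariant of conjunction (AND) iteration: a document only counts as a match when every sub-iterator is positioned on it. The toy sketch below illustrates that invariant with plain sorted lists of doc IDs. It is not Lucene's ConjunctionDISI code, `intersect_sorted` is a hypothetical name, and it says nothing about the actual cause tracked in LUCENE-10674.

```python
from typing import List


def intersect_sorted(doc_id_lists: List[List[int]]) -> List[int]:
    """Toy leapfrog intersection of sorted doc-ID lists (not Lucene's implementation)."""
    if not doc_id_lists or any(not ids for ids in doc_id_lists):
        return []
    positions = [0] * len(doc_id_lists)
    matches: List[int] = []
    candidate = doc_id_lists[0][0]
    while True:
        agreed = True
        for i, ids in enumerate(doc_id_lists):
            # Advance iterator i to its first doc ID >= candidate.
            while positions[i] < len(ids) and ids[positions[i]] < candidate:
                positions[i] += 1
            if positions[i] == len(ids):
                return matches  # one iterator is exhausted, no more matches
            if ids[positions[i]] > candidate:
                # Overshoot: retry all iterators with the larger candidate.
                candidate = ids[positions[i]]
                agreed = False
                break
        if agreed:
            # Invariant: every sub-iterator sits on the same document.
            assert all(ids[p] == candidate for ids, p in zip(doc_id_lists, positions))
            matches.append(candidate)
            candidate += 1
```

If one of the sub-iterators were moved independently of the others, the `assert` in this sketch would fail in much the same spirit as the error above.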

@BM25-enthusiast
Author

Interesting. Wondering why I didn't get that issue. What OpenSearch version and OS are you using @jmazanec15 ? I didn't get this issue on MacOS and OpenSearch 2.1.0 (docker)

I also ran similar scripts in AWS Sagemaker without encountering this issue

@jmazanec15
Member

> Interesting. Wondering why I didn't get that issue. What OpenSearch version and OS are you using @jmazanec15 ? I didn't get this issue on MacOS and OpenSearch 2.1.0 (docker)

I was just checking out this repo at main and running ./gradlew run on my Mac. Strange that it didn't fail in a similar way. It might not have been hitting that error condition.

After running it with the #502 fix, I don't get the error anymore with my script, and I think yours works too (I ran it a couple of times):

INFO:minimal_syntethic_data_test:Finished indexing all batches. Time taken: 0:00:01.022978
INFO:minimal_syntethic_data_test:Count index. Response: {'count': 0, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
INFO:minimal_syntethic_data_test:cat indices. Response:
{'docs.count': '0',
 'docs.deleted': '0',
 'health': 'green',
 'index': 'synthetic_data_index',
 'pri': '1',
 'pri.store.size': '595.3kb',
 'rep': '0',
 'status': 'open',
 'store.size': '595.3kb',
 'uuid': 'Qa0kYhKQSPqIwHxsye70eg'}
INFO:minimal_syntethic_data_test:refreshing index. Response: {'_shards': {'failed': 0, 'successful': 1, 'total': 1}}
For a 1 shard index, k=10 and size=10, 1 queries were searched. Of the queries searched, 0 queries returned a document containing no inner hits (0.0%).
Out of 10 docs found, 0 didn't contain inner hits. (0.0%)

shard count, size, k, Problematic Queries, Problematic Documents, Total Documents, Ave query time (MS)
1, 10, 10, 0, 0, 10, 50.0
INFO:minimal_syntethic_data_test:Count index. Response: {'count': 100, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
INFO:minimal_syntethic_data_test:cat indices. Response:
{'docs.count': '200',
 'docs.deleted': '0',
 'health': 'green',
 'index': 'synthetic_data_index',
 'pri': '1',
 'pri.store.size': '595.3kb',
 'rep': '0',
 'status': 'open',
 'store.size': '595.3kb',
 'uuid': 'Qa0kYhKQSPqIwHxsye70eg'}
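
The summary figures printed in the log above (problematic queries, problematic documents, and their percentages) could be computed along these lines. This is a guess at the arithmetic, not the code from the gist; `summarize` and its input shape are hypothetical.

```python
from typing import List, Tuple


def summarize(per_query: List[Tuple[int, int]]) -> Tuple[int, int, int, float, float]:
    """Aggregate per-query counts.

    per_query holds (docs_returned, docs_without_inner_hits) for each query.
    Returns (problematic_queries, problematic_docs, total_docs,
             problematic_query_pct, problematic_doc_pct).
    """
    total_docs = sum(docs for docs, _ in per_query)
    bad_docs = sum(bad for _, bad in per_query)
    bad_queries = sum(1 for _, bad in per_query if bad > 0)
    # Guard against division by zero when nothing was searched or returned.
    query_pct = 100.0 * bad_queries / len(per_query) if per_query else 0.0
    doc_pct = 100.0 * bad_docs / total_docs if total_docs else 0.0
    return bad_queries, bad_docs, total_docs, query_pct, doc_pct
```

With the single query from the log (10 documents, none missing inner hits), `summarize([(10, 0)])` gives zero problematic queries and documents out of 10.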

I think #502 indirectly fixes it. I think the underlying issue is in Lucene. I created a fix for it in Lucene here: apache/lucene#1068. It's still in review.

I'm going to run a few more tests to see if it works correctly when there are multiple nested objects.

@jmazanec15
Member

I modified my script to add multiple nested objects. I think #502 will fix it in most cases. Here is the script.

Query vector: [0.7735246214594997, 0.2805968191366889, 0.06916041197098954, 0.13091214441925791, 0.34899567001178966]
{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 3.5810897,
        "hits": [
            {
                "_index": "test-index",
                "_id": "18",
                "_score": 3.5810897,
                "inner_hits": {
                    "nested_object": {
                        "hits": {
                            "total": {
                                "value": 2,
                                "relation": "eq"
                            },
                            "max_score": 3.5810897,
                            "hits": [
                                {
                                    "_index": "test-index",
                                    "_id": "18",
                                    "_nested": {
                                        "field": "nested_object",
                                        "offset": 0
                                    },
                                    "_score": 3.5810897,
                                    "_source": {
                                        "some_text_field": "hello",
                                        "cool_vector_field": [
                                            1.7903215149717562,
                                            1.7549253314359374,
                                            0.887724580407165,
                                            1.3986051276671214,
                                            1.3161069091284088
                                        ]
                                    }
                                },
                                {
                                    "_index": "test-index",
                                    "_id": "18",
                                    "_nested": {
                                        "field": "nested_object",
                                        "offset": 3
                                    },
                                    "_score": 3.5149467,
                                    "_source": {
                                        "some_text_field": "hello",
                                        "cool_vector_field": [
                                            1.6754646187181712,
                                            1.5121818575951325,
                                            0.8033541547254825,
                                            1.3854276165035029,
                                            1.597986768029314
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            {
                "_index": "test-index",
                "_id": "1",
                "_score": 3.521487,
                "inner_hits": {
                    "nested_object": {
                        "hits": {
                            "total": {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 3.521487,
                            "hits": [
                                {
                                    "_index": "test-index",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "nested_object",
                                        "offset": 4
                                    },
                                    "_score": 3.521487,
                                    "_source": {
                                        "some_text_field": "hello",
                                        "cool_vector_field": [
                                            1.8011595064488835,
                                            1.2149623176266322,
                                            0.9240693295161011,
                                            1.5902692895908728,
                                            1.4763408745511817
                                        ]
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

I did notice that when there are more nested objects, the old version is less likely to fail.
