[BUG FIX] Fix bwc failure in neural sparse search #696

zhichao-aws · 2024-04-18T07:47:40Z

Description

Recently we found some bwc failures in neural sparse search in the rolling-upgrade case (upgrading from 2.x to 3.0).
ref:
https://github.com/opensearch-project/neural-search/actions/runs/8649948821/job/23718643879?pr=683
https://github.com/opensearch-project/neural-search/actions/runs/8728772234/job/23957283115?pr=694

The cause is that we introduced max_token_score in 2.11 and merged that PR directly into 2.x; the main branch was not involved. Although we deprecated this field after the upgrade to Lucene 9.8, it is still a breaking difference between main and 2.x. It causes a serialization/deserialization inconsistency between 3.0 nodes and 2.x nodes. This inconsistency fails the search request on the shard, and we then get an empty search response in the bwc tests above.
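To illustrate why one extra field on the wire is enough to break a request, here is a minimal sketch in Python. It is not the plugin's real wire format: `write_query`, `read_query`, the length-prefixed layout and the trailing value are all hypothetical stand-ins. It only shows that when one side writes a field the other side does not read, every later value is decoded from the wrong bytes.

```python
import io
import struct


def write_query(buf, field_name, query_text, max_token_score=None):
    """Serialize a toy query (hypothetical layout, not the real transport format).

    Passing max_token_score mimics a node whose builder still writes the field.
    """
    for text in (field_name, query_text):
        data = text.encode("utf-8")
        buf.write(struct.pack(">I", len(data)))  # 4-byte length prefix
        buf.write(data)
    if max_token_score is not None:
        buf.write(struct.pack(">f", max_token_score))  # the extra 4-byte field
    # Stand-in for whatever is serialized after the query in the same stream.
    buf.write(struct.pack(">I", 7))


def read_query(buf, expects_max_token_score):
    """Deserialize the toy query; the flag mimics whether this node knows the field."""

    def read_str():
        (length,) = struct.unpack(">I", buf.read(4))
        return buf.read(length).decode("utf-8")

    field_name = read_str()
    query_text = read_str()
    if expects_max_token_score:
        buf.read(4)  # consume the field and discard its value
    (trailer,) = struct.unpack(">I", buf.read(4))
    return field_name, query_text, trailer
```

With a stream written with `max_token_score=3.5`, a reader that expects the field gets the trailing value 7 back, while a reader that does not expect it reads the float's bytes as the trailing value. A real stream reader would typically throw at that point or shortly after.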

In this PR we add the max_token_score parsing logic to NeuralSparseQueryBuilder. We parse this field but ignore it in doToQuery. This logic is consistent with what we do in 2.x after deprecating the field. Once this is merged, the max_token_score field parsing and UT logic in main will be consistent with 2.x.
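The "parse it, keep it, but do not use it" approach can be sketched as follows. This is an illustration in Python, not the Java code in this PR: `SparseQueryBuilder`, `from_dict` and `to_query` are hypothetical names, and the accepted parameter set is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SparseQueryBuilder:
    """Toy stand-in for a query builder that tolerates a deprecated field."""

    field_name: str
    query_text: str
    model_id: Optional[str] = None
    # Accepted only so that both sides agree on the request shape.
    max_token_score: Optional[float] = None

    _KNOWN = frozenset({"query_text", "model_id", "max_token_score"})

    @classmethod
    def from_dict(cls, body):
        """Parse {"<field>": {...params...}} and reject unknown parameters."""
        field_name, params = next(iter(body.items()))
        unknown = set(params) - cls._KNOWN
        if unknown:
            raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
        return cls(
            field_name=field_name,
            query_text=params["query_text"],
            model_id=params.get("model_id"),
            max_token_score=params.get("max_token_score"),
        )

    def to_query(self):
        """Build the executable query; max_token_score is deliberately not used."""
        return {
            "field": self.field_name,
            "text": self.query_text,
            "model_id": self.model_id,
        }
```

Under these assumptions, a request that carries `max_token_score` and one that omits it produce the same executable query, while a genuinely unknown parameter is still rejected.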

Issues Resolved

#688

Check List

- New functionality includes testing.
  - All tests pass
- New functionality has been documented.
  - New functionality has javadoc added
- Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

zhichao-aws · 2024-04-18T08:09:13Z

https://github.com/opensearch-project/neural-search/actions/runs/8734284147/job/23965092639?pr=696
This bwc run failed due to a different exception, and I found that an issue already exists to track it: opensearch-project/ml-commons#2333

REPRODUCE WITH: ./gradlew ':qa:restart-upgrade:testAgainstNewCluster' --tests "org.opensearch.neuralsearch.bwc.SemanticSearchIT.testTextEmbeddingProcessor_E2EFlow" -Dtests.seed=A0EB48412B9D6A40 -Dtests.security.manager=false -Dtests.bwc.version=2.10.0 -Dtests.locale=sr-Latn -Dtests.timezone=Australia/Eucla -Druntime.java=21

org.opensearch.neuralsearch.bwc.SemanticSearchIT > testTextEmbeddingProcessor_E2EFlow FAILED
    org.opensearch.client.ResponseException: method [DELETE], host [http://127.0.0.1:40357/], URI [/_plugins/_ml/models/FZY88I4BM2JwIbQlryn-], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"},"status":500}

Suite: Test class org.opensearch.neuralsearch.bwc.SemanticSearchIT
  2> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
  2> SLF4J: Defaulting to no-operation (NOP) logger implementation
  2> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
  2> REPRODUCE WITH: ./gradlew ':qa:restart-upgrade:testAgainstNewCluster' --tests "org.opensearch.neuralsearch.bwc.SemanticSearchIT.testTextEmbeddingProcessor_E2EFlow" -Dtests.seed=A0EB48412B9D6A40 -Dtests.security.manager=false -Dtests.bwc.version=2.10.0 -Dtests.locale=sr-Latn -Dtests.timezone=Australia/Eucla -Druntime.java=21
  2> org.opensearch.client.ResponseException: method [DELETE], host [http://127.0.0.1:40357/], URI [/_plugins/_ml/models/FZY88I4BM2JwIbQlryn-], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"}],"type":"null_pointer_exception","reason":"Cannot invoke \"java.lang.Boolean.booleanValue()\" because \"isHidden\" is null"},"status":500}
        at __randomizedtesting.SeedInfo.seed([A0EB48412B9D6A40:B0771E0F66EF3A19]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:385)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:355)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:330)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.makeRequest(BaseNeuralSearchIT.java:905)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.makeRequest(BaseNeuralSearchIT.java:878)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.deleteModel(BaseNeuralSearchIT.java:937)
        at app//org.opensearch.neuralsearch.BaseNeuralSearchIT.wipeOfTestResources(BaseNeuralSearchIT.java:1231)
        at app//org.opensearch.neuralsearch.bwc.SemanticSearchIT.testTextEmbeddingProcessor_E2EFlow(SemanticSearchIT.java:46)

zhichao-aws · 2024-04-18T08:38:35Z

Just found another flaky bwc test (ref). It looks just like the flaky test for neural sparse: we ingest a doc in the old cluster, but cannot search it in the mixed cluster.

org.opensearch.neuralsearch.bwc.HybridSearchIT > testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow FAILED
    java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([15108A61038B5F36:BCA3B86F76A79AD2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.validateTestIndexOnUpgrade(HybridSearchIT.java:99)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow(HybridSearchIT.java:62)
REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster' --tests "org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow" -Dtests.seed=15108A61038B5F36 -Dtests.security.manager=false -Dtests.bwc.version=2.14.0-SNAPSHOT -Dtests.locale=hr -Dtests.timezone=Africa/Maseru -Druntime.java=11
  1> [2024-04-18T10:14:02,796][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] before test
  1> [2024-04-18T10:14:03,023][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] initializing REST clients against [http://[::1]:44971, http://127.0.0.1:46107/, http://[::1]:34955, http://127.0.0.1:37089/, http://[::1]:45727, http://127.0.0.1:37209/]
  1> [2024-04-18T10:14:07,199][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] There are still tasks running after this test that might break subsequent tests [indices:data/read/search, indices:data/read/search[phase/query], indices:data/write/bulk, indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/bulk[s][r]].

Suite: Test class org.opensearch.neuralsearch.bwc.HybridSearchIT
  2> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
  2> SLF4J: Defaulting to no-operation (NOP) logger implementation
  2> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
  2> REPRODUCE WITH: ./gradlew ':qa:rolling-upgrade:testAgainstOneThirdUpgradedCluster' --tests "org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow" -Dtests.seed=15108A61038B5F36 -Dtests.security.manager=false -Dtests.bwc.version=2.14.0-SNAPSHOT -Dtests.locale=hr -Dtests.timezone=Africa/Maseru -Druntime.java=11
  2> java.lang.AssertionError: expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([15108A61038B5F36:BCA3B86F76A79AD2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.validateTestIndexOnUpgrade(HybridSearchIT.java:99)
        at org.opensearch.neuralsearch.bwc.HybridSearchIT.testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow(HybridSearchIT.java:62)
  2> NOTE: leaving temporary files on disk at: /home/runner/work/neural-search/neural-search/qa/rolling-upgrade/build/testrun/testAgainstOneThirdUpgradedCluster/temp/org.opensearch.neuralsearch.bwc.HybridSearchIT_15108A61038B5F36-001
  2> NOTE: test params are: codec=Asserting(Lucene99): {}, docValues:{}, maxPointsInLeafNode=1523, maxMBSortInHeap=6.857239323607784, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=hr, timezone=Africa/Maseru
  2> NOTE: Linux 6.5.0-1018-azure amd64/Azul Systems, Inc. 11.0.23 (64-bit)/cpus=4,threads=3,free=457856512,total=536870912
  2> NOTE: All tests run in this JVM: [HybridSearchIT]
  1> [2024-04-18T10:14:07,226][INFO ][o.o.n.b.HybridSearchIT   ] [testNormalizationProcessor_whenIndexWithMultipleShards_E2EFlow] after test

Then I did some further experiments. I created a cluster with one 2.14 node and one 3.0 node, then created an index with 2 shards (one shard on each node) and wrote 10 docs to it. When I call the cat index API on each node, stream IO errors appear in the node log, and the API only returns the doc count of the local shard (4 + 6 across the two nodes).

I also used the simple search API on each node. The request failed to broadcast between nodes and only returned the docs of the local shard.

The search response looks like this. The search request fails to broadcast between nodes due to a stream i_o_exception:

{'took': 11,
 'timed_out': False,
 '_shards': {'total': 2,
  'successful': 1,
  'skipped': 0,
  'failed': 1,
  'failures': [{'shard': 0,
    'index': 'test',
    'node': 'GrRxwmwZRnu7WJPbuafscA',
    'reason': {'type': 'i_o_exception',
     'reason': 'Invalid vInt ((ffffffef & 0x7f) << 28) | 51ce6d'}}]},
 'hits': {' ...
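As background for the "Invalid vInt" message, here is a minimal sketch of a variable-length integer decoder, assuming the common scheme of 7 payload bits per byte with the high bit meaning "more bytes follow" and at most 5 bytes per value. `read_vint` and `write_vint` are hypothetical helpers and the real stream code may differ in detail. The point is that a reader that is out of step with the writer ends up decoding arbitrary bytes as a vInt, and a fifth byte that still has the continuation bit set cannot be a valid encoding.

```python
def write_vint(value):
    """Encode a non-negative int using 7 bits per byte, low bits first."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # set continuation bit
        value >>= 7
    out.append(value)
    return bytes(out)


def read_vint(data, pos=0):
    """Decode one vInt from data at pos; return (value, next_pos)."""
    value = 0
    for i in range(5):
        if pos + i >= len(data):
            raise ValueError("truncated vInt")
        b = data[pos + i]
        if i == 4 and b & 0x80:
            # A fifth byte that still asks for more bytes is not a valid vInt.
            raise ValueError(
                f"Invalid vInt: fifth byte {b:#x} has the continuation bit set; "
                f"low bits so far {value:#x}"
            )
        value |= (b & 0x7F) << (7 * i)
        if not b & 0x80:
            return value, pos + i + 1
    raise AssertionError("unreachable")


# Bytes chosen by hand for illustration (not captured from the cluster):
# four continuation bytes followed by 0xef, which also has the high bit set.
MISALIGNED = bytes([0xED, 0x9C, 0xC7, 0x82, 0xEF])
```

Decoding `write_vint(300)` round-trips cleanly, whereas `read_vint(MISALIGNED)` raises, which is the kind of failure a node would report if it started reading a vInt in the middle of some other field.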

In those failed bwc tests, we only ingest one document. If this doc is ingested on the other node, we get an empty search response (which is where the bwc test failed).
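The effect described above can be modeled with a small sketch, assuming a coordinating node can only collect hits from shards it hosts itself when the cross-node request fails. `search_from` and the node and shard names are hypothetical; this is a toy model, not OpenSearch code.

```python
def search_from(coordinator, shard_to_node, docs_per_shard):
    """Count hits when only shards local to the coordinator can be queried.

    Shards hosted on another node are treated as failed, mirroring the
    broadcast failure seen in the mixed-version cluster.
    """
    hits = 0
    failed = 0
    for shard, node in shard_to_node.items():
        if node == coordinator:
            hits += docs_per_shard.get(shard, 0)
        else:
            failed += 1
    return {"hits": hits, "failed_shards": failed}


# Two shards, one per node, as in the experiment above.
LAYOUT = {0: "node-2.14", 1: "node-3.0"}
```

With 10 docs split 4 and 6, each node reports only its local count. With a single doc on shard 1, a search coordinated by the other node returns 0 hits, which matches the `expected:<1> but was:<0>` assertion, and it also explains why the test is flaky: the outcome depends on where the doc lands and which node receives the request.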

zane-neo · 2024-04-19T06:14:38Z

How can we confirm this is the same issue as #688? They appear to have totally different error logs.

zane-neo · 2024-04-19T06:16:27Z

> Just found another flaky bwc test (ref). It looks just like the flaky test for neural sparse: we ingest a doc in the old cluster, but cannot search it in the mixed cluster. [...]

Is it possible to identify the incompatible object between 2.14 and 3.0? If this is a breaking change, we might be able to skip this test.

CHANGELOG.md

zhichao-aws · 2024-04-19T09:17:18Z

> How can we confirm this is the same issue as #688? They appear to have totally different error logs.

Now we have 3 flaky tests here: neural sparse search; hybrid search match query; and a test with the error log `Cannot invoke "java.lang.Boolean.booleanValue()" because "isHidden" is null`. This PR fixes the first one.

For the first flaky test, there are 2 possible test cases: one is org.opensearch.neuralsearch.bwc.NeuralQueryEnricherProcessorIT.testNeuralQueryEnricherProcessor_NeuralSparseSearch_E2EFlow, and the other is org.opensearch.neuralsearch.bwc.NeuralSparseSearchIT.testSparseEncodingProcessor_E2EFlow. The underlying reason is the same: the neural sparse search request fails to broadcast between nodes and only returns the local result.

zhichao-aws · 2024-04-19T09:18:37Z

> Is it possible to identify the incompatible object between 2.14 and 3.0? If this is a breaking change, we might be able to skip this test.

I just learned that we need to keep bwc between the latest 2.x and 3.0, so we cannot skip this test. The bwc issue in core should be considered a bug. (So far I've found that the match query and the cat index API have bwc issues.)

zane-neo · 2024-04-19T10:07:37Z

> > How can we confirm this is the same issue as #688? They appear to have totally different error logs.

> Now we have 3 flaky tests here: neural sparse search; hybrid search match query; and a test with the error log `Cannot invoke "java.lang.Boolean.booleanValue()" because "isHidden" is null`. This PR fixes the first one.

> For the first flaky test, there are 2 possible test cases: one is org.opensearch.neuralsearch.bwc.NeuralQueryEnricherProcessorIT.testNeuralQueryEnricherProcessor_NeuralSparseSearch_E2EFlow, and the other is org.opensearch.neuralsearch.bwc.NeuralSparseSearchIT.testSparseEncodingProcessor_E2EFlow. The underlying reason is the same: the neural sparse search request fails to broadcast between nodes and only returns the local result.

I don't see that this PR fixes the issue listed under Issues Resolved. Let's merge this, but not close issue #688.

vibrantvarun · 2024-04-20T06:26:31Z

LGTM

* Adding integ tests for scenario of hybrid query with aggregations (opensearch-project#632)
* Adding tests and params to ignore tests if needed
Signed-off-by: Martin Gaievski <[email protected]>
* [BUG FIX] Fix bwc failure in neural sparse search (opensearch-project#696)
* fix comments
Signed-off-by: zhichao-aws <[email protected]>
---------
Signed-off-by: Martin Gaievski <[email protected]>
Signed-off-by: zhichao-aws <[email protected]>
Co-authored-by: Martin Gaievski <[email protected]>

zhichao-aws added 2 commits April 18, 2024 15:18

fix bwc error by adding max_token_score placeholder; add UT for max_token_score

1ac9467

Signed-off-by: zhichao-aws <[email protected]>

add changelog

b242d12

Signed-off-by: zhichao-aws <[email protected]>

zhichao-aws requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es and vibrantvarun as code owners April 18, 2024 07:47

typo

c0ea9c0

Signed-off-by: zhichao-aws <[email protected]>

yuye-aws reviewed Apr 19, 2024

CHANGELOG.md

zane-neo approved these changes Apr 19, 2024

vibrantvarun approved these changes Apr 20, 2024

vibrantvarun requested a review from yuye-aws April 20, 2024 06:27

zhichao-aws merged commit 7b0229d into opensearch-project:main Apr 20, 2024
70 checks passed
