Adding serialization for filter field in KnnQueryBuilder #564

martin-gaievski · 2022-09-28T01:10:47Z

Signed-off-by: Martin Gaievski [email protected]

Description

Adds serialization for the filter field in the k-NN query. Without this change, a multi-node cluster can end up in an inconsistent state: some shards execute the k-NN query without the filter, so the search results become irrelevant.

Rolling and restart upgrades are a special case. Suppose an existing cluster runs 2.3 (call it old) and we want to upgrade it to 3.0 (call it new). For the upgrade to succeed, nodes need to handle the following cases:

old -> old: the existing cluster before the upgrade. This should already work.
old -> new: part of restart and rolling upgrades.
new -> old: part of a rolling upgrade.
new -> new: the upgraded cluster.

The easiest way to satisfy all criteria is to write and read the filter field only in the last case, new -> new. Writing the field in any other case would corrupt the input stream for a node that does not expect it, and the logic after the k-NN query would fail. To address this, we introduce the concept of a cluster minimum deployed version, which is set to the lowest of all versions deployed in the cluster.
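To make the gating concrete, here is a minimal sketch of the idea. It assumes plain `java.io` streams in place of OpenSearch's `StreamOutput`/`StreamInput`, a string in place of the filter `QueryBuilder`, and integer-encoded versions; the class name `VersionGatedFilterCodec` and the constant `FILTER_SUPPORTED_SINCE` are hypothetical and not taken from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the k-NN query wire format (not the plugin's code).
final class VersionGatedFilterCodec {
    // Assumed integer encoding of version 3.0.0 for this sketch.
    static final int FILTER_SUPPORTED_SINCE = 30000;

    static byte[] write(String fieldName, int k, String filter, int clusterMinVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(fieldName);
            out.writeInt(k);
            // Emit the filter section only when every node can parse it (new -> new).
            if (clusterMinVersion >= FILTER_SUPPORTED_SINCE) {
                out.writeBoolean(filter != null);
                if (filter != null) {
                    out.writeUTF(filter);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static String readFilter(byte[] payload, int clusterMinVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            in.readUTF(); // fieldName
            in.readInt(); // k
            // Mirror the writer: look for the filter only in a fully upgraded cluster.
            if (clusterMinVersion >= FILTER_SUPPORTED_SINCE && in.readBoolean()) {
                return in.readUTF();
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, a payload written while the cluster minimum version is below 3.0.0 has the same shape as the old format, so an old node can still read it.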

One more check is added for manual k-NN search queries: if the minimum cluster version is below the one that supports the filtering feature (3.0.0 for this branch), the query is rejected with a 4xx Bad Request error. This addresses cluster upgrades, where we need to avoid accepting queries with a filter until the upgrade is complete. Otherwise only some of the cluster nodes would apply the filter, which affects the recall and relevance of the results.
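The request-side check can be sketched as below. This is an illustration only: `FilterRequestGuard`, its `validate` method, the integer version encoding, and the use of `IllegalArgumentException` as a stand-in for the 4xx rejection are all assumptions, not the code in this PR.

```java
// Hypothetical guard for incoming k-NN search requests (not the plugin's code).
final class FilterRequestGuard {
    // Assumed integer encoding of version 3.0.0 for this sketch.
    static final int FILTER_SUPPORTED_SINCE = 30000;

    // Rejects a query that carries a filter while any node is still on an older version.
    static void validate(boolean hasFilter, int clusterMinVersion) {
        if (hasFilter && clusterMinVersion < FILTER_SUPPORTED_SINCE) {
            throw new IllegalArgumentException(
                "filter is not supported until every node in the cluster is upgraded"
            );
        }
    }
}
```

Queries without a filter pass through regardless of the cluster version, so existing traffic keeps working during the upgrade.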

Issues Resolved

#376

Check List

- New functionality includes testing.
- All tests pass
- Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov-commenter · 2022-09-28T01:24:39Z

Codecov Report

Merging #564 (af6ad48) into feature/efficient-filtering (ce68025) will increase coverage by 0.38%.
The diff coverage is 97.22%.

@@                        Coverage Diff                        @@
##             feature/efficient-filtering     #564      +/-   ##
=================================================================
+ Coverage                          83.72%   84.10%   +0.38%     
- Complexity                          1035     1049      +14     
=================================================================
  Files                                148      149       +1     
  Lines                               4282     4316      +34     
  Branches                             379      384       +5     
=================================================================
+ Hits                                3585     3630      +45     
+ Misses                               520      509      -11     
  Partials                             177      177

Impacted Files	Coverage Δ
...rg/opensearch/knn/index/query/KNNQueryBuilder.java	`84.10% <92.85%> (+9.28%)`	⬆️
...va/org/opensearch/knn/index/KNNClusterContext.java	`100.00% <100.00%> (ø)`
...main/java/org/opensearch/knn/plugin/KNNPlugin.java	`100.00% <100.00%> (ø)`


navneet1v · 2022-09-28T17:55:36Z

src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java

@@ -181,6 +184,12 @@ protected void doWriteTo(StreamOutput out) throws IOException {
        out.writeString(fieldName);
        out.writeFloatArray(vector);
        out.writeInt(k);
+        if (filter != null) {
+            out.writeBoolean(true);


Is this boolean required? Can we somehow skip it?

Use writeOptionalNamedWriteable? I think it abstracts this logic.

Thanks Jack, will update to writeOptionalNamedWriteable.
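For readers unfamiliar with the pattern behind this suggestion, the following sketch shows what an "optional write" helper typically does: it writes a presence flag followed by the value, so callers no longer hand-roll the boolean. It uses plain `java.io` streams and a string payload as stand-ins; `OptionalFieldCodec` and its methods are hypothetical names, and this is an analogy rather than OpenSearch's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper illustrating the presence-flag pattern.
final class OptionalFieldCodec {
    // Writes true plus the value when present, or a single false when absent.
    static void writeOptional(DataOutputStream out, String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    // Reads the presence flag first and consumes the value only if it was written.
    static String readOptional(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    // Convenience for trying the pair end to end.
    static String roundTrip(String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                writeOptional(out, value);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return readOptional(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The flag is still on the wire; the helper only moves the bookkeeping out of the caller.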

navneet1v · 2022-09-28T17:59:59Z

src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java

@@ -109,6 +109,9 @@ public KNNQueryBuilder(StreamInput in) throws IOException {
            fieldName = in.readString();
            vector = in.readFloatArray();
            k = in.readInt();
+            if (in.readBoolean()) {
+                filter = in.readNamedWriteable(QueryBuilder.class);
+            }
        } catch (IOException ex) {
            throw new RuntimeException("[KNN] Unable to create KNNQueryBuilder: " + ex);


Can we fix this error as well? Concatenating the exception into the message loses the stack trace. I know it's not part of your change.

Suggested change

-            throw new RuntimeException("[KNN] Unable to create KNNQueryBuilder: " + ex);
+            throw new RuntimeException("[KNN] Unable to create KNNQueryBuilder ", ex);
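The difference between the two forms can be seen in a small standalone sketch. The class and method names (`ExceptionChainingDemo`, `concatenated`, `chained`) are hypothetical; only the two `RuntimeException` constructors are standard Java.

```java
import java.io.IOException;

// Hypothetical demo comparing string concatenation with cause chaining.
final class ExceptionChainingDemo {
    // Only ex.toString() ends up in the message; the original stack trace is dropped.
    static RuntimeException concatenated(IOException ex) {
        return new RuntimeException("[KNN] Unable to create KNNQueryBuilder: " + ex);
    }

    // The original exception is kept as the cause, so its stack trace is printed
    // under "Caused by:" when the wrapper is logged.
    static RuntimeException chained(IOException ex) {
        return new RuntimeException("[KNN] Unable to create KNNQueryBuilder", ex);
    }
}
```

With the chained form, `getCause()` returns the original `IOException`; with the concatenated form it returns `null`.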

heemin32 · 2022-09-28T18:17:00Z

src/test/java/org/opensearch/knn/index/query/KNNQueryBuilderTests.java

+
+        final KNNQueryBuilder knnQueryBuilder = new KNNQueryBuilder(FIELD_NAME, queryVector, K);
+
+        try (BytesStreamOutput output = new BytesStreamOutput()) {


Can we extract this block into a private method? The only difference I see is whether getFilter returns null or the expected filter, which can be handled with a parameter.

Let me think about a proper refactoring; it should be possible. I'm adding one more similar block, so the code duplication is definitely there.

Signed-off-by: Martin Gaievski <[email protected]>

heemin32 · 2022-09-30T15:57:38Z

src/test/java/org/opensearch/knn/index/query/KNNQueryBuilderTests.java

+        }
+    }
+
+    private void assertSerialization(final QueryBuilder deserializedQuery, boolean assertFilterIsNull) {


TERM_QUERY is decided by KNNQueryBuilder, but this method does not get the value from the KNNQueryBuilder.
I also see room for further optimization.

    public void testSerialization() throws Exception {
        testSerialization(Version.CURRENT, null);
        testSerialization(Version.CURRENT, TERM_QUERY);
        testSerialization(Version.V_2_3_0, null);
    }

Signed-off-by: Martin Gaievski <[email protected]>

heemin32 · 2022-10-04T19:12:28Z

src/main/java/org/opensearch/knn/index/KNNClusterContext.java

+        ImmutableOpenMap<String, DiscoveryNode> clusterDiscoveryNodes = ImmutableOpenMap.of();
+        log.debug("Reading cluster min version");
+        try {
+            clusterDiscoveryNodes = this.clusterService.state().getNodes().getNodes();


Is this method call cheap? If not, calling it every time we serialize or deserialize a query might degrade performance.

It should be cheap: the state is cached locally and updated asynchronously when nodes join or leave the cluster. I want to run perf tests on a multi-node cluster once this is merged to the feature branch and compare latencies with the previous solution.
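As a rough illustration of the computation under discussion, the minimum deployed version is just the smallest version among the known nodes. The sketch below assumes integer-encoded versions and a caller-supplied fallback for the case where no nodes are visible; `ClusterMinVersion` and `minVersion` are hypothetical names, and the fallback behavior is an assumption rather than something stated in this thread.

```java
import java.util.Collection;

// Hypothetical helper: lowest version among the nodes currently known to this node.
final class ClusterMinVersion {
    // Returns the smallest node version, or the fallback when the node list is empty.
    static int minVersion(Collection<Integer> nodeVersions, int fallback) {
        return nodeVersions.stream()
            .mapToInt(Integer::intValue)
            .min()
            .orElse(fallback);
    }
}
```

The scan is linear in the number of nodes and runs against state held in memory, so its cost should be small compared with the query itself; the perf tests mentioned above would be the way to confirm that.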

Signed-off-by: Martin Gaievski <[email protected]>

* Adding serialization/deserialization for filter field in Lucene knn query Signed-off-by: Martin Gaievski <[email protected]>


martin-gaievski added the Enhancements, 2.4.0, and feature branch labels Sep 28, 2022

martin-gaievski requested a review from a team September 28, 2022 01:10

navneet1v reviewed Sep 28, 2022


heemin32 reviewed Sep 28, 2022


martin-gaievski added 2 commits September 28, 2022 11:31

Adding serialization for KnnQueryBuilder

097506a

Signed-off-by: Martin Gaievski <[email protected]>

Adding version check for deserialization of filter field

c385555

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the feature/efficient-filtering-mg branch from bf7469c to c385555 Compare September 28, 2022 19:06

Fixed spotless, version for stream serialization

536ed11

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the feature/efficient-filtering-mg branch from 195f025 to 536ed11 Compare September 29, 2022 00:20

heemin32 reviewed Sep 30, 2022


martin-gaievski added 2 commits October 4, 2022 11:17

Adding min cluster deployed version

bd246df

Signed-off-by: Martin Gaievski <[email protected]>

Adjust for latest Lucene 9.4

e71da03

Signed-off-by: Martin Gaievski <[email protected]>

heemin32 reviewed Oct 4, 2022


Refactor unit test for better code reusage

fc4f02b

Signed-off-by: Martin Gaievski <[email protected]>

heemin32 approved these changes Oct 4, 2022


martin-gaievski force-pushed the feature/efficient-filtering-mg branch 8 times, most recently from 255edae to af6ad48 Compare October 5, 2022 05:13

Add cluster version check for search request

c314b11

Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski force-pushed the feature/efficient-filtering-mg branch from af6ad48 to c314b11 Compare October 5, 2022 15:49

junqiu-lei approved these changes Oct 5, 2022


martin-gaievski merged commit 1729748 into opensearch-project:feature/efficient-filtering Oct 5, 2022

martin-gaievski added a commit that referenced this pull request Oct 14, 2022

Adding serialization for filter field in KnnQueryBuilder (#564)

2e18ae8

* Adding serialization/deserialization for filter field in Lucene knn query Signed-off-by: Martin Gaievski <[email protected]>

martin-gaievski mentioned this pull request Oct 21, 2022

Merge efficient filtering from feature branch #588

Merged

1 task

heemin32 added v2.4.0 'Issues and PRs related to version v2.4.0' and removed 2.4.0 labels Nov 2, 2022


navneet1v Sep 28, 2022

jmazanec15 Sep 28, 2022

martin-gaievski Sep 28, 2022

navneet1v Sep 28, 2022

martin-gaievski Sep 28, 2022

heemin32 Sep 28, 2022

martin-gaievski Sep 28, 2022

heemin32 Sep 30, 2022

martin-gaievski Oct 4, 2022

heemin32 Oct 4, 2022

martin-gaievski Oct 4, 2022
