Allow nested knn field mapping when train model #1318

junqiu-lei · 2023-11-21T22:22:42Z

Description

This PR fixes a bug that occurred when training a k-NN model from a field that is not a top-level field in the OpenSearch field mappings. With this fix, users can pass a nested field path as training_field to train a model.

Example:

{
  "training_index": "train-index",
  "training_field": "a.train-field",
  "dimension": 8,
  "description": "My model description",
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {
      "nlist": 1,
      "nprobes": 2
    }
  }
}

Issues Resolved

Close #1293

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov · 2023-11-21T22:39:22Z

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (5e2f899) 85.13% compared to head (06647db) 85.15%.

Files	Patch %	Lines
...java/org/opensearch/knn/training/VectorReader.java	70.58%	3 Missing and 2 partials ⚠️
.../main/java/org/opensearch/knn/index/IndexUtil.java	93.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1318      +/-   ##
============================================
+ Coverage     85.13%   85.15%   +0.02%     
- Complexity     1210     1216       +6     
============================================
  Files           160      160              
  Lines          4931     4958      +27     
  Branches        449      457       +8     
============================================
+ Hits           4198     4222      +24     
- Misses          537      538       +1     
- Partials        196      198       +2


src/main/java/org/opensearch/knn/index/IndexUtil.java

src/test/java/org/opensearch/knn/index/IndexUtilTests.java

martin-gaievski · 2023-11-22T02:46:53Z

src/main/java/org/opensearch/knn/index/IndexUtil.java

+        for (Map.Entry<String, Object> entry : properties.entrySet()) {
+            Object value = entry.getValue();
+            if (value instanceof Map) {
+                Object result = getFieldMapping((Map<String, Object>) value, field);


If that is a recursive call, should we add a limit on the number of depth levels to avoid a stack overflow exception? We can set something like 20; that should be OK, I think.

There is an index setting that controls how much nesting is allowed in OpenSearch; we should use that.

index.mapping.nested_fields.limit : default value is 50.

Code ref: https://github.com/opensearch-project/OpenSearch/blob/b974dfbe7208f41357ec7e8e20b77fafa99997a1/server/src/main/java/org/opensearch/index/mapper/MapperService.java#L126

Since OpenSearch already limits nested fields, can we skip the limit here?

@junqiu-lei but this is a REST handler that will be hit by our code path and will not come via the OpenSearch search path. Hence we need to use the limits.

Added the nested field path length check. https://github.com/opensearch-project/k-NN/pull/1318/files#diff-a074263cc1a3244e0cd9c1dc1754ecb5f3bde6a6750cc9bec73e3b7c34c50e89R148

@navneet1v @martin-gaievski I don't think we need to add a check. We get the mapping from OpenSearch, so it will already have passed the nested limit check. A check on the user input string is not required because we are not going to recurse any further than the mapping allows us to recurse.

Let's make sure that we mention on top of this function that we don't need this check because OpenSearch already does it.

Added the description on top of the function.

navneet1v · 2023-11-22T05:54:17Z

@junqiu-lei can you add the successful case on how you tested this.

navneet1v · 2023-11-22T05:55:50Z

@junqiu-lei can we also add the IT for this fix.

navneet1v · 2023-11-22T05:58:48Z

src/main/java/org/opensearch/knn/index/IndexUtil.java

+     * @param field      The name of the field to retrieve.
+     * @return           The value of the field if found, or null if the field is not present in the map.
+     */
+    public static Object getFieldMapping(final Map<String, Object> properties, final String field) {


@junqiu-lei can you please provide some explanation on how this function is able to get the nested fields.

So, if I have a nested field, does the parameter field contain just the deepest field name?
Example: if I have an OpenSearch field like a.b.c.vectorfield,

then the parameter field is vectorField, right?

Yes, it can retrieve a field from deep nested mapping.

Updated the method description to have more details.

junqiu-lei · 2023-11-22T19:33:05Z

@junqiu-lei can we also add the IT for this fix.

Will add it

junqiu-lei · 2023-11-29T18:28:15Z

@junqiu-lei can we also add the IT for this fix.

Will add it

@navneet1v IT test added

martin-gaievski · 2023-11-29T18:40:05Z

src/main/java/org/opensearch/knn/index/IndexUtil.java

+
+        // Check filed path length is valid
+        if (fieldPaths.length == 0 || fieldPaths.length > nestedFieldMaxLimit) {
+            exception.addValidationError(String.format("Field path length \"%s\" is invalid, it should > 0 and <= %d",


We also need to add a locale argument; root should be good: String.format(Locale.ROOT, ...
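For illustration, a minimal sketch of the locale suggestion above. The class and method names (FieldPathMessages, invalidPathLength) are hypothetical and not taken from the PR; only String.format(Locale.ROOT, ...) is the point. Passing Locale.ROOT keeps the formatted digits independent of the JVM's default locale.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: builds the validation message
// with an explicit locale so the output is the same on every JVM.
final class FieldPathMessages {
    private FieldPathMessages() {}

    static String invalidPathLength(int actualLength, int maxLength) {
        return String.format(
            Locale.ROOT,
            "Field path length \"%d\" is invalid, it should be > 0 and <= %d",
            actualLength,
            maxLength
        );
    }
}
```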

src/main/java/org/opensearch/knn/training/VectorReader.java

jmazanec15 · 2023-12-01T17:45:33Z

src/main/java/org/opensearch/knn/index/IndexUtil.java

+     * @param fieldPaths The field path list that make up the path to the field mapping.
+     * @return           The value of the field if found, or null if the field is not present in the map.
+     */
+    public static Object getFieldMapping(final Map<String, Object> properties, final String[] fieldPaths) {


Can you search if similar code exists in OpenSearch core that we could re-use instead of implementing on our own?

Also, this function comment isn't quite clear to me. Can you add more details in the comment?

@jmazanec15 I tried but didn't find a method we can use directly. Will update the function comment.

Updated function comment

Got it. I saw this method in index-management: https://github.com/opensearch-project/index-management/blob/417d0d9c3ac630b720081f3ea383dea26f4a6456/src/main/kotlin/org/opensearch/indexmanagement/util/IndexUtils.kt#L180-L188. It's in Kotlin, but I think it's a good reference.

I would say that we should have users pass in a single field name instead of final String[] fieldPaths. Then, in this function, we should do the splitting. This will allow users to not have to worry about whether they are dealing with a nested or non-nested field.

Having the splitting logic in this function makes sense when this function is public. I don't see any reason for making this function public; I would recommend keeping the interface as it is and making the function private.

Ref: #1318 (comment)

Updated the function based on the feedback.
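A minimal sketch of the single-field-name interface discussed in this thread, not the code merged in this PR. The class name MappingLookup is hypothetical. It assumes the input is the content of a mapping's "properties" object and that child fields of an object or nested field sit under that field's own "properties" key. The loop is iterative and runs at most once per path segment, so its depth is bounded by the path the caller passes in.

```java
import java.util.Map;

// Hypothetical sketch: callers pass one dotted field name and the
// splitting happens inside the lookup.
final class MappingLookup {
    private MappingLookup() {}

    /**
     * Returns the mapping of a top-level or nested field, or null if any
     * segment of the path is missing from the mapping.
     */
    @SuppressWarnings("unchecked")
    static Object getFieldMapping(Map<String, Object> properties, String fieldPath) {
        if (properties == null || fieldPath == null || fieldPath.isEmpty()) {
            return null;
        }
        String[] parts = fieldPath.split("\\.");
        Map<String, Object> current = properties;
        for (int i = 0; i < parts.length; i++) {
            Object mapping = current.get(parts[i]);
            if (i == parts.length - 1) {
                return mapping; // last segment: this is the field's own mapping (or null)
            }
            if (!(mapping instanceof Map)) {
                return null; // intermediate segment is missing or not an object
            }
            Object children = ((Map<String, Object>) mapping).get("properties");
            if (!(children instanceof Map)) {
                return null; // intermediate field has no child fields
            }
            current = (Map<String, Object>) children;
        }
        return null; // only reached when the path has no segments, e.g. "."
    }
}
```

With this shape, a caller passes "a.train-field" or "train-field" the same way and does not need to know whether the field is nested.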

src/main/java/org/opensearch/knn/training/VectorReader.java

jmazanec15 · 2023-12-04T23:46:29Z

src/main/java/org/opensearch/knn/index/IndexUtil.java

src/main/java/org/opensearch/knn/training/VectorReader.java

navneet1v · 2023-12-06T08:00:04Z

src/main/java/org/opensearch/knn/training/VectorReader.java

+                String[] fieldPath = fieldName.split("\\.");
+
+                for (int pathPart = 0; pathPart < fieldPath.length - 1; pathPart++) {
+                    currentMap = (Map<String, Object>) currentMap.get(fieldPath[pathPart]);


We should check that currentMap.get(fieldPath[pathPart]) != null and only then cast the value; otherwise a NullPointerException is possible.

Offline synced with Navneet, added checks in updated code
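As an illustration of the null check requested above, here is a minimal sketch under stated assumptions, not the updated VectorReader code. SourceVectorExtractor and extractVector are hypothetical names, the document source is assumed to be a plain nested Map, and fieldName is assumed to be non-null. Every step is guarded with instanceof before casting, so a missing path segment yields null instead of a NullPointerException or ClassCastException.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: walks a document source along a dotted field path
// and converts the value found there to a Float[] vector.
final class SourceVectorExtractor {
    private SourceVectorExtractor() {}

    /** Returns the vector at fieldName, or null if the path or vector is absent. */
    static Float[] extractVector(Map<String, Object> source, String fieldName) {
        String[] fieldPath = fieldName.split("\\.");
        Object current = source;
        for (String part : fieldPath) {
            if (!(current instanceof Map)) {
                return null; // parent is missing or is not an object
            }
            current = ((Map<?, ?>) current).get(part);
        }
        if (!(current instanceof List)) {
            return null; // no vector stored under this path
        }
        List<?> values = (List<?>) current;
        Float[] vector = new Float[values.size()];
        for (int i = 0; i < vector.length; i++) {
            Object value = values.get(i);
            if (!(value instanceof Number)) {
                return null; // null or non-numeric element
            }
            vector[i] = ((Number) value).floatValue();
        }
        return vector;
    }
}
```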

navneet1v · 2023-12-06T08:00:20Z

src/main/java/org/opensearch/knn/training/VectorReader.java

+                    currentMap = (Map<String, Object>) currentMap.get(fieldPath[pathPart]);
+                }
+
+                List<Number> fieldList = (List<Number>) currentMap.get(fieldPath[fieldPath.length - 1]);


same as above.

replied above

navneet1v · 2023-12-06T08:00:53Z

src/main/java/org/opensearch/knn/training/VectorReader.java

+
+                List<Number> fieldList = (List<Number>) currentMap.get(fieldPath[fieldPath.length - 1]);
+
+                trainingData.add(fieldList.stream().map(Number::floatValue).toArray(Float[]::new));


filter out the null objects from this stream otherwise there can be NullPointerExceptions.

Synced with Navneet: null values are already rejected when the data is ingested, so we don't need a filter here.

src/main/java/org/opensearch/knn/index/IndexUtil.java

jmazanec15 · 2023-12-06T22:47:58Z

src/main/java/org/opensearch/knn/training/VectorReader.java

+
+                for (int pathPart = 0; pathPart < fieldPath.length - 1; pathPart++) {
+                    if (currentMap.get(fieldPath[pathPart]) == null) {
+                        logger.warn("Field path {} does not exist in document", fieldName);


This could get logged 1000s or millions of times because it's per vector. I'm just wondering if we should either aggregate a log message outside of this loop with a summary of the non-existent fields or switch it to debug mode.

You're right. Since the field path is validated by IndexUtil.validateKnnField before it is used here, I removed this warn log. (Offline synced with Navneet)

jmazanec15 · 2023-12-06T22:48:13Z

src/main/java/org/opensearch/knn/training/VectorReader.java

+                }
+
+                if (currentMap.get(fieldPath[fieldPath.length - 1]) instanceof List<?> == false) {
+                    logger.warn("No vectors found for field {} in doc {}", fieldName, hits[vector].getId());


Same as above on how often this could get logged.

Updated to an aggregated log after the for loop, so that we know the total count of null docs, if any.
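A minimal sketch of the aggregated-log pattern described above, not the merged code. BatchVectorCollector and collect are hypothetical names, java.util.logging stands in for whatever logger the plugin uses, and the field is looked up at the top level only to keep the example short. The loop just counts documents without a vector, and a single warning is emitted after it.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch: count missing vectors inside the loop and log once
// after it, instead of logging once per document.
final class BatchVectorCollector {
    private static final Logger LOGGER = Logger.getLogger(BatchVectorCollector.class.getName());

    private BatchVectorCollector() {}

    /** Adds every list-valued field to trainingData and returns the number of skipped docs. */
    static int collect(List<Map<String, Object>> docs, String fieldName, List<List<?>> trainingData) {
        int missing = 0;
        for (Map<String, Object> doc : docs) {
            Object value = doc.get(fieldName);
            if (value instanceof List) {
                trainingData.add((List<?>) value);
            } else {
                missing++; // no per-document log line here
            }
        }
        if (missing > 0) {
            LOGGER.warning(
                String.format(
                    Locale.ROOT,
                    "Found %d of %d documents without vectors for field %s",
                    missing,
                    docs.size(),
                    fieldName
                )
            );
        }
        return missing;
    }
}
```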

Signed-off-by: Junqiu Lei <[email protected]>

jmazanec15

LGTM thanks @junqiu-lei !

jmazanec15 · 2023-12-07T18:01:02Z

@junqiu-lei add backport to 2.x label

junqiu-lei added bug Something isn't working v2.12.0 labels Nov 21, 2023

junqiu-lei requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, martin-gaievski and ryanbogan as code owners November 21, 2023 22:22

junqiu-lei force-pushed the fix_nested_field_mapping branch from 71c45a5 to 30ddc20 Compare November 21, 2023 22:41

martin-gaievski reviewed Nov 21, 2023

junqiu-lei force-pushed the fix_nested_field_mapping branch from 30ddc20 to 134274e Compare November 22, 2023 01:07

junqiu-lei requested a review from martin-gaievski November 22, 2023 01:13

martin-gaievski reviewed Nov 22, 2023

navneet1v reviewed Nov 22, 2023

junqiu-lei force-pushed the fix_nested_field_mapping branch from 134274e to 0d33bae Compare November 22, 2023 19:31

junqiu-lei force-pushed the fix_nested_field_mapping branch from 0d33bae to be741f1 Compare November 29, 2023 18:27

junqiu-lei self-assigned this Nov 29, 2023

junqiu-lei requested review from navneet1v and martin-gaievski November 29, 2023 18:32

junqiu-lei force-pushed the fix_nested_field_mapping branch from be741f1 to 32b0e7d Compare November 29, 2023 18:36

junqiu-lei changed the title ~~Fixed field value from nested mapping when train model~~ Allow nested knn field mapping when train model Nov 29, 2023

martin-gaievski reviewed Nov 29, 2023

jmazanec15 reviewed Dec 1, 2023

junqiu-lei dismissed martin-gaievski’s stale review via df94d65 December 4, 2023 19:26

junqiu-lei force-pushed the fix_nested_field_mapping branch from ac80de6 to df94d65 Compare December 4, 2023 19:26

junqiu-lei requested review from navneet1v, jmazanec15 and martin-gaievski December 4, 2023 19:30

jmazanec15 reviewed Dec 5, 2023

junqiu-lei force-pushed the fix_nested_field_mapping branch 2 times, most recently from 5965901 to 1d4217d Compare December 5, 2023 23:11

junqiu-lei requested a review from jmazanec15 December 5, 2023 23:18

navneet1v reviewed Dec 6, 2023

jmazanec15 reviewed Dec 6, 2023

junqiu-lei force-pushed the fix_nested_field_mapping branch from 1d4217d to 90c65f0 Compare December 6, 2023 21:19

jmazanec15 reviewed Dec 6, 2023

Allow nested knn field mapping when train model

06647db

Signed-off-by: Junqiu Lei <[email protected]>

junqiu-lei force-pushed the fix_nested_field_mapping branch from 90c65f0 to 06647db Compare December 6, 2023 23:24

junqiu-lei requested review from jmazanec15 and navneet1v December 7, 2023 17:55

jmazanec15 approved these changes Dec 7, 2023

junqiu-lei added the backport 2.x label Dec 7, 2023

navneet1v approved these changes Dec 7, 2023

junqiu-lei merged commit 2e3ab95 into opensearch-project:main Dec 7, 2023
52 of 53 checks passed

opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 7, 2023

Allow nested knn field mapping when train model (#1318)

1a8dca8

Signed-off-by: Junqiu Lei <[email protected]> (cherry picked from commit 2e3ab95)

opensearch-trigger-bot bot mentioned this pull request Dec 7, 2023

[Backport 2.x] Allow nested knn field mapping when train model #1338

Closed

junqiu-lei mentioned this pull request Dec 7, 2023

[Manually backport 2.x] Allow nested knn field mapping when train model #1339

Merged

5 tasks

junqiu-lei added a commit that referenced this pull request Dec 7, 2023

Allow nested knn field mapping when train model (#1318) (#1339)

06d52d5

(cherry picked from commit 2e3ab95) Signed-off-by: Junqiu Lei <[email protected]>

