Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow nested knn field mapping when train model #1318

Merged

Conversation

junqiu-lei
Copy link
Member

@junqiu-lei junqiu-lei commented Nov 21, 2023

Description

This PR fixed the bug when trying to train a knn model from a field that is not a top-level field in the OpenSearch field mappings. This fix allows user to pass a nested filed path at training_field to train model.

Example:

{
  "training_index": "train-index",
  "training_field": "a.train-field",
  "dimension": 8,
  "description": "My model description",
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {
      "nlist": 1,
      "nprobes": 2
    }
  }
}

Issues Resolved

Close #1293

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link

codecov bot commented Nov 21, 2023

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (5e2f899) 85.13% compared to head (06647db) 85.15%.

Files Patch % Lines
...java/org/opensearch/knn/training/VectorReader.java 70.58% 3 Missing and 2 partials ⚠️
.../main/java/org/opensearch/knn/index/IndexUtil.java 93.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1318      +/-   ##
============================================
+ Coverage     85.13%   85.15%   +0.02%     
- Complexity     1210     1216       +6     
============================================
  Files           160      160              
  Lines          4931     4958      +27     
  Branches        449      457       +8     
============================================
+ Hits           4198     4222      +24     
- Misses          537      538       +1     
- Partials        196      198       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@junqiu-lei junqiu-lei force-pushed the fix_nested_field_mapping branch from 71c45a5 to 30ddc20 Compare November 21, 2023 22:41
for (Map.Entry<String, Object> entry : properties.entrySet()) {
Object value = entry.getValue();
if (value instanceof Map) {
Object result = getFieldMapping((Map<String, Object>) value, field);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that is a recursive call should we add a limit on number of depth levels to avoid stack overflow exception? We can set something like 20, that should be ok I think

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an index setting that tells how much nesting is allowed in Opensearch we should use that.

index.mapping.nested_fields.limit : default value is 50.

Code ref: https://github.com/opensearch-project/OpenSearch/blob/b974dfbe7208f41357ec7e8e20b77fafa99997a1/server/src/main/java/org/opensearch/index/mapper/MapperService.java#L126

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the nested fields are limited from OpenSearch firstly, can we ignore the limit here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junqiu-lei but this is a rest handler that will be hit by our code path and will not come via Opensearch search path. Hence we need to use the limits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@navneet1v @martin-gaievski I dont think we need to add a check. We get the mapping from OpenSearch so it will already have passed the nested limit check. Adding a check on the user input string will not be required because we are not going to recurse any further than the mapping allows us to recurse.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make sure that we mention this on top of this function that we don't need this check because Opensearch already does this check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the valid description on top of the function

@navneet1v
Copy link
Collaborator

@junqiu-lei can you add the successful case on how you tested this.

@navneet1v
Copy link
Collaborator

@junqiu-lei can we also add the IT for this fix.

* @param field The name of the field to retrieve.
* @return The value of the field if found, or null if the field is not present in the map.
*/
public static Object getFieldMapping(final Map<String, Object> properties, final String field) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junqiu-lei can you please provide some explanation on how this function is able to get the nested fields.

Copy link
Collaborator

@navneet1v navneet1v Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I have a nested field, then in the parameter field we are just getting the deeply nested value?
example: if I have opensearch field like a.b.c.vectorfield.

then then parameter field is vectorField right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can retrieve a field from deep nested mapping.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the method description to have more details.

@junqiu-lei

This comment was marked as outdated.

@junqiu-lei junqiu-lei force-pushed the fix_nested_field_mapping branch from 134274e to 0d33bae Compare November 22, 2023 19:31
@junqiu-lei
Copy link
Member Author

junqiu-lei commented Nov 22, 2023

@junqiu-lei can we also add the IT for this fix.

Will add it

@junqiu-lei junqiu-lei force-pushed the fix_nested_field_mapping branch from 0d33bae to be741f1 Compare November 29, 2023 18:27
@junqiu-lei
Copy link
Member Author

@junqiu-lei can we also add the IT for this fix.

Will add it

@navneet1v IT test added

@junqiu-lei junqiu-lei self-assigned this Nov 29, 2023
@junqiu-lei junqiu-lei force-pushed the fix_nested_field_mapping branch from be741f1 to 32b0e7d Compare November 29, 2023 18:36
@junqiu-lei junqiu-lei changed the title Fixed field value from nested mapping when train model Allow nested knn field mapping when train model Nov 29, 2023

// Check filed path length is valid
if (fieldPaths.length == 0 || fieldPaths.length > nestedFieldMaxLimit) {
exception.addValidationError(String.format("Field path length \"%s\" is invalid, it should > 0 and <= %d",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need to add locale argument, root should be good String.format(Locale.ROOT,...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

* @param fieldPaths The field path list that make up the path to the field mapping.
* @return The value of the field if found, or null if the field is not present in the map.
*/
public static Object getFieldMapping(final Map<String, Object> properties, final String[] fieldPaths) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you search if similar code exists in OpenSearch core that we could re-use instead of implementing on our own?

Also, this function comment isnt quite clear to me. Can you add more details in the comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmazanec15 I tried but didn’t find method we can directly use. Will update function comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated function comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I saw this method in index-management: https://github.com/opensearch-project/index-management/blob/417d0d9c3ac630b720081f3ea383dea26f4a6456/src/main/kotlin/org/opensearch/indexmanagement/util/IndexUtils.kt#L180-L188. Its in kotlin, but I think its a good reference.

I would say that we should have users pass in a single field name instead of final String[] fieldPaths. Then, in this function, we should do the splitting. This will allow users to not have to worry if they are dealing with nested or non-nested field.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having a splitting logic make sense in this function we this function is public. I don't see any reason for making this function public, I would recommend keeping the interface as it and make the function private.

Ref: #1318 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the function from feedback

* @param fieldPaths The field path list that make up the path to the field mapping.
* @return The value of the field if found, or null if the field is not present in the map.
*/
public static Object getFieldMapping(final Map<String, Object> properties, final String[] fieldPaths) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I saw this method in index-management: https://github.com/opensearch-project/index-management/blob/417d0d9c3ac630b720081f3ea383dea26f4a6456/src/main/kotlin/org/opensearch/indexmanagement/util/IndexUtils.kt#L180-L188. Its in kotlin, but I think its a good reference.

I would say that we should have users pass in a single field name instead of final String[] fieldPaths. Then, in this function, we should do the splitting. This will allow users to not have to worry if they are dealing with nested or non-nested field.

src/main/java/org/opensearch/knn/index/IndexUtil.java Outdated Show resolved Hide resolved
@junqiu-lei junqiu-lei force-pushed the fix_nested_field_mapping branch 2 times, most recently from 5965901 to 1d4217d Compare December 5, 2023 23:11
@junqiu-lei junqiu-lei requested a review from jmazanec15 December 5, 2023 23:18
String[] fieldPath = fieldName.split("\\.");

for (int pathPart = 0; pathPart < fieldPath.length - 1; pathPart++) {
currentMap = (Map<String, Object>) currentMap.get(fieldPath[pathPart]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check if currentMap.get(fieldPath[pathPart]) != null and then only type cast the value otherwise there is a possibility of NullPointerException.

Copy link
Member Author

@junqiu-lei junqiu-lei Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offline synced with Navneet, added checks in updated code

currentMap = (Map<String, Object>) currentMap.get(fieldPath[pathPart]);
}

List<Number> fieldList = (List<Number>) currentMap.get(fieldPath[fieldPath.length - 1]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replied above


List<Number> fieldList = (List<Number>) currentMap.get(fieldPath[fieldPath.length - 1]);

trainingData.add(fieldList.stream().map(Number::floatValue).toArray(Float[]::new));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter out the null objects from this stream otherwise there can be NullPointerExceptions.

Copy link
Member Author

@junqiu-lei junqiu-lei Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced with Navneet, because the null object check have been passed when ingesting data, so we don't need filter here.

@junqiu-lei junqiu-lei force-pushed the fix_nested_field_mapping branch from 1d4217d to 90c65f0 Compare December 6, 2023 21:19

for (int pathPart = 0; pathPart < fieldPath.length - 1; pathPart++) {
if (currentMap.get(fieldPath[pathPart]) == null) {
logger.warn("Field path {} does not exist in document", fieldName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could get logged 1000s or millions of times because its per vector. Im just wondering if we should either aggregate a log message outside of this loop with a summary of the non-existent fields or switch it to debug mode.

Copy link
Member Author

@junqiu-lei junqiu-lei Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Since the filed path will anyway be validated from IndexUtil.validateKnnField before use here, removed this warn log. (Offline synced with Navneet)

}

if (currentMap.get(fieldPath[fieldPath.length - 1]) instanceof List<?> == false) {
logger.warn("No vectors found for field {} in doc {}", fieldName, hits[vector].getId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above on how often this could get logged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to aggregated log after for loop, so that we can know the total count of null docs if exists.

@junqiu-lei junqiu-lei force-pushed the fix_nested_field_mapping branch from 90c65f0 to 06647db Compare December 6, 2023 23:24
Copy link
Member

@jmazanec15 jmazanec15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @junqiu-lei !

@jmazanec15
Copy link
Member

@junqiu-lei add backport to 2.x label

@junqiu-lei junqiu-lei merged commit 2e3ab95 into opensearch-project:main Dec 7, 2023
52 of 53 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 7, 2023
Signed-off-by: Junqiu Lei <[email protected]>
(cherry picked from commit 2e3ab95)
junqiu-lei added a commit to junqiu-lei/k-NN that referenced this pull request Dec 7, 2023
)

Signed-off-by: Junqiu Lei <[email protected]>
(cherry picked from commit 2e3ab95)
Signed-off-by: Junqiu Lei <[email protected]>
junqiu-lei added a commit that referenced this pull request Dec 7, 2023
(cherry picked from commit 2e3ab95)

Signed-off-by: Junqiu Lei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x bug Something isn't working v2.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Creating a model from a non top level field causes a 'field does not exist' error
4 participants