-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow nested knn field mapping when train model #1318
Allow nested knn field mapping when train model #1318
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #1318 +/- ##
============================================
+ Coverage 85.13% 85.15% +0.02%
- Complexity 1210 1216 +6
============================================
Files 160 160
Lines 4931 4958 +27
Branches 449 457 +8
============================================
+ Hits 4198 4222 +24
- Misses 537 538 +1
- Partials 196 198 +2 ☔ View full report in Codecov by Sentry. |
71c45a5
to
30ddc20
Compare
30ddc20
to
134274e
Compare
for (Map.Entry<String, Object> entry : properties.entrySet()) { | ||
Object value = entry.getValue(); | ||
if (value instanceof Map) { | ||
Object result = getFieldMapping((Map<String, Object>) value, field); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if that is a recursive call should we add a limit on number of depth levels to avoid stack overflow exception? We can set something like 20, that should be ok I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is an index setting that tells how much nesting is allowed in Opensearch we should use that.
index.mapping.nested_fields.limit
: default value is 50.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the nested fields are limited from OpenSearch firstly, can we ignore the limit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@junqiu-lei but this is a rest handler that will be hit by our code path and will not come via Opensearch search path. Hence we need to use the limits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the nested field path length check. https://github.com/opensearch-project/k-NN/pull/1318/files#diff-a074263cc1a3244e0cd9c1dc1754ecb5f3bde6a6750cc9bec73e3b7c34c50e89R148
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@navneet1v @martin-gaievski I dont think we need to add a check. We get the mapping from OpenSearch so it will already have passed the nested limit check. Adding a check on the user input string will not be required because we are not going to recurse any further than the mapping allows us to recurse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make sure that we mention this on top of this function that we don't need this check because Opensearch already does this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the valid description on top of the function
@junqiu-lei can you add the successful case on how you tested this. |
@junqiu-lei can we also add the IT for this fix. |
* @param field The name of the field to retrieve. | ||
* @return The value of the field if found, or null if the field is not present in the map. | ||
*/ | ||
public static Object getFieldMapping(final Map<String, Object> properties, final String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@junqiu-lei can you please provide some explanation on how this function is able to get the nested fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I have a nested field, then in the parameter field
we are just getting the deeply nested value?
example: if I have opensearch field like a.b.c.vectorfield.
then then parameter field
is vectorField
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can retrieve a field from deep nested mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the method description to have more details.
This comment was marked as outdated.
This comment was marked as outdated.
134274e
to
0d33bae
Compare
Will add it |
0d33bae
to
be741f1
Compare
@navneet1v IT test added |
be741f1
to
32b0e7d
Compare
|
||
// Check filed path length is valid | ||
if (fieldPaths.length == 0 || fieldPaths.length > nestedFieldMaxLimit) { | ||
exception.addValidationError(String.format("Field path length \"%s\" is invalid, it should > 0 and <= %d", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we also need to add locale argument, root should be good String.format(Locale.ROOT,...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
* @param fieldPaths The field path list that make up the path to the field mapping. | ||
* @return The value of the field if found, or null if the field is not present in the map. | ||
*/ | ||
public static Object getFieldMapping(final Map<String, Object> properties, final String[] fieldPaths) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you search if similar code exists in OpenSearch core that we could re-use instead of implementing on our own?
Also, this function comment isnt quite clear to me. Can you add more details in the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmazanec15 I tried but didn’t find method we can directly use. Will update function comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated function comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I saw this method in index-management: https://github.com/opensearch-project/index-management/blob/417d0d9c3ac630b720081f3ea383dea26f4a6456/src/main/kotlin/org/opensearch/indexmanagement/util/IndexUtils.kt#L180-L188. Its in kotlin, but I think its a good reference.
I would say that we should have users pass in a single field name instead of final String[] fieldPaths
. Then, in this function, we should do the splitting. This will allow users to not have to worry if they are dealing with nested or non-nested field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having a splitting logic make sense in this function we this function is public. I don't see any reason for making this function public, I would recommend keeping the interface as it and make the function private.
Ref: #1318 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the function from feedback
ac80de6
to
df94d65
Compare
* @param fieldPaths The field path list that make up the path to the field mapping. | ||
* @return The value of the field if found, or null if the field is not present in the map. | ||
*/ | ||
public static Object getFieldMapping(final Map<String, Object> properties, final String[] fieldPaths) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I saw this method in index-management: https://github.com/opensearch-project/index-management/blob/417d0d9c3ac630b720081f3ea383dea26f4a6456/src/main/kotlin/org/opensearch/indexmanagement/util/IndexUtils.kt#L180-L188. Its in kotlin, but I think its a good reference.
I would say that we should have users pass in a single field name instead of final String[] fieldPaths
. Then, in this function, we should do the splitting. This will allow users to not have to worry if they are dealing with nested or non-nested field.
5965901
to
1d4217d
Compare
String[] fieldPath = fieldName.split("\\."); | ||
|
||
for (int pathPart = 0; pathPart < fieldPath.length - 1; pathPart++) { | ||
currentMap = (Map<String, Object>) currentMap.get(fieldPath[pathPart]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should check if currentMap.get(fieldPath[pathPart]) != null and then only type cast the value otherwise there is a possibility of NullPointerException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offline synced with Navneet, added checks in updated code
currentMap = (Map<String, Object>) currentMap.get(fieldPath[pathPart]); | ||
} | ||
|
||
List<Number> fieldList = (List<Number>) currentMap.get(fieldPath[fieldPath.length - 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replied above
|
||
List<Number> fieldList = (List<Number>) currentMap.get(fieldPath[fieldPath.length - 1]); | ||
|
||
trainingData.add(fieldList.stream().map(Number::floatValue).toArray(Float[]::new)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filter out the null objects from this stream otherwise there can be NullPointerExceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with Navneet, because the null object check have been passed when ingesting data, so we don't need filter here.
1d4217d
to
90c65f0
Compare
|
||
for (int pathPart = 0; pathPart < fieldPath.length - 1; pathPart++) { | ||
if (currentMap.get(fieldPath[pathPart]) == null) { | ||
logger.warn("Field path {} does not exist in document", fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could get logged 1000s or millions of times because its per vector. Im just wondering if we should either aggregate a log message outside of this loop with a summary of the non-existent fields or switch it to debug mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Since the filed path will anyway be validated from IndexUtil.validateKnnField
before use here, removed this warn log. (Offline synced with Navneet)
} | ||
|
||
if (currentMap.get(fieldPath[fieldPath.length - 1]) instanceof List<?> == false) { | ||
logger.warn("No vectors found for field {} in doc {}", fieldName, hits[vector].getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above on how often this could get logged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to aggregated log after for loop, so that we can know the total count of null docs if exists.
Signed-off-by: Junqiu Lei <[email protected]>
90c65f0
to
06647db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks @junqiu-lei !
@junqiu-lei add backport to 2.x label |
Signed-off-by: Junqiu Lei <[email protected]> (cherry picked from commit 2e3ab95)
) Signed-off-by: Junqiu Lei <[email protected]> (cherry picked from commit 2e3ab95) Signed-off-by: Junqiu Lei <[email protected]>
(cherry picked from commit 2e3ab95) Signed-off-by: Junqiu Lei <[email protected]>
Description
This PR fixed the bug when trying to train a knn model from a field that is not a top-level field in the OpenSearch field mappings. This fix allows user to pass a nested filed path at
training_field
to train model.Example:
Issues Resolved
Close #1293
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.