You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When querying and using the nmslib engine, we ran into a segfault with the following error:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f4a9aa3e4a4, pid=26, tid=1428
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.6+10 (17.0.6+10) (build 17.0.6+10)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (17.0.6+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libopensearchknn_nmslib.so+0xa14a4] similarity::Hnsw<float>::Search(similarity::KNNQuery<float>*, int) const+0x4
#
# Core dump will be written. Default location: /opt/amazon/distribution/local/opensearch-1.0.0/core.26
#
# An error report file with more information is saved as:
# logs/hs_err_pid26.log
#
# If you would like to submit a bug report, please visit:
# https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
this->data_ is actually a dummy variable that does not store any data. This is because for nmslib the graph construction is optimized to contain vectors interleaved with level 1 neighbors. Never the less, it is used here as an empty check.
The commit where this was introduced was a refactoring I had done awhile ago related to clang tidy warnings: opendistro-for-elasticsearch/k-NN@7640186. Im still digging into why I made this change.
Suggested fix
Just move data back from local to struct variable. This will tie the lifetime of data with the IndexWrapper.
Open Questions
This error never seems to occur. We have run millions of queries without seeing any segfaults, but it seems like a pretty clear this is a bug. The question is why don't we see this issue more frequently. A couple guesses:
The stack memory is very rarely reclaimed by OS and overwritten by process - the former seems reasonable, but the latter seems a little bit more far-fetched. However, it is possible that the call this->_data.empty() typically will return false when junk is provided in the bits. This would explain infrequency of seeing this error.
The compiler optimizes away this pass by reference. Seems unlikely but not sure.
Another open question is why I made the change in the first place. Seems clang-tidy may have suggested it or I may have misinterpreted clang-tidy warnings. I will investigate this as well.
Will need to do more investigation here.
The text was updated successfully, but these errors were encountered:
Description
When querying and using the nmslib engine, we ran into a segfault with the following error:
This maps to this line in nmslib: https://github.com/nmslib/nmslib/blob/master/similarity_search/src/method/hnsw.cc#L721. In the disassembled lib, we see:
which suggests its segfaulting on call to:
A good explanation for this is that we are passing in a local variable for
this->data_
to an arg that requires a reference: https://github.com/opensearch-project/k-NN/blob/main/jni/include/nmslib_wrapper.h#L51. This creates a use-after-free scenario where the variable is popped off the stack but then referenced later.this->data_
is actually a dummy variable that does not store any data. This is because for nmslib the graph construction is optimized to contain vectors interleaved with level 1 neighbors. Never the less, it is used here as an empty check.The commit where this was introduced was a refactoring I had done awhile ago related to clang tidy warnings: opendistro-for-elasticsearch/k-NN@7640186. Im still digging into why I made this change.
Suggested fix
Just move data back from local to struct variable. This will tie the lifetime of data with the IndexWrapper.
Open Questions
This error never seems to occur. We have run millions of queries without seeing any segfaults, but it seems like a pretty clear this is a bug. The question is why don't we see this issue more frequently. A couple guesses:
this->_data.empty()
typically will return false when junk is provided in the bits. This would explain infrequency of seeing this error.Another open question is why I made the change in the first place. Seems clang-tidy may have suggested it or I may have misinterpreted clang-tidy warnings. I will investigate this as well.
Will need to do more investigation here.
The text was updated successfully, but these errors were encountered: