[BUG] [class_not_found_exception] during rolling upgrade on security enabled cluster #1259
Comments
All calls to OpenSearch node will fail due to this. |
I took a look at it. Moving this to security to understand the problem. |
We were able to reproduce the issue. Investigating the root cause. |
The issue only happens in a mixed-cluster scenario (the cluster has both ES and OpenSearch nodes). The renamed namespace( |
How do we fix this? cc: @nknize |
The issue only happens in a mixed cluster. With all nodes upgraded from ES to OpenSearch, the issue doesn't happen. |
This definitely helps. In the worst case, we could say the new nodes are not reachable during the upgrade, but once all the nodes are upgraded it works. On the other hand, we need to understand why the class name identifier is passed through the transport while serializing and deserializing. |
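For context on why the class name travels with the payload: standard Java serialization writes the fully qualified class name into the stream, and the reader resolves that name reflectively. The sketch below only illustrates that general behavior and is not the plugin's code; `ClassNameOnTheWire`, its nested `User`, and the helper names are all hypothetical. It simulates a peer that lacks the class by overwriting the embedded name with an unknown one of the same length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ClassNameOnTheWire {
    // Hypothetical stand-in for a plugin class that is sent between nodes.
    static class User implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        User(String name) { this.name = name; }
    }

    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // ISO-8859-1 maps every byte to exactly one char, so a substring search is safe.
    static boolean containsClassName(byte[] stream, String className) {
        return new String(stream, StandardCharsets.ISO_8859_1).contains(className);
    }

    // Simulates a sender whose class lives under a name the receiver does not have,
    // by overwriting the embedded name with a same-length replacement.
    static byte[] renameInStream(byte[] stream, String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("names must have equal length");
        }
        String text = new String(stream, StandardCharsets.ISO_8859_1);
        return text.replace(from, to).getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns "ok" if the stream can be read, else the exception's simple name.
    static String readOutcome(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            in.readObject();
            return "ok";
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] stream = serialize(new User("admin"));
        String name = User.class.getName();
        System.out.println(containsClassName(stream, name));
        System.out.println(readOutcome(stream));
        System.out.println(readOutcome(renameInStream(stream, name, name.replace("User", "Gone"))));
    }
}
```

The last call fails with a `ClassNotFoundException` because the receiving side looks the name up at read time, which matches the symptom in the issue title.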
I tested a mixed cluster with 1 OpenSearch master node and 2 Elasticsearch data nodes. node1: OpenSearch (localhost:9200). A REST request to the master node works fine.
A REST request to a data node that involves a transport request fails.
As we can see from the exception stack trace
The root cause comes from here |
Another test case: On node 1 and node 2:
The error happens only on node 3, and there it is the OPENSEARCH user class that is not found. In comparison, the error message in the test case above says the OPENDISTRO user class is not found:
|
Could you create a test build of the OpenSearch security plugin with org.opendistro.security.user.User and try it? |
Why would an old node even need to be aware of the new namespace at runtime? This is in the security plugin implementation so outside my knowledge sphere... is reflection being used somewhere where the namespace is being passed over the wire? |
|
Oh..of course it is. |
Gah, the full fix would also include replacing the InputStream variable here with StreamInput, so that transport-layer compatibility is handled correctly across breaking changes. That might be a longer-term fix, though. Either way, we really need to wrap the core logic changes with backward- and forward-compatibility version checks. |
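To make the idea of version checks concrete, here is a minimal sketch of a version-gated encoding that writes fields explicitly instead of relying on reflection. It uses plain `DataOutputStream`/`DataInputStream` as a stand-in and does not use the OpenSearch `StreamInput` API; the version constants, type identifiers, and class name `VersionGatedUserCodec` are all hypothetical placeholders.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class VersionGatedUserCodec {
    // Hypothetical wire versions: peers below V_RENAMED only know the old name.
    static final int V_LEGACY = 1;
    static final int V_RENAMED = 2;

    // Placeholder type identifiers, not the plugin's real class names.
    static final String LEGACY_ID = "legacy.pkg.User";
    static final String RENAMED_ID = "renamed.pkg.User";

    // The writer picks the identifier the peer is known to understand.
    static byte[] write(String userName, int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(peerVersion >= V_RENAMED ? RENAMED_ID : LEGACY_ID);
            out.writeUTF(userName);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static String typeIdOf(byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The reader accepts both identifiers and never loads a class by name.
    static String read(byte[] payload) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            String typeId = in.readUTF();
            if (!LEGACY_ID.equals(typeId) && !RENAMED_ID.equals(typeId)) {
                throw new IllegalArgumentException("unknown type id: " + typeId);
            }
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] toOldPeer = write("admin", V_LEGACY);
        byte[] toNewPeer = write("admin", V_RENAMED);
        System.out.println(typeIdOf(toOldPeer) + " -> " + read(toOldPeer));
        System.out.println(typeIdOf(toNewPeer) + " -> " + read(toNewPeer));
    }
}
```

Because the reader compares identifiers as plain strings, a rename becomes a compatibility decision in code rather than a class-loading failure.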
@nknize see #1278 for POC that I did. The change from |
It is purely a Java serialization/deserialization issue, with Java deserialization using reflection. Should there be any other changes that are incompatible with Java serialization, we would see the same problem. |
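As a general Java-level illustration (not necessarily what #1278 does), a reader can tolerate a renamed class by substituting the local class descriptor when it sees the old name in the stream. This sketch assumes the old and new classes have the same fields and `serialVersionUID`; `LegacyUser`, `RenamedUser`, and `RemappingObjectInputStream` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Map;

class RenameTolerantRead {
    // Hypothetical "old namespace" class.
    static class LegacyUser implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        LegacyUser(String name) { this.name = name; }
    }

    // Hypothetical "new namespace" class with the same fields and serialVersionUID.
    static class RenamedUser implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        RenamedUser(String name) { this.name = name; }
    }

    // Swaps the descriptor read from the stream for the local class's descriptor
    // whenever the stream names a class listed in the rename table.
    static class RemappingObjectInputStream extends ObjectInputStream {
        private final Map<String, Class<?>> renames;

        RemappingObjectInputStream(InputStream in, Map<String, Class<?>> renames) throws IOException {
            super(in);
            this.renames = renames;
        }

        @Override
        protected ObjectStreamClass readClassDescriptor() throws IOException, ClassNotFoundException {
            ObjectStreamClass fromStream = super.readClassDescriptor();
            Class<?> local = renames.get(fromStream.getName());
            return local == null ? fromStream : ObjectStreamClass.lookup(local);
        }
    }

    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object read(byte[] stream, Map<String, Class<?>> renames) {
        try (ObjectInputStream in =
                 new RemappingObjectInputStream(new ByteArrayInputStream(stream), renames)) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stream = serialize(new LegacyUser("admin"));
        Object value = read(stream, Map.of(LegacyUser.class.getName(), RenamedUser.class));
        System.out.println(value.getClass().getSimpleName() + " " + ((RenamedUser) value).name);
    }
}
```

This only covers the reading side. In a mixed cluster the old nodes cannot be patched this way, so the sender would still need to emit a name the old nodes recognize.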
@cliu123 it would be good to give me credit when presenting option 1. In open source projects it is common practice to wait for an author to present a solution, or at least to give the author credit. |
Fixed by #1278. If not, please reopen. |
Will do. |
Re-opening this.
|
Looks like I don't have reopen permissions :-( |
Can you provide the test case? |
Looks like this is what happened: |
Please double check that the fix from #1278 is applied on the node. The stack trace points to line 185 and it should be line 250 for the security plugin with the applied fix. |
The fix was missing on one node. After adding the fix to that node, we are good. Closing this 👍 |
Describe the bug
To Reproduce
Steps to reproduce the behavior:
Copy the data folder and yml config to OpenSearch
Host/Environment (please complete the following information):
Linux AL2
Additional context