[BUG] ISM Explain API calls can trigger a recursive permission exception blocking management and network threads #414

Closed
getsaurabh02 opened this issue Jul 14, 2022 · 1 comment
Labels
bug Something isn't working

getsaurabh02 commented Jul 14, 2022

What is the bug?
When a user with no permission to access indices calls the Explain API with a request such as GET /_opendistro/_ism/explain/*, the API is executed against every index that matches the "*" wildcard expression. If the cluster has a large number of indices, the resulting permission exceptions trigger recursive failures, which eventually end in a stack overflow.

These recursive failures can block the transport worker threads executing the transport action for a long time. This causes high CPU utilization and starves other critical operations on the node, such as acknowledging pings from the master node. When master pings are not acknowledged in time, the master treats the node as disconnected until the node sends a join request and rejoins the cluster.

The code path for reference:

ISM recursive calls on exception: https://code.amazon.com/packages/Opendistro-index-management/blobs/45447daa645e3a1cd3d25b78b9f8bfdd8df6894e/--/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/transport/action/explain/TransportExplainAction.kt#L414

Security Plugin exception:
https://code.amazon.com/packages/Opendistro-for-elasticsearch-security/blobs/15936499fbf455a46189a2eb74267058c0ff4212/--/security/src/main/java/org/opensearch/security/filter/SecurityFilter.java#L319

Eventual failure with StackOverflow:
https://code.amazon.com/packages/Opendistro-for-elasticsearch-security/blobs/15936499fbf455a46189a2eb74267058c0ff4212/--/security/src/main/java/org/opensearch/security/filter/SecurityFilter.java#L376
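The repeating frames in the thread dump below suggest that each rejected permission check invokes the listener on the calling thread, and the listener then calls filter again for the next index. A minimal, self-contained sketch of that pattern follows. It is not the plugin's code: `Listener`, `checkPermission`, and `deepestNesting` are hypothetical stand-ins, and the check is assumed to fail synchronously.

```java
import java.util.ArrayList;
import java.util.List;

class RecursiveFilterSketch {
    // Hypothetical callback type standing in for an action listener.
    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    // Assumption: the permission check rejects on the caller's thread,
    // so onFailure runs before checkPermission returns.
    static void checkPermission(String index, Listener listener) {
        listener.onFailure(new SecurityException("no permissions for " + index));
    }

    static int maxDepth = 0;

    // Both callbacks move on to the next index by calling filter again,
    // so every index adds another layer of frames to the same stack.
    static void filter(List<String> indices, int position, int depth) {
        maxDepth = Math.max(maxDepth, depth);
        if (position >= indices.size()) {
            return;
        }
        checkPermission(indices.get(position), new Listener() {
            public void onResponse() {
                filter(indices, position + 1, depth + 1);
            }
            public void onFailure(Exception e) {
                filter(indices, position + 1, depth + 1);
            }
        });
    }

    // Returns the deepest nesting level reached for the given number of indices.
    static int deepestNesting(int indexCount) {
        List<String> indices = new ArrayList<>();
        for (int i = 0; i < indexCount; i++) {
            indices.add("index-" + i);
        }
        maxDepth = 0;
        filter(indices, 0, 0);
        return maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(deepestNesting(300)); // prints 300
    }
}
```

In this sketch the nesting depth equals the number of indices. The real call chain has many more frames per index than the sketch does, which would be consistent with a few hundred indices being enough to exhaust the stack.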

How can one reproduce the bug?
Use a cluster with a large number of indices (on the order of a few hundred). As a user with no permission to access indices, call the Explain API with a request such as GET /_opendistro/_ism/explain/*

What is the expected behavior?

  • Iterate over the indices instead of making recursive calls.
  • Handle security exceptions raised during the ISM “Explain” API call more gracefully, and backport the fix to OpenSearch 1.x.
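One possible shape for the iterative approach is sketched below. This is only an illustration under stated assumptions, not the actual fix: `Listener`, `checkPermission`, and `filterPermitted` are hypothetical names, and the permission check is assumed to complete synchronously on the calling thread. A truly asynchronous check would need a different continuation mechanism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class IterativeFilterSketch {
    // Hypothetical callback type standing in for an action listener.
    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    // Assumption: the check completes on the caller's thread, accepting
    // indices in the permitted set and rejecting all others.
    static void checkPermission(String index, Set<String> permitted, Listener listener) {
        if (permitted.contains(index)) {
            listener.onResponse();
        } else {
            listener.onFailure(new SecurityException("no permissions for " + index));
        }
    }

    // A plain loop drives the work. The callbacks only record the outcome and
    // never call back into the filter, so stack depth stays constant.
    static List<String> filterPermitted(List<String> indices, Set<String> permitted) {
        List<String> allowed = new ArrayList<>();
        for (String index : indices) {
            checkPermission(index, permitted, new Listener() {
                public void onResponse() {
                    allowed.add(index);
                }
                public void onFailure(Exception e) {
                    // Denied index: skip it and let the loop continue.
                }
            });
        }
        return allowed;
    }

    public static void main(String[] args) {
        List<String> result = filterPermitted(List.of("a", "b", "c"), Set.of("b"));
        System.out.println(result); // prints [b]
    }
}
```

Because a denied index is simply skipped, a user with no permissions gets an empty result after one pass over the indices rather than a chain of nested failures.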

What is your host/environment?

  • OS: [Linux]
  • Version: [1.x, 2.x]
  • ISM

Do you have any additional context?

100.0% (499.8ms out of 500ms) cpu usage by thread 'opensearch[c2adf5a75cce24647b5920c2d0a625ae][transport_worker][T#12]'
     5/10 snapshots sharing following 10244 elements
       org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
       app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
       app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
       app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.filter(TransportExplainAction.kt:391)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.access$filter(TransportExplainAction.kt:116)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onFailure(TransportExplainAction.kt:414)
       app//org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:113)
       org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:376)
       org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
       app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
       app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
       app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.filter(TransportExplainAction.kt:391)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.access$filter(TransportExplainAction.kt:116)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onFailure(TransportExplainAction.kt:414)
       app//org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:113)
       org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:376)
       org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
       app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
       app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
       app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
       app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
       app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.filter(TransportExplainAction.kt:391)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.access$filter(TransportExplainAction.kt:116)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onResponse(TransportExplainAction.kt:402)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onResponse(TransportExplainAction.kt:394)
       app//org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:104)
       app//org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:98)
       org.opensearch.indexmanagement.indexstatemanagement.transport.action.managedIndex.TransportManagedIndexAction.doExecute(TransportManagedIndexAction.kt:36)
       
getsaurabh02 commented

Another issue previously reported by the community with similar symptoms: #410
