What is the bug?
When a user with no permissions to access indices executes the Explain API with a request such as GET /_opendistro/_ism/explain/*, the system executes the API on all indices that match the "*" wildcard expression. If the cluster has a large number of indices, the resulting security exceptions trigger recursive failures that eventually end in a StackOverflowError.
These recursive failures can block the transport worker threads executing the transport action for a long duration, causing high CPU utilization and starving other critical operations on the node, such as acknowledging pings from the master nodes. Failure to acknowledge master pings in time causes the node to be disconnected from the master's perspective until it sends a join request again and rejoins the cluster.
How can one reproduce the bug?
On a cluster with a large number of indices (on the order of a few hundred), a user with no permissions to access indices executes the Explain API with a request such as GET /_opendistro/_ism/explain/*.
What is the expected behavior?
Instead of recursive calls, the implementation should iterate over the indices.
Handle security exceptions during the ISM Explain API call more gracefully, and backport the same fix to OpenSearch 1.x.
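A minimal sketch of the iterative approach described above, assuming a simplified model of the plugin: instead of re-invoking the explain filter from each index's failure callback (one stack frame per denied index), the handler loops over the matched indices, records each per-index failure, and responds once. All names (IterativeExplain, PermissionChecker, explainAll) are illustrative and are not the actual ISM APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IterativeExplain {

    // Hypothetical stand-in for the security plugin's permission check.
    interface PermissionChecker {
        // Throws SecurityException if the user may not access the index.
        void check(String index);
    }

    // Iterates over all matched indices at constant stack depth,
    // collecting an "explained" result or a failure reason per index.
    static Map<String, String> explainAll(List<String> indices, PermissionChecker checker) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String index : indices) {
            try {
                checker.check(index);
                results.put(index, "explained");
            } catch (SecurityException e) {
                // Record the denial and move on; no recursive re-entry.
                results.put(index, "denied: " + e.getMessage());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> indices = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            indices.add("index-" + i);
        }
        // Deny everything, as for a user with no index permissions.
        Map<String, String> out = explainAll(indices, idx -> {
            throw new SecurityException("no permissions for [" + idx + "]");
        });
        System.out.println(out.size()); // all indices processed, stack depth constant
    }
}
```

Even with every index denied, the loop completes for an arbitrarily large wildcard match, because failures are accumulated rather than propagated through nested callbacks.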
What is your host/environment?
OS: Linux
Version: 1.x, 2.x
Plugin: ISM
Do you have any additional context?
100.0% (499.8ms out of 500ms) cpu usage by thread 'opensearch[c2adf5a75cce24647b5920c2d0a625ae][transport_worker][T#12]'
5/10 snapshots sharing following 10244 elements
org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.filter(TransportExplainAction.kt:391)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.access$filter(TransportExplainAction.kt:116)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onFailure(TransportExplainAction.kt:414)
app//org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:113)
org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:376)
org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.filter(TransportExplainAction.kt:391)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.access$filter(TransportExplainAction.kt:116)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onFailure(TransportExplainAction.kt:414)
app//org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:113)
org.opensearch.security.filter.SecurityFilter.apply0(SecurityFilter.java:376)
org.opensearch.security.filter.SecurityFilter.apply(SecurityFilter.java:154)
app//org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:193)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:170)
app//org.opensearch.action.support.TransportAction.execute(TransportAction.java:98)
app//org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:108)
app//org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:95)
app//org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:433)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.filter(TransportExplainAction.kt:391)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler.access$filter(TransportExplainAction.kt:116)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onResponse(TransportExplainAction.kt:402)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.explain.TransportExplainAction$ExplainHandler$filter$1.onResponse(TransportExplainAction.kt:394)
app//org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:104)
app//org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:98)
org.opensearch.indexmanagement.indexstatemanagement.transport.action.managedIndex.TransportManagedIndexAction.doExecute(TransportManagedIndexAction.kt:36)
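The repeating frame groups in the dump above (SecurityFilter.apply -> TransportAction.execute -> ExplainHandler.filter -> onFailure -> SecurityFilter.apply, over 10,000 elements deep) show the recursion directly: each denied index re-enters the same filter method. A toy model of that failure mode, with hypothetical names, demonstrates why the stack grows linearly with the number of denied indices and eventually overflows (the JVM does no tail-call elimination):

```java
public class RecursiveFailureDemo {

    static int processed = 0;

    // Models onFailure(index N) re-entering filter(index N+1):
    // one extra stack frame per denied index.
    static void filter(int remaining) {
        if (remaining == 0) {
            return;
        }
        processed++;
        filter(remaining - 1);
    }

    public static void main(String[] args) {
        try {
            filter(10_000_000); // a large wildcard match, every index denied
            System.out.println("completed");
        } catch (StackOverflowError e) {
            // With a default thread stack, this is reached long before 10M frames.
            System.out.println("StackOverflowError after " + processed + " indices");
        }
    }
}
```

The iterative rewrite removes exactly this linear stack growth, which is also why the transport worker stops burning CPU unwinding and rethrowing through thousands of frames.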
The code path for reference:
ISM recursive calls on exception: https://code.amazon.com/packages/Opendistro-index-management/blobs/45447daa645e3a1cd3d25b78b9f8bfdd8df6894e/--/src/main/kotlin/org/opensearch/indexmanagement/indexstatemanagement/transport/action/explain/TransportExplainAction.kt#L414
Security Plugin exception: https://code.amazon.com/packages/Opendistro-for-elasticsearch-security/blobs/15936499fbf455a46189a2eb74267058c0ff4212/--/security/src/main/java/org/opensearch/security/filter/SecurityFilter.java#L319
Eventual failure with StackOverflow: https://code.amazon.com/packages/Opendistro-for-elasticsearch-security/blobs/15936499fbf455a46189a2eb74267058c0ff4212/--/security/src/main/java/org/opensearch/security/filter/SecurityFilter.java#L376