doc-level monitor fan-out approach #1496
Conversation
I still have to review the changes in DocLevelMonitorRunner.
Please add tests for the failure cases:
- fan-out fails on 1 node
- fan-out fails on multiple nodes
- fan-out fails on all nodes
(Add code to retry execution on the local node itself for failed fan-out nodes, unless the local node's own execution fails, if the error is a NodeDisconnected or CircuitBreaker exception or any other node-specific exception. It doesn't matter if we don't cover all of them in the first go; start with these.)
Please run multi-node tests with -PnumNodes=10 (check the setting name).
table,
"1",
"finding_index_name",
boolQueryBuilder = BoolQueryBuilder()
What is this change?
This was introduced in another commit; it has nothing to do with this PR.
@@ -214,5 +215,17 @@ class AlertingSettings {
1,
Setting.Property.NodeScope, Setting.Property.Dynamic
)

/** Defines the threshold of the docs accumulated in memory to query against percolate query index in document
The Javadoc seems to be for a different setting.
Removed it.
@@ -310,7 +323,24 @@ object MonitorRunnerService : JobRunner, CoroutineScope, AbstractLifecycleCompon
"PERF_DEBUG: executing workflow ${job.id} on node " +
monitorCtx.clusterService!!.state().nodes().localNode.id
)
runJob(job, periodStart, periodEnd, false)
logger.debug(
"PERF_DEBUG: executing workflow ${job.id} on node " +
Remove PERF_DEBUG.
Removed it.
import org.opensearch.core.xcontent.XContentBuilder
import java.io.IOException

class DocLevelMonitorFanOutRequest : ActionRequest, ToXContentObject {
Add code comments and Javadocs.
val executionId: String
val indexExecutionContext: IndexExecutionContext
val shardIds: List<ShardId>
val concreteIndicesSeenSoFar: List<String>
Rename to concreteIndices and pass the full list of concrete indices.
We're passing concreteIndicesSeenSoFar, i.e. the concrete indices seen so far.
import org.opensearch.core.xcontent.XContentBuilder
import java.io.IOException

class DocLevelMonitorFanOutRequest : ActionRequest, ToXContentObject {
Please add ser-deser tests for writeTo(), readFrom(), an XContent round trip, and override equals() (or whatever the Kotlin equivalent is for deciding how objects are deemed equal).
Ser-deser tests added. No need for an XContent round trip, as there is no REST layer for these objects.
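A minimal sketch of such a ser-deser round-trip test, assuming the request class has a StreamInput constructor and an equals() override as discussed above; randomDocLevelMonitorFanOutRequest() is a hypothetical test helper, and exact import packages vary slightly across OpenSearch versions:

import org.opensearch.common.io.stream.BytesStreamOutput
import org.opensearch.core.common.io.stream.StreamInput

fun `test fan out request stream round trip`() {
    // build an arbitrary request (hypothetical helper)
    val request = randomDocLevelMonitorFanOutRequest()

    // serialize with writeTo() and deserialize through the StreamInput constructor
    val out = BytesStreamOutput()
    request.writeTo(out)
    val sin: StreamInput = out.bytes().streamInput()
    val deserialized = DocLevelMonitorFanOutRequest(sin)

    // relies on the equals() override to compare field by field
    assertEquals(request, deserialized)
}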
)
}

fun `test document-level monitor when aliases contain docs that do match query in a distributed way`() {
What does "when aliases contain docs that do match query in a distributed way" mean? Please add an elaborate method-level Javadoc explaining the intent of the test.
This test proves that the fan-out approach works for a single rolling index alias.
return errorMessage
}

internal fun isActionActionable(action: Action, alert: Alert?): Boolean {
Is this method duplicated? Can it be reused across monitors instead of duplicating the code?
Removed the duplication.
createFindings(monitor, docsToQueries, idQueryMap, true)
}
} else {
monitor.triggers.forEach {
Probably a gap in my understanding, but can you clarify the following: if a monitor has triggers, especially multiple triggers, do we generate findings multiple times? Shouldn't we be executing createFindings() before runForEachDocTrigger()?
This may need a logic change, but it is not a new issue introduced by this PR.
)
)
} catch (e: Exception) {
log.error("${request.monitor.id} Failed to run fan_out on node ${clusterService.localNode().id} due to error")
Nit: log the error message and rewrite it as follows:
log.error("Doc-Level Monitor ${request.monitor.id} : Failed to run fan_out on node ${clusterService.localNode().id} due to error", e)
Please add the exception to the error log.
Multi-node tests are very key; PR approvers, make sure we verify all multi-node tests are passing before merging.
When a cluster is upgrading from a version without fan-out logic to a version with fan-out logic and nodes from both the old and new versions are part of the cluster, the cluster will go down due to ser-der failures. Please submit fan-out requests only to nodes on versions compatible with the fan-out logic.
conflictingFields.toList(),
matchingDocIdsPerIndex?.get(concreteIndexName),
)

fetchShardDataAndMaybeExecutePercolateQueries(
monitor,
val shards = mutableSetOf<String>()
Parallelization should be done over the entire set of indices, not per index.
The number of requests made doesn't differ significantly in this approach compared to parallelizing over the entire set of indices. For example, if there are 2 indices with 11 and 7 shards respectively and the cluster has 5 nodes, parallelizing over individual indices gives ceil(11 / 5) + ceil(7 / 5) = 5 calls instead of ceil((11 + 7) / 5) = 4 calls.
The example you took is not at scale.
I agree on parallelizing over both shards and indices. There are use cases where data is stored in a large number of relatively small indices using a 1p1r (one primary, one replica) strategy.
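A small sketch of what distributing the flattened shard list of all indices across the fan-out nodes could look like (simple round-robin; the function and parameter names are illustrative, not the PR's actual code):

fun assignShardsAcrossIndices(
    allShardIds: List<String>,     // shards of every concrete index, flattened into one list
    fanOutNodeIds: List<String>
): Map<String, MutableSet<String>> {
    val assignments = fanOutNodeIds.associateWith { mutableSetOf<String>() }
    allShardIds.forEachIndexed { i, shardId ->
        // round-robin so every node gets at most ceil(total / nodes) shards
        assignments.getValue(fanOutNodeIds[i % fanOutNodeIds.size]).add(shardId)
    }
    return assignments
}

For the 11 + 7 shard example on 5 nodes, the flattened list of 18 shards puts at most ceil(18 / 5) = 4 shards on any node, versus ceil(11 / 5) + ceil(7 / 5) = 3 + 2 = 5 when each index is split independently.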
@@ -100,8 +56,11 @@ class DocumentLevelMonitorRunner : MonitorRunner() {
periodEnd: Instant,
dryrun: Boolean,
workflowRunContext: WorkflowRunContext?,
executionId: String
executionId: String,
transportService: TransportService?
Why even make this optional?
Not optional anymore.
shards.remove("index") | ||
shards.remove("shards_count") | ||
|
||
val nodeMap = getNodes(monitorCtx) |
Rename the method.
Add code comments.
queryToDocIds[it] = inputRunResults[it.id]!!
}
}
nodeShardAssignments.forEach {
This forEach() doesn't seem to be doing any business logic. Please remove it.
Removed.
val idQueryMap: Map<String, DocLevelQuery> = queries.associateBy { it.id }
val responses: Collection<DocLevelMonitorFanOutResponse> = suspendCoroutine { cont ->
We seem to be doing fan-out and parallelization within the index, and if there are multiple indices we still seem to be processing them synchronously. This isn't the intent, I think.
Multiple indices are not supported until now due to this limitation: https://github.com/opensearch-project/alerting/blob/main/alerting/src/main/kotlin/org/opensearch/alerting/MonitorMetadataService.kt#L212
}

override fun onFailure(e: Exception) {
logger.info("Fan out failed", e)
Log the node id and add retry logic.
Nit: logger.error()
Already taken care of here:
alerting/alerting/src/main/kotlin/org/opensearch/alerting/DocumentLevelMonitorRunner.kt, line 283 in 6586a10:
if (cause is ConnectTransportException ||
for (alert in alerts) {
triggerResult.actionResultsMap.getOrPut(alert.id) { mutableMapOf() }
triggerResult.actionResultsMap[alert.id]?.set(action.id, actionResults)
private fun buildTriggerResults(
Add code comments.
if (item.isFailed) {
logger.error("Failed indexing the finding ${item.id} of monitor [${monitor.id}]")
}
private fun buildInputRunResults(docLevelMonitorFanOutResponses: MutableList<DocLevelMonitorFanOutResponse>): InputRunResults {
Add code comments.
transformedDocs.clear()
docsSizeOfBatchInBytes = 0
}
private fun getNodes(monitorCtx: MonitorRunnerExecutionContext): MutableMap<String, DiscoveryNode> {
Add code comments and rename.
val nodeShardAssignments = mutableMapOf<String, MutableSet<ShardId>>()
var idx = 0
for (node in nodes) {
Where is the logic to pick nodes that contain a copy of the shard?
This logic would cause inter-node chatter.
There is no shard-copy logic yet, as shard numbers or ids are assigned to nodes directly, so the fan-out still doesn't run on local shards.
)
)
} catch (e: Exception) {
log.error("${request.monitor.id} Failed to run fan_out on node ${clusterService.localNode().id} due to error")
Please add the exception to the error log.
Does this code avoid fan-out for index patterns?
I've taken a partial pass and added a few small comments. I will need a few more passes to fully understand the changes.
In general, please try to submit smaller, bite-size PRs for these types of changes. It's nearly impossible to fully piece together the context of thousands of lines of diff.
}
}

private suspend fun executeMonitor(
I know this code is mostly taken from the DocLevelMonitorRunner, but moving it here is a perfect opportunity to break the logic into smaller, single-purpose classes.
The doc-level monitor does a few high-level things:
- Fetch documents
- Transform documents
- Execute percolate search
- Generate/index findings
- Execute triggers
All of these actions belong in their own class, IMO. In general we should move away from very large classes doing a lot of different things, as they are difficult to understand, difficult to debug, and difficult to write unit tests for.
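A minimal sketch of the decomposition suggested above; all interface and method names are illustrative, not the plugin's actual API, and the String/Map types stand in for the real document and query-result types:

interface DocumentFetcher {
    // pull new documents for a shard between two sequence numbers
    fun fetchShardDocs(shardId: String, fromSeqNo: Long, toSeqNo: Long): List<String>
}

interface DocumentTransformer {
    // flatten/transform raw source documents into the percolate-friendly form
    fun transform(rawDocs: List<String>): List<String>
}

interface PercolateQueryExecutor {
    // run the percolate search; returns queryId -> matching doc ids
    fun percolate(transformedDocs: List<String>): Map<String, List<String>>
}

interface FindingsPublisher {
    // create and index findings for the matched documents
    fun indexFindings(matches: Map<String, List<String>>)
}

interface TriggerExecutor {
    // evaluate doc-level triggers and fire actions
    fun runTriggers(matches: Map<String, List<String>>)
}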
): MonitorRunResult<DocumentLevelTriggerRunResult> {
if (transportService == null)
throw RuntimeException("transport service should not be null")
Minor: let's use a more descriptive error than RuntimeException. If this has a chance of being user facing, the wording could also suggest an action to resolve the issue, if there is one.
We shouldn't need this check if transportService is not optional.
Not optional anymore.
@@ -277,7 +275,25 @@ class DocumentLevelMonitorRunner : MonitorRunner() {
responseReader
) {
override fun handleException(e: TransportException) {
listener.onFailure(e)
// retry in local node
transportService.sendRequest(
Can we add a check to retry only if it is either a NodeDisconnectedException or a CircuitBreakingException? Please ensure a failure from the local node itself is not retried, or we will be stuck in an infinite loop.
Done.
Please add logs for the time taken to do the entire operation on the job coordinator node.
val responses: Collection<DocLevelMonitorFanOutResponse> = suspendCoroutine { cont ->
val listener = GroupedActionListener(
object : ActionListener<Collection<DocLevelMonitorFanOutResponse>> {
Let's use ActionListener.wrap() in all places to avoid hanging tasks.
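A sketch of how that could look at this call site, assuming the surrounding suspendCoroutine block provides cont and nodeShardAssignments as in the snippet above; ActionListener.wrap() routes any exception thrown by the response handler into onFailure so the coroutine always resumes instead of hanging:

val listener = GroupedActionListener(
    ActionListener.wrap(
        { responses: Collection<DocLevelMonitorFanOutResponse> ->
            cont.resume(responses)          // all fan-out responses collected
        },
        { e: Exception ->
            cont.resumeWithException(e)     // surfaces the failure instead of hanging the task
        }
    ),
    nodeShardAssignments.size
)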
@eirsep, this is fixed now. Added a BWC test to verify this: alerting/alerting/src/test/kotlin/org/opensearch/alerting/bwc/AlertingBackwardsCompatibilityIT.kt, line 109 in 9250a7d.
Will be taken up in a separate PR.
Added.
All the commits are titled
override fun handleException(e: TransportException) {
val cause = e.unwrapCause()
if (cause is ConnectTransportException ||
(e is RemoteTransportException && cause is NodeClosedException)
Please add NodeDisconnectedException, NodeNotConnectedException, ReceiveTimeoutTransportException, and CircuitBreakingException.
Added.
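A sketch of the resulting retry guard, combining the exception types named in this thread (imports from org.opensearch.transport, org.opensearch.node, and the circuit-breaker package are omitted since their exact paths differ across OpenSearch versions):

fun isRetryableFanOutFailure(e: TransportException): Boolean {
    // unwrap RemoteTransportException wrappers to reach the real node-specific cause
    val cause = e.unwrapCause()
    return cause is ConnectTransportException ||
        cause is NodeDisconnectedException ||
        cause is NodeNotConnectedException ||
        cause is ReceiveTimeoutTransportException ||
        cause is CircuitBreakingException ||
        cause is NodeClosedException ||
        cause is ActionNotFoundTransportException
}

As discussed above, when this returns true and the failing node is not the coordinator itself, the fan-out is retried on the local node; a local failure is not retried, to avoid an infinite loop.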
responseReader
) {
override fun handleException(e: TransportException) {
listener.onFailure(e)
Error log missing.
Added.
docsSizeOfBatchInBytes = 0
}
private fun getNodes(monitorCtx: MonitorRunnerExecutionContext): Map<String, DiscoveryNode> {
return monitorCtx.clusterService!!.state().nodes.dataNodes.filter { it.value.version >= Version.CURRENT }
If someone simply upgrades from an old 2.13 RC to a new 2.13 RC, would this work?
Even if this action is not found on the remote node, it is retried on the local node:
alerting/alerting/src/main/kotlin/org/opensearch/alerting/DocumentLevelMonitorRunner.kt, line 286 in 49b2374:
cause is ActionNotFoundTransportException
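A hedged sketch of the version gate being discussed (not the PR's exact code): fan-out requests go only to data nodes whose version carries the fan-out transport action, so mixed-version clusters mid-upgrade never receive a request they cannot deserialize. FAN_OUT_MIN_VERSION is an assumed constant for the first release that ships the action:

// filter the data nodes map (nodeId -> DiscoveryNode) by the version that introduced fan-out
val fanOutEligibleNodes = monitorCtx.clusterService!!.state().nodes.dataNodes
    .filter { (_, node) -> node.version.onOrAfter(FAN_OUT_MIN_VERSION) }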
Code generally looks good to me, and I wouldn't consider any of my comments blocking as they can be addressed in a follow-up PR if needed.
Some general items we should test before merging this:
- Upgrade testing with an existing monitor from a version before this change
- Performance testing, particularly when the shard count is significantly higher than the node count
I would also advocate for more test coverage of these changes, but that can be addressed in a follow-up PR.
updatedIndexName,
concreteIndexName,
updatedIndexNames,
concreteIndices,
Do we need to pass these twice?
triggerResults[triggerId] = documentLevelTriggerRunResult
triggerErrorMap[triggerId] = if (documentLevelTriggerRunResult.error != null) {
val error = if (documentLevelTriggerRunResult.error is AlertingException) {
documentLevelTriggerRunResult.error as AlertingException
Style nit: 6x nested. buildTriggerResults as a whole should be broken into a few private methods to improve readability.
workflowRunContext
)

transportService.sendRequest(
Do we need a rate-limiting mechanism to avoid overwhelming the cluster? Theoretically this could submit hundreds of requests in parallel, which could consume all of the cluster's resources until execution completes. This would interrupt other processes like indexing or user search traffic.
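One possible shape for such a throttle (not in the PR): a kotlinx.coroutines Semaphore capping how many fan-out requests are in flight at once; the cap of 4 is an arbitrary assumption here and could instead come from a cluster setting:

import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

// at most `permits` fan-out requests run concurrently; callers suspend until a slot frees up
val fanOutPermits = Semaphore(permits = 4)

suspend fun sendThrottled(
    sendFanOut: suspend () -> DocLevelMonitorFanOutResponse
): DocLevelMonitorFanOutResponse = fanOutPermits.withPermit { sendFanOut() }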
})
}
val totalShards = shards.size
val numFanOutNodes = allNodes.size.coerceAtMost((totalShards + 1) / 2)
I might have missed it earlier in the code, but what's the need to limit the number of fan out nodes to half the shard count?
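For reference, a small illustration of the cap in that line; the formula keeps roughly two or more shards per fan-out node, so a monitor never fans out to more nodes than it can keep busy (my reading of the intent, not a statement from the author):

fun maxFanOutNodes(eligibleNodeCount: Int, totalShards: Int): Int =
    eligibleNodeCount.coerceAtMost((totalShards + 1) / 2)

// e.g. 10 nodes,  5 shards -> min(10, (5 + 1) / 2)  = 3 fan-out nodes
//      10 nodes, 30 shards -> min(10, (30 + 1) / 2) = 10 fan-out nodes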
@@ -21,6 +21,7 @@ class AlertingSettings {
const val DEFAULT_PERCOLATE_QUERY_NUM_DOCS_IN_MEMORY = 50000
const val DEFAULT_PERCOLATE_QUERY_DOCS_SIZE_MEMORY_PERCENTAGE_LIMIT = 10
const val DEFAULT_DOC_LEVEL_MONITOR_SHARD_FETCH_SIZE = 10000
const val DEFAULT_FAN_OUT_NODES = 1000
Minor: 1000 seems arbitrary. If we are trying to avoid limiting the max number of fan-out nodes by default, this could be Int.MAX_VALUE for clarity.
cont.resumeWithException(e)
}
},
nodeShardAssignments.size
What happens if one of the nodes never returns a response? Would the code get stuck here waiting forever?
override fun onFailure(e: Exception) {
if (e.cause is Exception)
cont.resumeWithException(e.cause as Exception)
We can't resume with an exception here, I guess. GroupedActionListener doesn't tolerate partial failure. We might need to implement a new grouped listener that returns a list of responses and a list of exceptions, rather than one failure nullifying the rest of the responses.
This is handled now:
alerting/alerting/src/main/kotlin/org/opensearch/alerting/DocumentLevelMonitorRunner.kt, line 310 in 7356f94:
logger.error("Fan out retry failed in node ${localNode.id}")
https://github.com/opensearch-project/alerting/blob/7356f94d33cd6af97a99a4bad656b52d0911909f/alerting/src/main/kotlin/org/opensearch/alerting/DocumentLevelMonitorRunner.kt#L353-#L356
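A self-contained sketch of the "tolerant" grouped listener idea from this thread: it collects responses and per-node failures and completes once every node has answered, instead of letting one failure nullify the rest. Class and parameter names are illustrative:

import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

class PartialResultListener<T>(
    expectedResponses: Int,
    private val onComplete: (responses: List<T>, failures: List<Exception>) -> Unit
) {
    private val responses = ConcurrentLinkedQueue<T>()
    private val failures = ConcurrentLinkedQueue<Exception>()
    private val remaining = AtomicInteger(expectedResponses)

    fun onResponse(response: T) {
        responses.add(response)
        countDown()
    }

    fun onFailure(e: Exception) {
        failures.add(e)
        countDown()
    }

    // fires onComplete exactly once, after the last node has responded or failed
    private fun countDown() {
        if (remaining.decrementAndGet() == 0) {
            onComplete(responses.toList(), failures.toList())
        }
    }
}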
@@ -321,7 +308,15 @@ class DocumentLevelMonitorRunner : MonitorRunner() {
) {
override fun handleException(e: TransportException) {
logger.error("Fan out retry failed in node ${localNode.id}")
Please log the exception. If we don't log the exception, debugging is tough.
Addressed this.
The backport to 2.x failed.
To backport manually, run these commands in your terminal:
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.x
# Create a new branch
git switch --create backport-1496-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 65acca128211b60338088f24ea7bd3d1cc8ee964
# Push it to GitHub
git push --set-upstream origin backport-1496-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.x
Then, create a pull request where the
The backport to 2.13 failed.
To backport manually, run these commands in your terminal:
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.13 2.13
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.13
# Create a new branch
git switch --create backport-1496-to-2.13
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 65acca128211b60338088f24ea7bd3d1cc8ee964
# Push it to GitHub
git push --set-upstream origin backport-1496-to-2.13
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.13
Then, create a pull request where the
Issue #, if available:
Description of changes:
Doc-level monitor fan-out approach, forked from the original implementation by @eirsep: https://github.com/eirsep/alerting/tree/fan_out
Checklist:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.