Using time out in cluster state observer as we are reusing the observer #215

Merged
merged 1 commit into from
Oct 27, 2021

gbbafna
Collaborator

@gbbafna gbbafna commented Oct 27, 2021

Description

This makes waitForNextChange wait for the full timeout value every time it is called. Without this change, the cluster state observer does not update cso.startTimeMS, so the timeout is measured as a total across multiple calls. For example, the first call waits 60 seconds; after that, because cso.startTimeMS is not updated, waitForNextChange returns immediately. This wastes CPU cycles and floods the logs.
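
To make the timing behaviour concrete, here is a minimal sketch that models only what the description above states. It is not the OpenSearch ClusterStateObserver: the class `ModelObserver`, the function `simulate`, and the injected fake clock are hypothetical names introduced for illustration, and the 60 second default mirrors the example in the description.

```kotlin
// Simplified, hypothetical model of the behaviour described in this PR.
// This is NOT the OpenSearch ClusterStateObserver; all names are illustrative.
class ModelObserver(defaultTimeoutMs: Long, private val clock: () -> Long) {
    private var startTimeMs: Long = clock()   // set once, when the observer is created
    private var timeoutMs: Long = defaultTimeoutMs

    /** Returns how long the next wait may block, in ms (0 means "time out immediately"). */
    fun remainingWaitMs(explicitTimeoutMs: Long? = null): Long {
        if (explicitTimeoutMs != null) {
            // An explicit timeout restarts the measurement window.
            startTimeMs = clock()
            timeoutMs = explicitTimeoutMs
            return explicitTimeoutMs
        }
        // No explicit timeout: the budget is measured from the original start time.
        val elapsed = clock() - startTimeMs
        return maxOf(0L, timeoutMs - elapsed)
    }
}

/** Reuses one observer for several waits that all end in a timeout. */
fun simulate(passExplicitTimeout: Boolean, calls: Int = 3): List<Long> {
    var now = 0L                               // fake clock, in milliseconds
    val observer = ModelObserver(defaultTimeoutMs = 60_000L, clock = { now })
    val waits = mutableListOf<Long>()
    repeat(calls) {
        val wait = observer.remainingWaitMs(if (passExplicitTimeout) 60_000L else null)
        waits += wait
        now += wait                            // the caller blocks for the whole wait
    }
    return waits
}

fun main() {
    println(simulate(passExplicitTimeout = false)) // [60000, 0, 0]
    println(simulate(passExplicitTimeout = true))  // [60000, 60000, 60000]
}
```

In this model, a reused observer that never receives an explicit timeout spends its whole budget on the first wait, which matches the burst of identical timestamps in the "Logs before fix" section. Passing a timeout on every call gives each wait a full 60 seconds.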

Issues Resolved

#207

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@gbbafna
Collaborator Author

gbbafna commented Oct 27, 2021

Logs after fix

[2021-10-27T10:00:05,559][INFO ][o.o.r.t.i.IndexReplicationTask] [followCluster-0] [grab] In restoring state
[2021-10-27T10:00:05,569][DEBUG][o.o.c.c.PublicationTransportHandler] [followCluster-0] received diff cluster state version [10] with uuid [wGnixzedSougu_Atbb-PSQ], diff size [598]
[2021-10-27T10:00:05,789][DEBUG][o.o.c.c.C.CoordinatorPublication] [followCluster-0] publication ended successfully: Publication{term=1, version=10}
[2021-10-27T10:01:05,800][INFO ][o.o.r.t.i.IndexReplicationTask] [followCluster-0] [grab] Timed out while waiting for restore to complete.
[2021-10-27T10:02:05,803][INFO ][o.o.r.t.i.IndexReplicationTask] [followCluster-0] [grab] Timed out while waiting for restore to complete.
[2021-10-27T10:03:05,807][INFO ][o.o.r.t.i.IndexReplicationTask] [followCluster-0] [grab] Timed out while waiting for restore to complete.
[2021-10-27T10:04:05,811][INFO ][o.o.r.t.i.IndexReplicationTask] [followCluster-0] [grab] Timed out while waiting for restore to complete.

Logs before fix

[2021-10-13T07:49:33,977][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] In restoring state

[2021-10-13T07:50:33,170][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] Timed out while waiting for restore to complete.

[2021-10-13T07:50:33,170][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] Timed out while waiting for restore to complete.
[2021-10-13T07:50:33,170][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] Timed out while waiting for restore to complete.
[2021-10-13T07:50:33,170][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] Timed out while waiting for restore to complete.
[2021-10-13T07:50:33,171][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] Timed out while waiting for restore to complete.
[2021-10-13T07:50:33,171][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] Timed out while waiting for restore to complete.
[2021-10-13T07:50:33,171][INFO ][c.a.e.r.t.i.IndexReplicationTask] [a133cc96509ad479b0161b22d54d2199] [abcd] Timed out while waiting for restore to complete.

@@ -125,7 +125,7 @@ suspend fun ClusterStateObserver.waitForNextChange(reason: String, predicate: (C
             override fun onTimeout(timeout: TimeValue?) {
                 cont.resumeWithException(OpenSearchTimeoutException("timed out waiting for $reason"))
             }
-        }, predicate)
+        }, predicate, TimeValue(60000))
Member

If the timeout is not set, is the default value from the cluster service not taken into account?

Collaborator Author

It is taken into account, but it is measured from startTimeMS, which is initialized only once across multiple waitForNextChange calls.

Member

The global cso object in the index replication task seems to be causing the issue. Can we check the previous cluster state tracker in this observer?

Member

@ankitkala ankitkala Oct 27, 2021

The correct fix here would be to set startTimeMS and timeoutTimeLeftMS irrespective of whether timeOutValue is null. That would require changes in the OS repo, though.
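
As a rough illustration of what resetting the start time on every call would mean for callers, the sketch below compares the two policies for a caller that reuses one observer. It is a hypothetical model, not OpenSearch code: `BudgetModel`, `totalBlockedMs`, and the `resetEveryCall` flag are names invented here, and every wait is assumed to end in a timeout.

```kotlin
// Hypothetical model only; not the OpenSearch implementation.
// resetEveryCall = true mimics restarting the window on every wait,
// even when the caller passes no explicit timeout.
class BudgetModel(private val timeoutMs: Long, private val resetEveryCall: Boolean) {
    private var nowMs = 0L        // fake clock, in milliseconds
    private var startTimeMs = 0L  // start of the current measurement window

    /** Simulates one wait that ends in a timeout and returns how long it blocked. */
    fun waitUntilTimeout(): Long {
        if (resetEveryCall) startTimeMs = nowMs
        val remaining = maxOf(0L, timeoutMs - (nowMs - startTimeMs))
        nowMs += remaining
        return remaining
    }
}

/** Total time a caller is blocked across `calls` consecutive waits. */
fun totalBlockedMs(resetEveryCall: Boolean, calls: Int): Long {
    val model = BudgetModel(timeoutMs = 60_000L, resetEveryCall = resetEveryCall)
    return (1..calls).sumOf { model.waitUntilTimeout() }
}

fun main() {
    println(totalBlockedMs(resetEveryCall = false, calls = 3)) // 60000
    println(totalBlockedMs(resetEveryCall = true, calls = 3))  // 180000
}
```

Under these assumptions, a caller that relies on one overall budget across several waits would see that budget become a per-call budget if the reset were unconditional. That may be the kind of behaviour change for other use cases raised in the reply that follows.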

Collaborator Author

Yes @ankitkala, but that would change the behavior for other use cases and may break them as well.

@saikaranam-amazon: I'm not sure I understand your point. I will sync up offline and update it here.

Member

As discussed, the concern was about the reference to the previous cluster state. Based on the logic in the observer code, it does update to the latest state.

@ankitkala
Member

LGTM.

@gbbafna gbbafna merged commit ced0d66 into opensearch-project:main Oct 27, 2021
gbbafna added a commit to gbbafna/cross-cluster-replication that referenced this pull request Oct 27, 2021
gbbafna added a commit that referenced this pull request Oct 27, 2021