Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix monitor renew lock issue #1623

Merged
merged 3 commits into from
Jul 31, 2024
Merged

Conversation

sbcd90
Copy link
Collaborator

@sbcd90 sbcd90 commented Jul 26, 2024

Issue #, if available:

If a particular monitor run fails to release the lock, the lock can never be renewed and every monitor run is skipped.
Description of changes:

If a particular monitor run fails to release the lock, the lock can be renewed and acquired after a fixed timeout of 5 mins..

CheckList:

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Subhobrata Dey <[email protected]>
@sbcd90
Copy link
Collaborator Author

sbcd90 commented Jul 26, 2024

hi @eirsep , plz review.

if (isLockReleased(existingLock)) {
log.debug("lock is released or expired: {}", existingLock)
val updateLock = LockModel(existingLock, getNow(), false)
updateLock(updateLock, listener)
} else {
log.debug("Lock is NOT released or expired. {}", existingLock)
listener.onResponse(null)
log.debug("Lock is NOT released. {}", existingLock)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz add a test to verify expired locks are retrieved and deleted/re-used instead. create in now()- 10 mins

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -37,6 +39,7 @@ class LockService(private val client: Client, private val clusterService: Cluste

companion object {
const val LOCK_INDEX_NAME = ".opensearch-alerting-config-lock"
val LOCK_EXPIRED_SECONDS = TimeValue(5, TimeUnit.MINUTES)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not holding seconds?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to mins.

val updateLock = LockModel(existingLock, getNow(), false)
updateLock(updateLock, listener)
} else {
log.debug("Lock is NOT expired. {}", existingLock)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz explicitly mention we don't run Lock is NOT expired. Not running monitor

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it.

@@ -331,12 +331,12 @@ object MonitorRunnerService : JobRunner, CoroutineScope, AbstractLifecycleCompon
when (job) {
is Workflow -> {
launch {
var lock: LockModel? = null
var workflowLock: LockModel? = null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and catch between try and finally for lock acquire logic

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added catch between try & finally block.

) {
log.debug("Lock is expired. Renewing Lock {}", existingLock)
val updateLock = LockModel(existingLock, getNow(), false)
updateLock(updateLock, listener)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the previous node that owned the lock resumes its execution? i.e. node holding lock was unresponsive due to high traffic and temporarily left the cluster. After traffic subsides it rejoins the cluster

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related question - are there any race conditions with multiple nodes attempting to claim an expired lock?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @engechas , each lock is specific to a monitor. & if the node running the old job goes down, the job is cancelled. the monitor job is then reassigned to a new node & it can then try to renew the lock with this fix now.

sbcd90 added 2 commits July 30, 2024 23:32
Signed-off-by: Subhobrata Dey <[email protected]>
Signed-off-by: Subhobrata Dey <[email protected]>
@sbcd90 sbcd90 merged commit 25eec30 into opensearch-project:main Jul 31, 2024
18 of 21 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.x
# Create a new branch
git switch --create backport-1623-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 25eec30dbf0b327b8d7e9b684e7e14c6dec74ca4
# Push it to GitHub
git push --set-upstream origin backport-1623-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport-1623-to-2.x.

@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.15 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.15 2.15
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.15
# Create a new branch
git switch --create backport-1623-to-2.15
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 25eec30dbf0b327b8d7e9b684e7e14c6dec74ca4
# Push it to GitHub
git push --set-upstream origin backport-1623-to-2.15
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.15

Then, create a pull request where the base branch is 2.15 and the compare/head branch is backport-1623-to-2.15.

@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.16 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.16 2.16
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.16
# Create a new branch
git switch --create backport-1623-to-2.16
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 25eec30dbf0b327b8d7e9b684e7e14c6dec74ca4
# Push it to GitHub
git push --set-upstream origin backport-1623-to-2.16
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.16

Then, create a pull request where the base branch is 2.16 and the compare/head branch is backport-1623-to-2.16.

@eirsep
Copy link
Member

eirsep commented Jul 31, 2024

@sbcd90 plz backport to 2.15, 2.13,2.11, 2.16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants