-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix monitor renew lock issue #1623
Conversation
Signed-off-by: Subhobrata Dey <[email protected]>
hi @eirsep , plz review. |
if (isLockReleased(existingLock)) { | ||
log.debug("lock is released or expired: {}", existingLock) | ||
val updateLock = LockModel(existingLock, getNow(), false) | ||
updateLock(updateLock, listener) | ||
} else { | ||
log.debug("Lock is NOT released or expired. {}", existingLock) | ||
listener.onResponse(null) | ||
log.debug("Lock is NOT released. {}", existingLock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plz add a test to verify expired locks are retrieved and deleted/re-used instead. create in now()- 10 mins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -37,6 +39,7 @@ class LockService(private val client: Client, private val clusterService: Cluste | |||
|
|||
companion object { | |||
const val LOCK_INDEX_NAME = ".opensearch-alerting-config-lock" | |||
val LOCK_EXPIRED_SECONDS = TimeValue(5, TimeUnit.MINUTES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not holding seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to mins.
val updateLock = LockModel(existingLock, getNow(), false) | ||
updateLock(updateLock, listener) | ||
} else { | ||
log.debug("Lock is NOT expired. {}", existingLock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plz explicitly mention we don't run Lock is NOT expired. Not running monitor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it.
@@ -331,12 +331,12 @@ object MonitorRunnerService : JobRunner, CoroutineScope, AbstractLifecycleCompon | |||
when (job) { | |||
is Workflow -> { | |||
launch { | |||
var lock: LockModel? = null | |||
var workflowLock: LockModel? = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and catch between try and finally for lock acquire logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added catch between try & finally block.
) { | ||
log.debug("Lock is expired. Renewing Lock {}", existingLock) | ||
val updateLock = LockModel(existingLock, getNow(), false) | ||
updateLock(updateLock, listener) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the previous node that owned the lock resumes its execution? i.e. node holding lock was unresponsive due to high traffic and temporarily left the cluster. After traffic subsides it rejoins the cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related question - are there any race conditions with multiple nodes attempting to claim an expired lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @engechas , each lock is specific to a monitor. & if the node running the old job goes down, the job is cancelled. the monitor job is then reassigned to a new node & it can then try to renew the lock with this fix now.
Signed-off-by: Subhobrata Dey <[email protected]>
Signed-off-by: Subhobrata Dey <[email protected]>
The backport to
To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.x
# Create a new branch
git switch --create backport-1623-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 25eec30dbf0b327b8d7e9b684e7e14c6dec74ca4
# Push it to GitHub
git push --set-upstream origin backport-1623-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.x Then, create a pull request where the |
The backport to
To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.15 2.15
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.15
# Create a new branch
git switch --create backport-1623-to-2.15
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 25eec30dbf0b327b8d7e9b684e7e14c6dec74ca4
# Push it to GitHub
git push --set-upstream origin backport-1623-to-2.15
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.15 Then, create a pull request where the |
The backport to
To backport manually, run these commands in your terminal: # Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/alerting/backport-2.16 2.16
# Navigate to the new working tree
pushd ../.worktrees/alerting/backport-2.16
# Create a new branch
git switch --create backport-1623-to-2.16
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 25eec30dbf0b327b8d7e9b684e7e14c6dec74ca4
# Push it to GitHub
git push --set-upstream origin backport-1623-to-2.16
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/alerting/backport-2.16 Then, create a pull request where the |
@sbcd90 plz backport to 2.15, 2.13,2.11, 2.16 |
Issue #, if available:
If a particular monitor run fails to release the lock, the lock can never be renewed and every monitor run is skipped.
Description of changes:
If a particular monitor run fails to release the lock, the lock can be renewed and acquired after a fixed timeout of
5 mins.
.CheckList:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.