-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-49707] (part): Fail the build if an agent has been removed #104
[JENKINS-49707] (part): Fail the build if an agent has been removed #104
Conversation
…f so, fail the build immediately.
I rely on cross-disconnect durability of Pipeline builds for Swarm across my entire Jenkins fleet, so this would be a serious regression from my point of view. Would you be willing to run |
Sure, though I doubt that would reproduce any problem anyway, as this patch is about the case that a client is disconnected and removed while the Jenkins master is still running. To test this patch you would need to kill the client and then have it restart, in a single “story”. Reconnection of a Swarm client across restarts is what #47/#48 should have been affecting; in that case there was a grace period, needed since @basil this code looks wrong to me. It could all be replaced by simply jenkins.addNode(slave); since that method already replaces any existing node of the same name, without firing the deleted event. (Currently it always fires the created event, even when it ought to be firing updated; if I get a moment I will try to patch that.) |
|
My latest update to jenkinsci/swarm-plugin#81 adds a new test for this case in |
That change makes sense to me and my new integration tests still pass with it, so I opened jenkinsci/swarm-plugin#82. |
Even with jenkinsci/swarm-plugin#82, |
…letion, to avoid breaking a reconnecting Swarm client for example.
diff --git a/plugin/pom.xml b/plugin/pom.xml
index bfda46a..1a913a9 100644
--- a/plugin/pom.xml
+++ b/plugin/pom.xml
@@ -77,7 +77,7 @@
</plugins>
</build>
<properties>
- <workflow-support-plugin.version>2.13</workflow-support-plugin.version>
+ <workflow-support-plugin.version>3.3</workflow-support-plugin.version>
</properties>
<dependencyManagement>
<dependencies>
@@ -90,19 +90,19 @@
<dependency>
<groupId>org.jenkins-ci.plugins</groupId>
<artifactId>scm-api</artifactId>
- <version>1.3</version>
+ <version>2.2.6</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins</groupId>
<artifactId>structs</artifactId>
- <version>1.5</version>
+ <version>1.18</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins</groupId>
<artifactId>script-security</artifactId>
- <version>1.24</version>
+ <version>1.58</version>
<scope>test</scope>
</dependency>
</dependencies>
@@ -111,7 +111,7 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-api</artifactId>
- <version>2.12</version>
+ <version>2.33</version>
<scope>test</scope>
</dependency>
<dependency>
@@ -123,13 +123,13 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-cps</artifactId>
- <version>2.28</version>
+ <version>2.70</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-durable-task-step</artifactId>
- <version>2.4</version>
+ <version>2.32-rc917.b70f8edaf896</version> <!-- https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/104 -->
<scope>test</scope>
</dependency>
<dependency>
@@ -141,7 +141,7 @@
<dependency>
<groupId>org.jenkins-ci.plugins.workflow</groupId>
<artifactId>workflow-step-api</artifactId>
- <version>2.7</version>
+ <version>2.20</version>
<scope>test</scope>
</dependency>
<dependency>
diff --git a/pom.xml b/pom.xml
index d5fbe91..4ce3480 100644
--- a/pom.xml
+++ b/pom.xml
@@ -24,7 +24,7 @@
<version>3.18-SNAPSHOT</version>
<properties>
- <jenkins.version>2.60.3</jenkins.version>
+ <jenkins.version>2.150.1</jenkins.version>
<java.level>8</java.level>
</properties>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general, added some comments/questions.
src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
Outdated
Show resolved
Hide resolved
return null; | ||
} else { | ||
LOGGER.fine(() -> node + " has been removed for a while, assuming it is not coming back"); | ||
throw new FlowInterruptedException(Result.ABORTED, new ExecutorStepExecution.RemovedNodeCause()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ABORTED
really the result we want in this case? My initial thought was that FAILURE
would be more appropriate. Thoughts?
Looks like at least one reason for ABORTED
is that over in RemovedNodeListener
you call BodyExecution.cancel
, and CpsBodyExecution
always throws a FlowInterruptedException
with result ABORTED
in that method, so it makes sense to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally FAILURE
is used for cases where something about the build itself broke. The typical scenario here is that the build started running, something like a spot instance got killed, and so the build could not complete due to external causes. Prior to this patch (plus downstream where appropriate) the build would just hang until manually ABORTED
; now the abort is automatic.
src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java
Outdated
Show resolved
Hide resolved
// Part of JENKINS-49707: check whether an agent has been removed. | ||
// (Note that a Computer may be missing because a Node is offline, | ||
// and conversely after removing a Node its Computer may remain for a while. | ||
// Therefore we only fail here if _both_ are absent.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that really the case here? It looks like we only check the Node, not the Computer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FilePathUtils.find
above already checked nullness of Jenkins.getComputer
.
* {@link ExecutorPickle} is written to the stream first, which holds a {@link PlaceholderTask}, | ||
* and the {@link BodyExecution} holds {@link PlaceholderTask.Callback} whose {@link WorkspaceList.Lease} | ||
* is not processed by {@link WorkspaceListLeasePickle} since pickles are not recursive. | ||
* So we make a best effort and only try to cancel a body within the current session. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So IIUC, if you restart Jenkins while a Pipeline is executing inside of a node
block, and then remove the Node from Jenkins, that's how you get to the "Agent foo was deleted, but do not have a node body to cancel" branch in RemovedNodeListener? Or session
here refers to something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your understanding is correct. The patch to DurableTaskStep
should still suffice to abort the build if it was inside sh
when the node was removed.
If you remove the node during the restart, a different bit of code in ExecutorPickle
(existing long before this patch) kicks in and aborts the build before it even resumes the CPS VM thread.
Co-Authored-By: Devin Nusbaum <[email protected]>
Co-Authored-By: Devin Nusbaum <[email protected]>
@jglick Quick question, we came across this and I was wondering how to set |
|
For anyone who finds this later, its |
@slaughter550 you can file a PR for https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/master/CHANGELOG.md#232 as needed though I see no mention of the property to be misspelled. |
…ut `BodyExecution`. When we do not use pickles, there is no need for `transient`. Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak. This is only possible now that `Callback` holds the `ExecutorStepExecution`. And we can now properly handle `AsynchronousExecution.interrupt`.
…tion of `node` step, not kill whole build (#180) * Sketch of non-`Pickle`-based resumption of `ExecutorStepExecution` * Amending test from #34 * Fixing `Callback` * Exploring where to detect a failure eligible for retry * Making `normalNodeDisappearance` pass under somewhat altered circumstances * First successful `node` block retry * `FilePathDynamicContext` must take precedence over `ExecutorStepDynamicContext.FilePathTranslator` * `ExecutorStepTest.unloadableExecutorPickle` as currently conceived no longer makes sense * Fixing `ExecutorStepTest.nodeDisconnectMissingContextVariableException` by copying some diagnostic code from `FilePathDynamicContext` * Fixing `ExecutorStepTest.contextualizeFreshFilePathAfterAgentReconnection` * Marking `ExecutorStepRetryEligibility` beta * Javadoc imports * SpotBugs: does not make sense to allow `ExecutorStepDynamicContext.node` to be null * Reasonable behavior of `ExecutorPickleTest.canceledQueueItem` * User-visible logging about 5m timeout now that `ExecutorPickle.printWaitingMessage` is unused * `ExecutorPickleTest` → `ExecutorStepDynamicContextTest` * Move `TIMEOUT_WAITING_FOR_NODE_MILLIS` from `ExecutorPickle` to `ExecutorStepExecution` * Some user-visible logging when `ExecutorStepRetryEligibility` is _not_ used * Do not block on `FutureImpl`. After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer` which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message: ``` FINE o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart #1] java.util.concurrent.TimeoutException at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102) at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115) Caused: java.io.IOException at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85) at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73) at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95) at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139) at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135) at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127) at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570) at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553) at … FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back ``` * A `sh` step could fail to find a workspace at the moment it starts. java.lang.NullPointerException at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.start(DurableTaskStep.java:329) at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:319) at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:193) at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122) at … * Also handling `ClosedChannelException` as commonly thrown by `SimpleBuildStep`s and the like: jenkinsci/kubernetes-plugin#1083 (comment) * Also retry after `MissingContextVariableException` on `FilePath` * Remove `super.onResume` left over in #46; see jenkinsci/workflow-step-api-plugin#66 * Documenting need for JENKINS-30383 * jenkinsci/workflow-step-api-plugin#77 allows `retryNodeBlockSynchAcrossRestarts` to be fixed * jenkinsci/workflow-step-api-plugin#77 released * Expanding `MissingContextVariableException` type list after noticing that `isUnix` requests `Launcher` not `FilePath` * Never log a message with only `PlaceholderTask.cookie`; also need `.runId` for context * Suppressing uninteresting log message * Maybe need to call `StepContext.saveState`? * Deleting comment; for now it seems clearest to keep `ExecutorStepDynamicContext` as a distinct type * Resolve longstanding tech debt from #104 & #117 about `BodyExecution`. When we do not use pickles, there is no need for `transient`. Moving the reference to `ExecutorStepExecution` from `PlaceholderTask` avoids a memory leak. This is only possible now that `Callback` holds the `ExecutorStepExecution`. And we can now properly handle `AsynchronousExecution.interrupt`. * Sketch of updating `WorkspaceStepExecution`: #180 (comment) * Allowing `FilePathDynamicContext` to block waiting for an agent to reconnect. Not great since the CPS VM thread will only grant 5s. Better would be to avoid unnecessary use of `StepContext.get(FilePath.class)` (which implicitly requires a live `Channel`); perhaps `FilePathRepresentation` should become an API. Amends b65ff1c. #180 (comment) * Adding a comment about 8b08398 * 8b08398 seems to allow a968adf to be finished. * Refining 8b08398 to avoid tail call * `org.jenkinsci.plugins.workflow.support.concurrent.Futures` deprecated * SpotBugs reminds me that `WorkspaceStepExecution.Callback` needs an SVUID; taking one from trunk * Adding `NodeTranslator` * Beginning rewrite to `AgentErrorCondition` * Extracted `RetryExecutorStepTest` from `ExecutorStepTest` * Rely on blocking `ExecutorStepExecution.onResume` * Removing bogus `waitForMessage` (and unnecessary `interrupt`) from `unrestorableAgent`. Sometimes a failing `waitForMessage` seems to trigger https://maven.apache.org/surefire/maven-surefire-plugin/faq.html#corruptedstream though I cannot track down why. * `PlaceholderTask` may not hold a reference to `ExecutorStepExecution` * Equivalent of #184 for `ExecutorStepDynamicContext` * Assert jenkinsci/workflow-api-plugin@b1778a9 * jenkinsci/workflow-cps-plugin#534 & jenkinsci/workflow-job-plugin#260 released * Windows tests jenkinsci/workflow-basic-steps-plugin#203 (comment) * Avoiding `powershell` jenkinsci/workflow-basic-steps-plugin#203 (comment) * Better usage of Hamcrest; got a flake https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=6779178701 with `QueueTaskCancelled` instead of `RemovedNodeCause` * In fact the flake was in a different test (`canceledQueueItem`) * Forgot to make `canceledQueueItem` not use `sh`, though it should not really matter since this step is expected to fail anyway * Test flake: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/180/checks?check_run_id=7240329028 * Use `AgentOfflineException` from `ExecutorStepDynamicContext` as well * Can now run all of `AgentErrorConditionTest` * Responding to `CancellationException` in `ExecutorStepDynamicContext` with `QueueTaskCancelled`, analogous to handling in `ExecutorPickle` * Permit `canceledQueueItem` to pass on `RemovedNodeCause` * Normalizing indentation * jenkinsci/workflow-step-api-plugin#124 deprecates an overload which 2e95e3a only handles at one call site; TBD if others should be switched to `actualInterruption` * Refined `ExecutorStepDynamicContext` behavior in case of `quietingDown` #180 (comment) or `terminating` #180 (comment) * Add an explicit `serialVersionUID` to `ExecutorStepExecution.PlaceholderTask.Callback` #180 (comment) * Optimizing `withExecution` to only consider steps in the current build #180 (comment) * Pick up jenkinsci/workflow-api-plugin#256 #180 (comment) * jenkinsci/workflow-api-plugin#256 released
Improved behavior for one subcase of JENKINS-49707, that a disconnected agent is actually removed from Jenkins configuration for whatever reason. We assume that it is not going to be readded (under the same name), so there is no point in waiting for it to reconnect.
The one exception that I know of is that a Swarm agent will remove itself upon disconnect but might readd itself later, so this might break cross-disconnect durability of Pipeline builds for Swarm, but that seems like a far less likely case than typicalCloud
s.#47 and #48 do something similar across Jenkins restarts. I think the
TIMEOUT_WAITING_FOR_NODE_MILLIS
grace period there is irrelevant here, since that would be done before the program even resumes.FAILURE
vs.ABORTED
statustest on EC2 with spot instancesdid not manage to set up theec2
plugin successfullykubernetes
plugin with a killed podExecutorStepExecution
(would work not just forDurableTaskStep
)@schottsfired