[JENKINS-58085] Fix issue where stages are shown as waiting to start when they are running/completed #2017
Conversation
```diff
@@ -710,14 +711,14 @@ public BluePipelineStep getPipelineNodeStep(String id, Link parent) {
             // New wrapper object based on current execution
             FlowNodeWrapper newNodeWrapper = new FlowNodeWrapper(
-                oldNodeWrapper.getNode(), // Use old node because we want to show the final id not intermediary
+                currentNodeWrapper.getNode(),
```
This is the main fix. The old comment doesn't make sense to me. My understanding is that in this case, we have a real node from the current Pipeline. Its ID will never change, even when the build completes, so we already have the "final" ID. Because the code checks for a match using `findNodeWrapperByParentageIn` if it doesn't find a direct ID match, the ID of the real node might not match the ID of the corresponding node from the old build. This leads to a bug when we use that ID in reference to the current build in other contexts where we expect it to be a stage/branch or whatever, since that ID might be something totally different in the current build.
Maybe we should just delete `newNodeWrapper` and use `currentNodeWrapper` directly now that we aren't changing the fields?
The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.
Changing this will fix some of the bad graphs, and break others that are currently working. I used the old node because it's from a finished run, which has the "final" node ids.
> The root problem is the node ids do change during a run, thanks to the way this code walks the graph backwards instead of forwards.

The ID for an actual Pipeline `FlowNode` in a build will never change once it is added to the flow graph. Whether we walk the graph in reverse or not does not matter. In this particular loop, as far as I can tell, we are looping through `FlowNodeWrapper`s for `FlowNode`s that already exist in the flow graph of the current build, so their IDs will never change.

If you mean the problem is that a single `FlowNodeWrapper` that references nodes from the current build can end up referencing a different node once the build progresses further, because we are mutating `FlowNodeWrapper` or something is unstable in the visitor, that seems like a bug, and I am happy to look into fixing that here too if we can come up with a reproduction case.

> I used the old node because it's from a finished run, which has the "final" node ids.

The problem is that expecting the node IDs from a different build to match the node IDs in the current build is absolutely not something you can count on. It happens to work in some cases, but `findNodeByParentageIn` is not strict enough to guarantee that if two nodes match by that criteria their IDs also match, and nothing in a Pipeline will change the ID of a `FlowNode` once it has been added to the flow graph to make them match after the fact or something.

Further on in this code, where we use the IDs from the old build in what we return, is also a bit suspect, because one of the old IDs might refer to a different node in the current Pipeline, but since those nodes are marked as having not started it seems to not really matter. It would be safer to make the node IDs for the fake `FlowNodeWrapper`s some ID that could not possibly be a valid ID in the current Pipeline, but I don't know what assumptions the rest of the code makes about the IDs for the fake `FlowNodeWrapper`s.
Ok, I came up with two minimal test cases where nodes that show up while the build is in progress do not appear in the final graph. Both of them involve a `parallel` step with only one branch, so they're not particularly realistic, but they should help to figure out what is going on (and to check if my change affects the behavior or not).
Here is the Declarative test case:

```groovy
pipeline {
    agent none
    stages {
        stage('outer') {
            parallel {
                stage('inner') {
                    steps {
                        sleep 5
                    }
                }
            }
        }
    }
}
```
Here is the scripted test case:

```groovy
stage('outer') {
    parallel inner: {
        stage('inner') {
            sleep 5
        }
    }
}
```
In both cases, `outer` is shown in a separate bubble preceding `inner` until the build completes, at which point only `inner` is shown.
Ok, looking into those test cases, the main difference in the graph of `FlowNodeWrapper`s when the build is running vs completed is that when the build is running, the `FlowNodeWrapper.type` of the node for `inner` is `STAGE`, but when the build completes the `FlowNodeWrapper.type` for `inner` becomes `PARALLEL`. I'll keep looking to understand why and either fix it or at least add an ignored test case, although I'll probably open a separate PR for that.
I couldn't reproduce all of the situations that cause node drift reliably; it's definitely timing-sensitive. I think it occurs only when parallel stages are just starting or just ending, such that the nodes exist but not all the events for those nodes do yet. I don't really have more detail to give, other than my conclusion that it'd be quicker to build a new graph visitor that uses the newer Pipeline API behind a feature flag than to debug the exact event sequence that triggers it. Fresh eyes may be able to find it though.
Another possibility might be adopting a different ID scheme that is tied to the structure of the graph (akin to an XPath or something) rather than just reusing the IDs given to the Pipeline nodes by Jenkins.
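As a toy illustration of that idea (purely hypothetical; this is not an existing Blue Ocean API, and the names are made up), a structure-derived ID could be the path of enclosing stage/branch names down to the node, which stays stable across builds whose structure matches even when the numeric `FlowNode` IDs drift:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: identify a node by its path of stage/branch names
// from the root of the graph, instead of the numeric FlowNode ID that
// Jenkins assigns in creation order.
public class StructuralId {
    static String pathId(Deque<String> enclosingNames, String nodeName) {
        StringBuilder sb = new StringBuilder();
        for (String name : enclosingNames) { // head-to-tail: outermost first
            sb.append('/').append(name);
        }
        return sb.append('/').append(nodeName).toString();
    }

    public static void main(String[] args) {
        Deque<String> enclosing = new ArrayDeque<>();
        enclosing.addLast("outer");
        // The same stage gets the same ID in every build with this shape,
        // regardless of what numeric ID the FlowNode happens to have.
        System.out.println(pathId(enclosing, "inner")); // prints /outer/inner
    }
}
```

The obvious trade-off is that such IDs are only unique if sibling stage names are unique, which Jenkins does not enforce.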
Yeah, from what I can tell, the problem for my test cases from #2017 (comment) is that `ForkScanner.visitSimpleChunks`, which is used by `PipelineNodeGraphVisitor`, makes it difficult to identify what chunks belong to a parallel branch when the branch is not yet completed. In particular, when the build is still in progress, this condition is true for `inner`:

Line 321 in db8d55d:

```java
if (nestedbranches.size() != parallelBranchEndNodes.size()) {
```
Because of that, `parallelStart` exits early, and we won't identify `inner` as having type parallel until after it is complete, when `parallelBranchEndNode` is called to set up the data for `parallelStart`. I don't really see any obvious fix for that over here; I think it would make more sense to adapt `ForkScanner` to have it call some new `SimpleChunkVisitor` method named `inProgressParallelBranchEnd` or something in this case, so `PipelineNodeGraphVisitor` doesn't have to reverse engineer what is happening.
That said, my change doesn't affect these cases as far as I can tell; they are broken with or without my fix, and the problem is the `FlowNodeWrapper.type`, not the ID. Because the `type` doesn't match, `FlowNodeWrapper.probablySameNode` will return false in `PipelineNodeGraphVisitor.union` here:

Line 705 in db8d55d:

```java
if (!currentNodeWrapper.probablySameNode(oldNodeWrapper)) {
```

Because the overall condition is true, the graphs will be marked as incompatible, and so `union` will break out of the loop that contains my changes and just show the nodes from the current build, so my changes won't affect these cases.
For now, I can't think of an example where my changes will make things worse, so I am going to mark this PR as ready for review. If we come up with an example where it does make things worse, I am happy to look into it, and if there are any regressions from the PR, I am happy to look into them as well.
```diff
@@ -687,7 +687,8 @@ public BluePipelineStep getPipelineNodeStep(String id, Link parent) {
         boolean graphsAreCompatible = true;

-        Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size()); // indexed by id
+        // A map from the ID of a node from previousNodes to the corresponding node for the current build.
+        Map<String, FlowNodeWrapper> newNodes = new HashMap<>(previousNodes.size());
```
I thought it would be useful to clarify that the IDs here come from the old build, not the new build, and I updated the names of the variables that get put into the map to try to make that clearer as well.
AFAIK, Josh has worked the most on this space, so I added him as a reviewer to get feedback.
```diff
             currentNodeWrapper.getStatus(),
             currentNodeWrapper.getTiming(),
             currentNodeWrapper.getInputStep(),
             run
         );

-        newNodes.put(nodeId, newNodeWrapper);
+        newNodes.put(oldNodeWrapper.getId(), newNodeWrapper);
```
Ok, so with some more testing, it turns out that this is actually the only change needed to fix the test case I am working with. The problem is that `nodeId` is an ID from the current build, but all other `get`s/`put`s on `newNodes` use an ID from the old build, so when the IDs don't match across builds the logic in the rest of this method can get mixed up. Regardless of the discussion in #2017 (comment), I think we should use the ID from the old node here so that all of the keys in `newNodes` are IDs from the old build.
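The keying bug being described can be illustrated with a plain `HashMap` (the IDs and the "wrapper" value here are made-up stand-ins, not real Blue Ocean data): one `put` keyed by the current build's ID silently misses every `get` keyed by the old build's ID whenever the IDs drift between builds.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: mixing ID namespaces in one map makes
// lookups silently return null when the IDs differ across builds.
public class KeyingSketch {
    public static void main(String[] args) {
        Map<String, String> newNodes = new HashMap<>();
        String oldId = "12";     // ID of the stage in the previous build
        String currentId = "15"; // ID of the same stage in the current build

        // Pre-fix behavior: put keyed by the current build's ID...
        newNodes.put(currentId, "wrapper");
        // ...while every other lookup uses the old build's ID.
        System.out.println(newNodes.get(oldId)); // prints null

        // The fix: key the put by the old build's ID, like everything else.
        newNodes.clear();
        newNodes.put(oldId, "wrapper");
        System.out.println(newNodes.get(oldId)); // prints wrapper
    }
}
```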
```java
        {
            List<FlowNodeWrapper> nodes = unionPipelineNodes(b1, b2);
            assertStageAndEdges(nodes, "first", BlueRun.BlueRunState.FINISHED, BlueRun.BlueRunResult.SUCCESS, "second");
            assertStageAndEdges(nodes, "second", BlueRun.BlueRunState.RUNNING, BlueRun.BlueRunResult.UNKNOWN, "third");
```
If you revert the changes in `src/main`, all assertions expecting `BlueRunState.RUNNING` and `BlueRunResult.UNKNOWN` fail because they are `null` instead.
Test failures on ci.blueocean.io are
Rebuilding because I suspect the test failures are preexisting flakes, see #2017 (comment).
ATH is super flakey. I don't think I've seen those tests break before, but you could have fixed something that made other tests less flakey :)
LGTM 👍

@dwnusbaum Let's get @olamy's review and then merge and release it. Great job 🎉
LGTM

@olamy do you know when this will be released?
@graingert 1.19.0 has been released with this fix.
Signed-off-by: runzexia <[email protected]>
I'm still getting this after upgrading to 1.19.0
@feliwir Please post comments like that to the corresponding Jira issue or the jenkinsci-users mailing list instead of GitHub. Make sure to include steps to reproduce the problem, such as the Jenkinsfile you are using, otherwise we are unlikely to investigate because we do not have any leads on what the problem might be.
Description
See JENKINS-58085/JENKINS-53816 and some tickets linked from those.
This PR tweaks the fix for JENKINS-53816 added in #1961.
In some cases, when a build's flow graph changes from one build to the next (because the Pipeline was changed, or perhaps due to the result of a `when` condition or some dynamic Pipeline behavior in a Scripted Pipeline), but stays similar enough that Blue Ocean considers the overall graph structure the same, one of the nodes in the current Pipeline would use the node ID of what Blue Ocean thought was the corresponding node in previous builds. If the node ID for that node was not actually the same in both builds, the new node would be shown as having never started, and would not have any steps associated with it (because the API calls for that node ID would be for some other node entirely, perhaps one that wasn't actually a stage, so it had no children). Once the build completed, the node ID would be replaced with the actual node ID from the build, and things would be displayed correctly.

I'm not sure if this covers all of the issues users describe in JENKINS-58085/JENKINS-53816, but it fixes at least one problem I found. Just opening a draft PR for now while I try to come up with a regression test for the bug.
Here is the Pipeline I was using for testing
Submitter checklist
Reviewer checklist