Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A build that starts immediately after the completion of a prior build for the same branch sometimes gets the wrong merge #269

Closed
larafields opened this issue May 15, 2015 · 14 comments

Comments

@larafields
Copy link

This appears to be a timing issue with the following steps to reproduce (reproduces fairly reliably for us, but as it is a timing issue it might not always reproduce for every repo)

  1. Trigger two PR based builds (from the trigger, not cron).
  2. Note that the builds both have the correct sha1 property defined (/pr/1/merge, /pr/2/merge)
  3. Check out the /pr/2/merge outside of jenkins so you can see what the "current" sha for that is.
  4. Fetch the above merge ref for PR 2 while the build for PR 1 is still running. Note the sha, and look up the commit in GitHub. Note that the parent does not include the commit for PR 1 (correct at this point, as PR 1 has not yet been merged)
  5. When build 2 starts, note the sha that is actually being built, and see that it matches the sha from step 4. Now fetch the merge ref for PR 2 again, and get the sha. Note that the sha from the fetch is now different from the sha being built. Look up the sha from the fetch in GitHub. Note that this, correctly, is the sha that results from GitHub having remerged PR 2 due to the update from PR 1

The problem is, the correct merge isn't the sha being built. The sha being built is the old merge from prior to PR1's build succeeding. We believe this to be a timing issue with GitHub. When build 2 starts, GitHub has not yet completed the re-merge, and so the fetch in the beginning of the build gets the old merge sha. For us, this can either cause the second build to fail at the very end when the commit notifier attempts to notify to the merge sha about the status of the build, or the second build runs and may fail because conditions (db upgrade or code coverage thresholds, for example) were changed by the previous build and this build does not contain those changes.

There is a workaround and some suggestions for a possible fix.

Workaround - Install the pre-scm-buildstep and run a 'sleep 5' prior to fetching from GitHub. This reliably allows GitHub time to process the re-merge so that when the build fetches, the correct new merge sha is retrieved.

Suggestions:

  1. Bake in a pause prior to running the fetch (this may be outside of the scope of the plugin's lifecycle)
  2. Send the commit notifier to the actual commit instead of the merge commit. This may have a hole in testing that users may have to deal with, as sometimes it isn't the "real" merge that is being tested. It would, however, not fail due to a SHA not found at the end of the build.
  3. Double-check that the sha to build still exists on GitHub prior to running the whole build. This could either fail the build or trigger a re-fetch from github - the refetch would be nicer, but at least a fail-fast would be helpful instead of holding up all other builds with one that will fail at the end, regardless of success of the actual job.
@DavidTanner
Copy link
Collaborator

There is an Advanced Project Setting for Quiet period. You can set that for the timeout. Once the build is scheduled the trigger has no more control over it.

@larafields
Copy link
Author

Repeating here the comment I put on the other linked issue. The quiet period does not work to assist with this problem. The quiet period starts immediately after the build is triggered, not when the build actually starts. If there is a build already running, the quiet period would have to be longer than the time to complete a full build in order to be effective. The suggestion I posted about using the pre SCM fetch plugin and putting in a manual sleep is the only way we have found to force a pause between the first build ending and the second build starting when the second build was already queued with the quiet period expired.

@DavidTanner
Copy link
Collaborator

When I first added the merge section I had the same issue. We were committing back a version number to the repo before "pushing the button." I had to add a pause for github to re-merge the PR before trying to send the command.

@larafields
Copy link
Author

Yeah, the pause has been pretty reliable at the front of the next build, although a pause at the end after the merge back to the golden branch would probably work as well - I'm not entirely sure what changed on GitHub's end, but the length of time the pending PRs take to settle was previously rapid enough that we didn't run into it on every stacked build, but is it happening now on every set of stacked builds without the pause we put in - 5 seconds seems to be a reliable buffer.

@larafields
Copy link
Author

I know there really isn't anything that can be done about the specific issue of the pr/x/merge branch returning the previous merge sha, since that is completely out of the plugin's control, but there is one part of it that could be addressed (since getting the older merge sha isn't always a problem for the build itself, depending on the code delta). When the commit notifier is triggered (it appears to be triggered by the plugin directly), the sha used is the merge commit. That causes the "No commit for SHA" exception in the other issue ticket - if that specific call to the commit notifier used the actual commit sha instead, the only failure would be caused by someone force pushing to the pr after the build had started, which is something that I would expect to fail anyway.

@DavidTanner
Copy link
Collaborator

When the trigger runs it queries github for the pull request, and gets the latest comments/commits. If it decides to build it sets up the environment variables, generates the cause, then ships it off to be run.

When the build starts, we execute Builds.onStarted which creates a commit status, and sets the build description.

Then when the build is completed Builds.onCompleted is run.

In between the schedule build and onStarted there is no input from the trigger. The GitSCM plugin handles the actual checkout. I think if there were a pause and then check the status after the merge the job could avoid having any artificial delays.

@larafields
Copy link
Author

So, that won't actually work - here is hopefully a better description of what and when the problem happens:

We are letting the PR builder handle the "push" of the merge. The problem isn't the merge of the first job - it is complete when the first job is over, with the correct "merged" status (so checking the status won't help here, the status is merged). The problem is the merges of all of the PRs that are pending for the same branch and have been approved (and thus triggered) and already are queued in Jenkins with pending status waiting to start BEFORE the first job completes. The trigger at that point has already queried github for the pull request, as well as the actual commit (as these are properties passed in as build parameters and used for the build description when the build is shipped off). During both the onStarted and onCompleted phases, one of the actions that the ghprb plugin takes is to notify GitHub (via the commit notifier api) that things are pending (onStarted), and then successful or failed (onCompleted). The problem is that the ghprb is using the MERGE commit and not the ACTUAL commit (it knows both of these at this point) for the notification at the end of the build (onCompleted)

The problem is that when build #2 is triggered, build #1 is still in progress (and not yet merged). The pull request builder has the correct pr, and has the correct "actual" commit sha. The timing issue is not here. The timing issue is in between Build 1 finishing and Build 2 (which was already pending when Build 1 completed) starting. Build 1 completes, successfully merges the PR. The status of the PR is merged and all is well for Build 1. But all is not well for Build 2. Build 2 starts immediately after Build 1 completes (because Build 2 was already pending, and thus the quiet period has expired and no longer applies). The very first thing Build 2 does (outside of the ghprb plugin's control) is to fetch from GitHub. The fetch for the PR for Build 2 returns /head and /merge for that PR. The /merge at this point returns the /merge from Build 2's PR prior to Build 1's merge succeeding. GitHub hasn't finished re-merging Build 2's PR into the branch to check mergeability yet (because that is done asynchronously AFTER Build 1's commit is merged successfully). Build 2 starts and fetches the /merge ref for it's PR. GitHub then finishes the asynchronous re-merge of PR 2, and clobbers the old /merge sha with the new /merge sha. It is too late for Build 2 however, because the /merge ref for Build 2 has the older merge sha. This only becomes problematic in the onCompleted phase, when the ghprb plugin tries to notify GitHub that the build is successful. The notification is sent to the MERGE sha, which no longer exists on GitHub, because it is the old sha that was clobbered when the GitHub re-merge finished. This causes the No Commit for SHA exception. If this notification (or, really, ALL of the notifications via the commit notifier) were sent to the ACTUAL sha instead of the MERGE sha, this error would not happen and as a side bonus, we would actually be able to see all of the updates that went to the commit in the PR history. (Since, right now, if it fails and the merge sha changes again, any of the messages that went to the old merge sha now are no longer visible in the PR, since the old sha is gone)

@DavidTanner
Copy link
Collaborator

There are a couple misconceptions here.

  1. The merge commit sha is never used by the plugin. The only commit sha used is of the last commit to the PR. The cause commit is the last commit. We can't notify the merge commit because it doesn't really exist. It is just a figment of GitHubs creation, and while GitHub is deciding if a PR can be merged, that sha really doesn't exist.

  2. If there is a pause after pushing the "button," then GitHub will have time to settle and recreate their merged sha. At that point there is still the possibility that the PR can't be merged and there will be no sha again.

I understand what you are saying, I just want to help explain the code. If you look through you will only ever see one commit value passed around. The ${sha1} variable is generated just before shipping off the build to be scheduled, where cause.getCommit() returns the real commit in question:

        final String commitSha = cause.isMerged() ? "origin/pr/" + cause.getPullID() + "/merge" : cause.getCommit();
        values.add(new StringParameterValue("sha1", commitSha));

@DavidTanner
Copy link
Collaborator

I was just thinking, the plugin gets called on start, it could check that the PR is stabilized or unmerfeable at that time and update/ pause the build

@larafields
Copy link
Author

What we were seeing is that the cause commit was not the actual last commit that the pr was based on - the cause commit was the commit at the tip of whatever ref the build is actually being built off of (which is the sha of the /merge, not the sha of the /head). The cause.getCommit() was returning the "real" commit, that the build was being built off of, but it is being built off of the /merge, not /head. The /head has the actual commit, the /merge has the pre-merged commit done by GitHub. That /merge sha is a real sha, and it is visible by itself in GitHub as long as the merge is valid - that is how we were able to track down what was happening, by looking at the sha that was being pushed to by the commit notifier at the end, and comparing that to the refspec tips from the /merge and /head for that PR in the local Jenkins workspace, and then looking up that sha directly in GitHub. We also noticed that If you run the build off of the ghprbActualCommit instead of the sha1 parameter, the behavior is totally different with the notifications - you actually get a bunch more detail about the build's final status showing up in the PR if it fails the build and it notifies the correct commit at the end. It may appear to work properly because as long as that merge commit is valid GitHub properly attributes the information to the PR, but it is not the correct commit. We were able to grab the commit that the notifier is using at the very end to notify the completed status, and verify on GitHub what commit is it - it is most certainly not the actual last commit of the PR. It is possible that it may appear to be the same if it is a fast forward merge, but ours very often are not (and, in fact, we prefer them not to be)

@larafields
Copy link
Author

With the second thought - is it possible to check with GitHub to see if the PR is stabilized? It should absolutely check if it is no longer mergeable when the build actually starts (since that could have changed) - a fail fast there would be helpful (but it will fail anyway at the end). In our case though, it is just not stabilized, it is almost always still mergeable.

@larafields
Copy link
Author

If you follow the steps to reproduce that I outlined, we had a clear and reproducible test case that would fail the second build every time, always with the no commit for sha error, and always trying to notify the incorrect commit. I understand you are saying it shouldn't work that way, but I can reproduce exactly the behavior I have outlined, every time.

@DavidTanner
Copy link
Collaborator

I have emailed GitHub asking if there is a way to tell when the PR settles.

@DavidTanner
Copy link
Collaborator

Please check out 1.21.1 once it is in the update-center. It should be up in the next couple of hours.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants