
Unable to fetch/merge certain pull requests #19910

Closed
spiffxp opened this issue Nov 10, 2020 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@spiffxp
Member

spiffxp commented Nov 10, 2020

What happened:
Some presubmits are failing because they either:

  • can't fetch a given PR's ref
$ git fetch https://github.com/kubernetes/kubernetes.git pull/96426/head
fatal: couldn't find remote ref pull/96426/head
  • can fetch a ref, but it doesn't match what GitHub's API is saying
$ cat .git/FETCH_HEAD
1a439c03e242a0735f53bf54e0ac2c999844d836		'refs/pull/95921/head' of https://github.com/kubernetes/kubernetes
$ hub api /repos/kubernetes/kubernetes/pulls/95921 | jq -r .head.sha
92383b25ba1455d2ddc866a57b1b7e249c9b04f8

These errors are reproducible locally

What you expected to happen:

To be able to fetch a pull request from GitHub and have its HEAD SHA agree with GitHub's API

How to reproduce it (as minimally and precisely as possible):

$ git init
$ git fetch https://github.com/kubernetes/kubernetes.git pull/96426/head
fatal: couldn't find remote ref pull/96426/head
$ git init
$ git fetch https://github.com/kubernetes/kubernetes.git pull/95921/head
$ git show $(hub api /repos/kubernetes/kubernetes/pulls/95921 | jq -r .head.sha)
fatal: bad object 92383b25ba1455d2ddc866a57b1b7e249c9b04f8
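
One quick way to confirm the mismatch without a full fetch is to compare the ref the server advertises against the SHA the API reports. A sketch, reusing the same PR number as above:

# What the git server advertises for the pull ref:
$ git ls-remote https://github.com/kubernetes/kubernetes.git refs/pull/95921/head
# What the GitHub API reports as the head:
$ hub api /repos/kubernetes/kubernetes/pulls/95921 | jq -r .head.sha
# If the two SHAs disagree (or ls-remote prints nothing at all), the PR is affected.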

Please provide links to example occurrences, if any:

Anything else we need to know?:

I strongly suspect this is a problem on GitHub's side. I'm mostly opening this issue to have something to reference in the areas where it's happening.

One possible workaround / resiliency fix would be to ignore GitHub's API when it comes to choosing which SHA to checkout for which pull request.

@spiffxp added the kind/bug label Nov 10, 2020
@spiffxp
Member Author

spiffxp commented Nov 10, 2020

Maybe related, or another symptom of the underlying issue: #19908

@alvaroaleman
Member

One possible workaround / resiliency fix would be to ignore GitHub's API when it comes to choosing which SHA to checkout for which pull request.

That's not an option. The tests (AKA prowjob custom resources) contain the exact SHAs that get tested and we need that information to be accurate in order to determine what revisions got tested.
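
For context, that SHA is recorded on each ProwJob object, which is why it has to match reality. A sketch of reading it back with kubectl, assuming the usual spec.refs.pulls layout; the job name and namespace are placeholders:

# Head SHA Prow recorded for the first pull in this job's refs:
$ kubectl get prowjob <prowjob-name> -n <prow-namespace> -o jsonpath='{.spec.refs.pulls[0].sha}'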

@spiffxp
Member Author

spiffxp commented Nov 10, 2020

Filed a support ticket with GitHub and cc'ed the Kubernetes GitHub admin team, as I'm going to be unavailable for the rest of the week.

@robscott
Member

Some of the PRs that have referenced this issue above have since been merged, or at least have passing tests now. It's difficult for me to tell whether there actually is a workaround for this issue that I'm missing, or whether those PRs were not actually experiencing the same issue mine is. My PR had all tests passing last night; I made a couple of updates this morning in response to feedback, and all tests immediately failed with this merging issue. Here's what I've tried so far; unfortunately, none of these approaches has worked:

  • Rebasing on master
  • Creating a new branch + PR
  • Creating a new branch from master, cherry picking all commits on top of that

Does anyone have any potential workarounds here? This does seem to be limited to a subset of PRs but I'm not sure what makes the affected PRs unique.

@nikhita
Member

nikhita commented Nov 11, 2020

cc @kubernetes/owners

The GitHub admin team has escalated the issue with GitHub. Haven't heard any potential solutions/workarounds from them yet.

@justaugustus
Member

This appears to be blocking non-PR fetches for the releases underway right now.

ref: https://kubernetes.slack.com/archives/CJH2GBF7Y/p1605122968337200?thread_ts=1605122968.337200&cid=CJH2GBF7Y
cc: @kubernetes/release-engineering

@neolit123
Member

@robscott

It's difficult for me to tell if there actually is a workaround for this issue that I'm missing

The way I solved it for my PR was:

git pull --rebase origin master     # origin == k/k remote
git push -f mygithub                # mygithub == fork

@robscott
Member

@neolit123 thanks for the tip! Unfortunately I've tried that multiple times and it is not working for me.

@cici37
Contributor

cici37 commented Nov 11, 2020

I am with @robscott. Tried rebase/push and all tests still fail immediately.

@mrbobbytables
Member

If you see this issue with other PRs or areas, please cross-link here. We're using this as the tracking issue.

@robscott
Member

I've tried a few more variations of rebasing and even tried squashing my commits into one just to see if that would help. Still no luck. There are a lot of PRs referencing this issue, but from what I can tell, only 5 of them are experiencing the same merging issue I am (some variation of merge: d4b82643c9ecdb0aa25d9852f763e348d5fd72a4 - not something we can merge).

It sounds like rebasing PRs has worked for some people like @andrewsykim, but that has not worked for me.

@alvaroaleman
Member

alvaroaleman commented Nov 12, 2020

Okay, so I played around with this a bit (using Clayton's PR kubernetes/kubernetes#94866) and it seems the virtual reference is messed up somehow, but the commit is still available:

$ cd $(mktemp -d)
[2020-11-11-20:16]:[app.ci@ci]/tmp/tmp.PihszoTQSt
$ git init test; cd test; git remote add origin git@github.com:kubernetes/kubernetes.git
Initialized empty Git repository in /tmp/tmp.PihszoTQSt/test/.git/
[2020-11-11-20:17]:[app.ci@ci][HEAD].../tmp.PihszoTQSt/test

# This is what CI does, doesn't work
$ git fetch origin refs/pull/94866/head
remote: Enumerating objects: 244, done.
remote: Counting objects: 100% (244/244), done.
remote: Compressing objects: 100% (146/146), done.
remote: Total 1073177 (delta 120), reused 98 (delta 98), pack-reused 1072933
Receiving objects: 100% (1073177/1073177), 666.76 MiB | 22.21 MiB/s, done.
Resolving deltas: 100% (769836/769836), done.
From github.com:kubernetes/kubernetes
 * branch                    refs/pull/94866/head -> FETCH_HEAD
$ git checkout 5325bb5804634e388c7966b264c817efa58b1099
fatal: reference is not a tree: 5325bb5804634e388c7966b264c817efa58b1099

# Explicitly fetching the commit however works
$ git fetch origin 5325bb5804634e388c7966b264c817efa58b1099
remote: Enumerating objects: 243, done.
remote: Counting objects: 100% (243/243), done.
remote: Compressing objects: 100% (145/145), done.
remote: Total 1073177 (delta 120), reused 98 (delta 98), pack-reused 1072934
Receiving objects: 100% (1073177/1073177), 666.76 MiB | 25.91 MiB/s, done.
Resolving deltas: 100% (769836/769836), done.
From github.com:kubernetes/kubernetes
 * branch                    5325bb5804634e388c7966b264c817efa58b1099 -> FETCH_HEAD
[2020-11-11-20:21]:[app.ci@ci][HEAD].../tmp.PihszoTQSt/test+
$ git checkout 5325bb5804634e388c7966b264c817efa58b1099
Updating files: 100% (22519/22519), done.
HEAD is now at 5325bb58046 scheduler: Implement resource metrics at /metrics/resource

So I guess we could work around this by explicitly fetching the commit. Does that have any drawbacks?
Interestingly, this issue only appears on the k/k repo.
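
In shell terms, that workaround amounts to something like the following. A sketch only, not the actual clonerefs change, reusing the SHA and PR number from this comment:

# Prefer fetching the exact commit the API reports; fall back to the virtual pull ref.
$ git fetch origin 5325bb5804634e388c7966b264c817efa58b1099 || git fetch origin refs/pull/94866/head
$ git checkout FETCH_HEAD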

@mrbobbytables
Member

Just to update from GitHub's side: they are still investigating. They are aware of this thread and are following it.

@robscott
Member

I don't understand what has changed, but the PR I've been working on has started passing tests again (kubernetes/kubernetes#96440). This was after pushing a change to a commit. This does not seem like it should be fundamentally different from the various things I've tried earlier, so maybe something was fixed on the GitHub side?

alvaroaleman added a commit to alvaroaleman/test-infra that referenced this issue Nov 12, 2020
Mitigates kubernetes#19910 and avoids races where the pull head differs from the SHA because a push occurred in the meantime.
@mrbobbytables
Member

mrbobbytables commented Nov 12, 2020

Here is the latest from GitHub

Hi Bob,

Thanks for your patience while we investigate the errors you encountered.

We've flagged this ticket with our engineering team and they've advised the current workflow you're using is not recommended. Because updates to the API and refs are triggered by post-receive hooks, in certain circumstances they will fail to run. If these failures occur again our engineering team recommends pushing another commit (empty or otherwise) to the branch which should re-trigger the post-receive hooks to run.

Based on the above feedback from our engineering team, we recommend reviewing your current workflows, or implementing the workaround of re-triggering the post-receive hooks via another commit.

To help minimise the re-occurrence of the issues seen in this ticket the next time you have a release, I've gone ahead and created an internal issue for our engineering team to check why the repository has behaved the way it has for this release cycle. When we have more information to share, or if we require further details from you, we'll be in touch.

If you have any further questions, please let us know.

Regards,
Andrew
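
In practice, the suggested re-trigger looks something like this; the remote and branch names are placeholders for your fork and PR branch:

# Push an empty commit to the PR branch to re-run GitHub's post-receive hooks.
$ git commit --allow-empty -m "Retrigger post-receive hooks"
$ git push <your-fork> <pr-branch>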

@cici37
Contributor

cici37 commented Nov 12, 2020

Thank you! Pushing another commit works for me 🎉

@alvaroaleman
Member

alvaroaleman commented Nov 12, 2020

So we can and probably should update our clone code to cope with this (xref #19928) but:

  • This only happens with the post-receive hook that updates the refs; if the API one failed, we would never get to know about this SHA in the first place
  • This only happens in the k/k repo, not in others
  • This also breaks GitHub's official tooling: for example, gh pr checkout 95921 gives me 1a439c03e242a0735f53bf54e0ac2c999844d836 as the head commit, which is outdated (the current head at the time of writing is 92383b25ba1455d2ddc866a57b1b7e249c9b04f8); see the sketch below
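
A sketch of that last check, assuming it is run from inside a kubernetes/kubernetes clone with gh, hub, and jq available:

$ gh pr checkout 95921
# Head commit gh actually checked out:
$ git rev-parse HEAD
# Head commit the GitHub API reports:
$ hub api /repos/kubernetes/kubernetes/pulls/95921 | jq -r .head.sha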


@stevekuznetsov
Contributor

@mrbobbytables if it's at all possible, I think it might be useful to gently push back on the idea that this is "not recommended": it's an API they provide explicitly for this purpose, and their official tooling uses it. Hopefully such a "not recommended" stance is not GitHub Engineering saying that this won't be fixed...

@mrbobbytables
Member

Here is the latest update from GitHub:

Hi Bob,

As it has been a little while since our last response I wanted to let you know what our engineering team have done to reduce the rate of failures you have raised in this support request.

Our infrastructure replicates repository data across several hosts in various geographically dispersed data centres. The higher than normal repository traffic in the lead-up to your code freeze window, along with the longer round trip time to one of the active replicas, resulted in a higher than normal failure rate for some operations.

The engineering team have since removed the host with the longer round trip time from participating as an active replica for the kubernetes/kubernetes repository, and as such you should now be seeing a much lower failure rate. We understand the period of peak load has since passed so it may also have a positive impact on reducing the failure rate.

Please let us know if you're still unable to fetch specific refs, or not seeing the correct commit SHA returned by the API. For instances where these failures are still occurring we recommend pushing another empty commit to trigger the post-receive hooks. With the above replica no longer active, we believe you should see the last of these issues resolved.

The internal issue with our engineering team is still open, and we'll continue to work with them to understand how we can mitigate what we saw during this release period, from occurring again at your next release cycle.

If you have any questions, please let us know.

Regards,
Andrew

@spiffxp
Member Author

spiffxp commented Nov 20, 2020

/close

To recap:

  • Nov 10 - issue reported, escalated to GitHub
  • Nov 11 - updated prow's clonerefs init container to fetch by SHA if available (Clonerefs: Fetch sha instead of pull head if available #19928)
  • Nov 12-13 - sporadic reoccurrences of the issue
  • Nov 16 - GitHub reports removing a high-latency replica from the infra responsible for hosting kubernetes/kubernetes (sometime since Nov 12)

We've received no further reports of this issue recurring in the last 7 days, so I am calling this issue closed. Huge thanks to @alvaroaleman for updating prow, @mrbobbytables for coordinating with GitHub, and @justaugustus for notifying kubernetes-dev@

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close

To recap:

  • Nov 10 - issue reported, escalated to GitHub
  • Nov 11 - updated prow's clonerefs init container to fetch by SHA if available (Clonerefs: Fetch sha instead of pull head if available #19928)
  • Nov 12-13 - sporadic reoccurrences of the issue
  • Nov 16 - GitHub reports removing a high-latency replica from the infra responsible for hosting kubernetes/kubernetes (sometime since Nov 12)

We've received no further reports of this issue recurring in the last 7 days, so I am calling this issue closed. Huge thanks to @alvaroaleman for updating prow, @mrbobbytables for coordinating with GitHub, and @justaugustus for notifying kubernetes-dev@

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
