Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync (CI): run on arm64 runners #14

Merged
merged 1 commit into from
Apr 30, 2024
Merged

sync (CI): run on arm64 runners #14

merged 1 commit into from
Apr 30, 2024

Conversation

dennisameling
Copy link
Collaborator

Some Pacman updates require post-install scripts to run. Since some of these binaries are native arm64 ones, we should be running these on native arm64 runners, instead of GitHub's x64 ones.

Ref: #13

Some Pacman updates require post-install scripts to run. Since some of these binaries are native arm64 ones, we should be running these on native arm64 runners, instead of GitHub's x64 ones.

Ref: #13
Signed-off-by: Dennis Ameling <[email protected]>
@dscho dscho merged commit 36b089f into main Apr 30, 2024
@dscho dscho deleted the arm64-runners-sync branch April 30, 2024 11:05
@dscho
Copy link
Member

dscho commented Apr 30, 2024

@dscho
Copy link
Member

dscho commented Apr 30, 2024

I guess too many concurrently-triggered workflow runs caused them to stumble over each other?

@dscho
Copy link
Member

dscho commented Apr 30, 2024

Hmm.
image
That's more resources than I thought we'd want.

And it's still only queued, and not running.

@dennisameling maybe this is the reason?

    ACTIONS_RUNNER_REPO: git-for-windows-automation

For the sync job, this should be the git-sdk-arm64 repository. In other words, this line must not hard-code the repository, but it must use repo instead. I think.

@dscho
Copy link
Member

dscho commented Apr 30, 2024

I've started https://github.com/git-for-windows/git-for-windows-automation/actions/runs/8893890536 to see whether that succeeds in spinning up the runner in the correct repository.

@dscho
Copy link
Member

dscho commented Apr 30, 2024

to see whether that succeeds in spinning up the runner in the correct repository.

Yep, the sync is now running.

@dscho
Copy link
Member

dscho commented Apr 30, 2024

to see whether that succeeds in spinning up the runner in the correct repository.

Yep, the sync is now running.

And it failed ☹️

image

@dennisameling
Copy link
Collaborator Author

That's more resources than I thought we'd want.

That's most likely because of the git-artifacts run I kicked off, which runs some things in parallel (and thus causes multiple VMs to spin up).

And it failed ☹️

Do we really need pwsh or will powershell also work? The latter is an older version, but is guaranteed to be installed on Windows machines. Otherwise I can probably add something like winget install --id Microsoft.Powershell --source winget to install pwsh to the post-deployment script.

@dscho
Copy link
Member

dscho commented Apr 30, 2024

Do we really need pwsh or will powershell also work? The latter is an older version, but is guaranteed to be installed on Windows machines

I guess powershell will do ;-)

[EDIT]: Although... that will most likely let us run into that dreaded hang!

@dennisameling
Copy link
Collaborator Author

Alright, here's the option to choose from two approaches:

@dennisameling
Copy link
Collaborator Author

And it's still only queued, and not running.

@dennisameling maybe this is the reason?

    ACTIONS_RUNNER_REPO: git-for-windows-automation

For the sync job, this should be the git-sdk-arm64 repository. In other words, this line must not hard-code the repository, but it must use repo instead. I think.

Good catch! The call to triggerWorkflowDispatch should actually hardcode git-for-windows-automation, as that's where the workflow lives.

Instead, we need to provide the runner_repo input to said workflow, so that it creates the runner in the right repo.

Here's a PR that fixes that: git-for-windows/gfw-helper-github-app#76

@dennisameling
Copy link
Collaborator Author

And it failed ☹️

Looks like it's somewhat working now that pwsh is installed, so that's a start!

afbeelding

However, the script seems to hang after the Initial setup complete. MSYS2 is now ready to use message is printed to the console. I can test the script on my ARM64 device tomorrow and see if anything weird pops up. If you have some time to RDP into a fresh VM and see what's going on, that'd also be great.

In any case, let's prioritize this PR first to ensure we don't end up with a bunch of VMs that were created for the wrong repo 🙃

@dscho
Copy link
Member

dscho commented May 1, 2024

However, the script seems to hang after the Initial setup complete. MSYS2 is now ready to use message is printed to the console. I can test the script on my ARM64 device tomorrow and see if anything weird pops up. If you have some time to RDP into a fresh VM and see what's going on, that'd also be great.

I think the same thing is happening here as we've discussed previously: The pacman invocation somehow enters an unhappy code path in the MSYS2 runtime. Maybe we need to go forward with the hack I proposed to terminate the dead-locked process.

@dennisameling
Copy link
Collaborator Author

Maybe we need to go forward with git-for-windows/git#4883 (comment) to terminate the dead-locked process.

I think that sounds great. I have my ARM64 device (Surface Pro X) where I can reproduce the pacman hang consistently enough to work and iterate on the patch. I'm a bit short on time atm, but realize that the arm64 sync job is now in a bad state, so will try to come up with a PR in a few hours 🙏🏼

@dennisameling
Copy link
Collaborator Author

dennisameling commented May 1, 2024

Was just fiddling with this a bit more. Some observations:

  • When pacman.exe hangs (if invoked through bash), there's always two pacman.exe instances. Killing one of them sometimes allows the process to continue without problems. I guess the hack you proposed would be to just kill all pacman processes after a certain amount of time, right? And then simply try again? We could set it to 2 mins for the sync job on arm64, appreciating the fact that x64 operations like bash are a bit slower on ARM64 due to emulation. WDYT?
  • It looks like (but correct me if I'm wrong) all hangs are related to GPG-related functionality. I kicked off a few pipelines with pacman --debug (PR here) which gave me a bit more visibility into the process. The logs are here, here, here, here and here. The hangs are very consistent after debug: GPGME version: 1.23.2 and debug: validity: full; reason: Success. Not sure if that info helps at all, but I think it's good that there seems to be consistency.
  • I haven't been able to run a single sync job successfully without hangs, but it does work most of the times on my Surface Pro X and when I run update-via-pacman.ps1 in a self-hosted VM that has the post-deployment stuff configured as well. So maybe the retry mechanism discussed above would work, trying up to 5 (?) times if pacman hangs.

@dennisameling
Copy link
Collaborator Author

Regarding the hang after the debug: GPGME version: 1.23.2 line: it looks like you already found that back in February. 😞

@dscho
Copy link
Member

dscho commented May 1, 2024

  • We could set it to 2 mins for the sync job on arm64, appreciating the fact that x64 operations like bash are a bit slower on ARM64 due to emulation.

Sounds like a good plan to me!

Regarding the hang after the debug: GPGME version: 1.23.2 line: it looks like you already found that back in February. 😞

Yep. And it looks as if I have to work on this during my free time, too, that's why it's taking such a long time.

maybe the retry mechanism discussed above would work, trying up to 5 (?) times if pacman hangs.

When I monitored the build-and-deploy runs manually, by logging into the runner VM via RDP, No retry was necessary, oddly enough. It was always a pacman process that was hanging, it was always a process whose CommandLine was identical to its parent process's CommandLine (suggesting that it was really a fork()ed child process that somehow deadlocked on some resource), and terminating that child process would always result in successful completion of the parent process. Mind, I had to terminate several dozens of waiting processes that way to allow the respective workflow job to succeed...

@dscho
Copy link
Member

dscho commented May 3, 2024

Maybe we need to go forward with the hack I proposed to terminate the dead-locked process.

Hmm. This seems not to have worked as well as I hoped.

@dscho
Copy link
Member

dscho commented May 5, 2024

This will hopefully provide the work-around we need to proceed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants