Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start is very slow #79

Closed
maxzinkus opened this issue Feb 27, 2023 · 6 comments · Fixed by #93
Closed

Start is very slow #79

maxzinkus opened this issue Feb 27, 2023 · 6 comments · Fixed by #93
Assignees

Comments

@maxzinkus
Copy link
Collaborator

In testing #78, I also noticed that the Retry loop surrounding Exec("hwclock -s") fails a few times every boot on my machine. It looks like it's caused by startup not yet being ready -- that is, the SSH server hasn't completely come up.

Overall (especially with the current exponential backoff on Retry -- changed in #73) this means that Booting... takes around 2+4+8= 14 seconds.

If this loop is skipped, the hwclock -s is not run -- meaning the system time will not be reset based on the RTC. However, given #78, it seems that the RTC is wrong regardless after sleep and therefore this step might not be necessary.

If the loop is skipped, Start reaches qemu/ops.go:354 before qemu-system-... has finished starting the VM. It then checks for the pidfile which is not yet present and determines that Start failed.

However, the VM is still correctly started, and by the time the user follows up with alpine ssh ..., qemu-system-... has completed startup, the pidfile has been created, and the ssh server is ready.

Whatever the resolution is to #78 (either no change or some change), this Retry loop should instead await for one of two things to happen:

  1. qemu-system-... exits (likely with an error/after receiving a signal)
  2. alpine.pid is created (which qemu-system- seems to do late in the boot process)

If 1 occurs, the VM surely did not start correctly and the pidfile should be missing (TODO: check if pidfile is cleaned up after sigkill).

If 2 occurs, the VM likely started correctly.

This way, Start can wait asynchronously for bootup to finish, and then proceed when ready after a host-system-speed dependent amount of time. Instead, on slow systems, Retry will spend many seconds sleeping after multiple failures, and on fast systems, Retry will spend some seconds sleeping after a few failures -- in both cases waiting precious cycles and time.

@maxzinkus
Copy link
Collaborator Author

@idroz you can assign me here and I'll make a PR when I can, or close if I'm missing something important here. I hope the reduced Retry times in #73 don't make Start fail on slower systems than mine, but if they do, fixing this issue will fix that too.

@jdriselvato
Copy link

Is this related to what I'm seeing right now with over 10 minutes of dots?

$ alpine start simplistic-background
2023/02/28 19:56:40 Booting...................

It's 20:07:XX as of positing this.

Is there a way to verbose output the booting? I'm worried my vm has been corrupted some how.

@maxzinkus
Copy link
Collaborator Author

maxzinkus commented Mar 1, 2023

It's not. I mean, it kind of is, but not really.

Likely, Start (the thing that boots the VM) is trying to SSH into the VM to run a command as part of the boot process. That SSH is failing in a Retry loop that waits a long time after failures. That is part of what this issue plans to fix, but it's not the root cause of your bug.

  • In ~/.macpine/simplistic-background/config.yaml, are sshuser and sshpassword correct?
  • Is the root password still the default (root)?
  • Is sshport correct and is anything on your host system listening on that port already?
  • Did you modify /etc/ssh/sshd_config inside the VM at all?

@jdriselvato
Copy link

jdriselvato commented Mar 1, 2023

I don't want to take away from the original problem, but I waited it out and got this

2023/02/28 20:14:00 unable to sync clocks: after 10 attempts, last error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none password], no supported methods remain

but checking

In ~/.macpine/simplistic-background/config.yaml, are sshuser and sshpassword correct?

was the correct fix.

@maxzinkus
Copy link
Collaborator Author

@jdriselvato see also :)

@maxzinkus
Copy link
Collaborator Author

@idroz Unfortunately, in testing on my machine the PR #93 only shaves a few seconds off the start time. If you have a chance to test, I'm curious about what you see on a different machine.

I guess it really does just take a while for the SSH server to come up :/

@idroz idroz closed this as completed in #93 Mar 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants