-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Unix domain socket support for VLAN #17473
Conversation
It should not alter any default behavior and everything else is ether behind opt-in toggle or not yet supported platform (QEMU on Windows), so, I was leaning towards adding NO NEW TESTS tag, but the changeset is not small/trivial, so, the team might disagree with it. This changeset with all other in pending review states is 99% of changes needed to enable QEMU on Windows. But yeah, I don't expect this PR would be the fast one to get through. |
I'm not sure if this part behaved on Windows as expected, but for sure it didn't affect happy path in my testing. https://github.com/containers/podman/blob/06e85f0f6093081e8707b30e191ba190613090b1/pkg/machine/qemu/machine.go#L502-L512 |
Rebased to main to resolve conflicts from another merged PR. |
64e2869
to
753b34c
Compare
I'm still a bit lost if it is possible to write a test for this one as one needs to use Windows for it or being able to modify environment variable for a specific test. |
@rhatdan may be you can help me to find some reviewers for this one? At least to get some comments if it is ok conceptually. |
pkg/machine/qemu/machine.go
Outdated
return err | ||
} | ||
|
||
for i := 0; i < 6; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 6? Personal nit, I'd prefer a variable with a name that explained why 6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just because it was already used in this file in 2 places https://github.com/containers/podman/blob/main/pkg/machine/qemu/machine.go#L515 and https://github.com/containers/podman/blob/main/pkg/machine/qemu/machine.go#L604
Probably it should refactored into more generic utility, which could be used in all places in this file. I have not strong opinion if this should be done as part of this PR or not. If it is agreed to add the refactoring here, then I will do it in a separate commit as this is not implementation related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added named constants. Didn't manage to get full refactoring to unify similar code paths.
The changes in general LGTM, but this is all way out of my wheelhouse |
This should be safe on Windows. The biggest difference is that SIGTERM can't be eaten and the program kept alive, you can only use it for cleanup or exiting with an exit code like this block is doing, so this part looks right to me. |
90d001f
to
2ee867a
Compare
Either add tests or [NO NEW TESTS NEEDED] |
I will add [NO NEW TESTS NEEDED] and force push. My reasoning:
|
Observed some sort of regression at least in my test Windows build, so, I might apply some fixes to the commits, but the concept is not expect to change. Update: removed additional refactoring for now. Probably that is beyond something I can quickly reimplement in a complex Go lang codebase. Removed commit is available at arixmkii@1321726 |
@TomSweeneyRedHat @n1hility are any additional changes needed? |
Failed "Verify Win Installer Build" looks like a flake to me (at least from build log and searching the web for that specific error). |
Looking into the verify installer failure, we have other runs where its working and repeated runs are failing here, so looking into why that happens |
pkg/machine/qemu/machine.go
Outdated
@@ -557,7 +593,10 @@ func (v *MachineVM) Start(name string, opts machine.StartOptions) error { | |||
defer dnw.Close() | |||
|
|||
attr := new(os.ProcAttr) | |||
files := []*os.File{dnr, dnw, dnw, fd} | |||
files := []*os.File{dnr, dnw, dnw} | |||
if fd != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on this check.
// cleanupVMProxyProcess kills the proxy process and removes the VM's pidfile | ||
func (v *MachineVM) cleanupVMProxyProcess(proxyProc *os.Process) error { | ||
// cleanupVMProxyProcess kills the proxy process | ||
func (v *MachineVM) cleanupVMProxyProcess(proxyPid int) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity, why did you choose to change the type here from something reasonably strong to something more generic like an int? im not sure i have an objection with it, i would however like to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because internally this os.Process
pointer was used only to call Kill
method https://pkg.go.dev/os#Process.Kill, which really doesn't work as we need to send SIGTERM for correct cleanup and it also doesn't work on Windows. There is no point extracting the pid here from os.Proccess
as previously we acquired that looking up pid value we already stored as part of the state. os.Process
really brings no benefits there if we can't use its API.
@@ -921,7 +961,11 @@ func (v *MachineVM) stopLocked() error { | |||
} | |||
|
|||
fmt.Println("Waiting for VM to exit...") | |||
for isProcessAlive(vmPid) { | |||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible that we never exit this loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added retry limit
Timeout: timeout, | ||
} | ||
return monitor, nil | ||
return define.NewMachineFile(filepath.Join(rtDir, name+".sock"), nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi, there are actually plans for podman 5 to make a sister to VMFile
but more like VMSocket or MachineSocket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will refactor this PR if the changes lands before this one is merged.
} | ||
|
||
cmd := gvproxy.NewGvproxyCommand() | ||
cmd.AddQemuSocket(fmt.Sprintf("unix://%s", v.QMPMonitor.Address.GetPath())) | ||
cmd.AddQemuSocket(fmt.Sprintf("unix://%s", filepath.ToSlash(vlanSocket.GetPath()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
Ok, reviewed. Many of the comments are fairly soft and if you wish to not do them, thats ok, just be sure to respond to the comments so it can be reconciled later. I have two stronger review comments that I will put in subsequently because then we can respond to them individually. |
I would love, where possible, to see more tests that lean heavily on your code. Unittests where possible as well. This is a lot of code to change and it is going to get heavily reworked and broken in the very near future. Would you consider going back and accessing your comfort with tests and that regressions are not introduced? |
And finally, I don't recall seeing any sort of version checking but yet this requires a newer QEMU if I understood things correctly. Podman, Podman Desktop, Brew, Distros are really struggling with package level interdependencies. Do you feel that we should somehow do something in your code to avoid this? |
ps. i think we just about have hyperv and wsl passing tests again (maybe a flake here and there) ... i would prefer that we do not merge this PR until #20953 is merged. |
Yes, I will try to add tests after addressing your other comments.
I'm counting on it. It was just a coincidence that I needed a rebase (because line endings were changes in README.md from CRLF to LF).
That newer version is one, which was released a year ago :) I have a version specific breakdown in this comment #17473 (comment) Also, this code path is enabled unconditionally only for Windows code and is behind Env var feature toggle for others. I will think about if it is viable to add version check within current changes. Will either add them or create an issue for a follow up and assign to me. |
b7e68be
to
0d27e95
Compare
Addressed comments. Will try to add unit tests later this week. |
@baude I rebased to latest changes and added unit tests, which are similar to existent ones. I see that currently there are not many unit tests and machines are mostly tested e2e. I wanted to update code to use your new utilities from podman/pkg/machine/qemu/machine.go Line 913 in c324dbb
podman/pkg/machine/qemu/config.go Line 137 in c324dbb
|
2b74266
to
4c728a7
Compare
Signed-off-by: Arthur Sengileyev <[email protected]>
This change adds support for new QEMU stream netdev added in 7.2.0. It is implemented as an opt-in mode for previously supported platforms and the only supported mode on Windows. Old FD netdev has changes only on podman side. Instead of previously used QMP socket address, it now has VLAN dedicated socket address to make both implementations more similar. As VLAN socket address has short lifespan (it exists only after forwarder has been started and before QEMU has finished startup), it is not promoted to persisted machine settings, but is rather calculated inside Start method. Signed-off-by: Arthur Sengileyev <[email protected]>
Will be continued in #21594 |
Fixes #15852
This change adds support for new QEMU stream netdev added in 7.2.0. It is implemented as an opt-in mode for previously supported platforms and the only supported mode on Windows.
Old FD netdev has changes only on podman side. Instead of previously used QMP socket address, it now has VLAN dedicated socket address to make both implementations more similar.
As VLAN socket address has short lifespan (it exists only after forwarder has been started and before QEMU has finished startup), it is not promoted to persisted machine settings, but is rather calculated inside Start method.
Signed-off-by: Arthur Sengileyev [email protected]
I tried to highlight everything significant in the commit message. This indeed fixes #15852, but it has a lot more added that the change required to fix it. I needed other changes to actually check the desired behavior of the fix for the issue mentioned above, if the team will insist I can split it up into series of commits (I would prefer not to, though 😅 ).
VLAN socket address is not persisted anywhere in machine socket, so, the logic decides on the mode by checking the command line. Actually the way how QEMU command line is stored is dictating a lot of how QEMU and in turn gvproxy and Podman should operate. So, it might look dirty, but to me looked like a reasonable implementation (at least at this point in time).
Machine Init
When in SOCKET VLAN mode (on platforms, where FD supported it has to be enabled via env var) the Init will create QEMU command line compatible with new QEMU 7.2.0 feature.
Machine Start
Podman will ignore the current settings, but will rely on what was create by Machine Init (analyze command line)
Named pipes
Now gvproxy will also expose named pipe on Windows (no-op for non-Windows paths). The name of the pipe is of the format
qemu-podman-...
For default machine it will result inqemu-podman-machine-default
. This prefix is neededto prevent name collision with WSL machine. Clients could read pipe name from
podman machine inspect
output (the change is already merged to Podman Desktop).Mode switch
Current implementation is using opt-in via env var, but alternative implementations (like checking QEMU version or config files) could exist. Socket VLAN is more cross platform, but it can't replace old implementation, because
it is only available starting from QEMU 7.2.0 (which means baseline RHEL 10 😅 )
Configuration mismatch warning
When mode switch has happened between Init and Start - like machine was created with FD settings, but then Start is
called, when SOCKET VLAN usage is requested - there will be WARN output what could be changed to fix the mismatch.
Works in both directions.
Socket VLAN gvproxy readiness check
It is impossible to use the Dial, because then we could not pass it forward as FD. The most common reason for failed start
would be socket address unavailable. In this case gvproxy will terminate. Readiness check algorithm looks like this:
3.1. If not, then short circuit to failure
3.2. If process is running continue to the next check
os.Stat
the socket address4.1. If not then sleep and go to the next loop iteration
4.4. If ok, then break the look and continue
Refactorings
Additional fixes
checkProcessStatus
in _unix sourcesv.ReadySock.Path
instead of making it up again in Start method using the same rulesLocal testing
Acceptance test: start the machine and run nginx with port forwarding to host machine.
Output from Apple Silicon Mac:
Running FD created machine (created with release version of Podman) w/ FD mode
Running FD created machine (created with release version of Podman) w/ socket VLAN mode
[NO NEW TESTS NEEDED]
Does this PR introduce a user-facing change?