fix ambiguous networks #1831
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: BenTheElder. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
// deterministically sort networks
// NOTE: THIS PART IS IMPORTANT!
// TODO(fixme): we should be sorting on active usage first!
this TODO is the one thing keeping this WIP.
I will come back and rework this later. the overall code will be more or less the same otherwise.
let's verify it

diff --git a/pkg/cluster/internal/providers/docker/network_test.go b/pkg/cluster/internal/providers/docker/network_test.go
index 789a0d93..a5fb1d2b 100644
--- a/pkg/cluster/internal/providers/docker/network_test.go
+++ b/pkg/cluster/internal/providers/docker/network_test.go
@@ -19,8 +19,46 @@ package docker
import (
"fmt"
"testing"
+
+ "sigs.k8s.io/kind/pkg/exec"
)
+func TestEnsureNetworkConcurrent(t *testing.T) {
+ defer func() {
+ cmd := exec.Command(
+ "docker", "network", "rm", "test-kind",
+ )
+ cmd.Run()
+ }()
+
+ // Create multiple networks concurrently
+ errCh := make(chan error, 3)
+ for i := 0; i < 3; i++ {
+ go func() {
+ errCh <- ensureNetwork("test-kind")
+ }()
+ }
+ for i := 0; i < 3; i++ {
+ if err := <-errCh; err != nil {
+ t.Errorf("error creating network: %v", err)
+ }
+ }
+
+ cmd := exec.Command(
+ "docker", "network", "ls",
+ "--filter=name=^test-kind$",
+ "--format={{.Name}}",
+ )
+
+ lines, err := exec.OutputLines(cmd)
+ if err != nil {
+ t.Errorf("obtaining the docker networks")
+ }
+ if len(lines) != 1 {
+ t.Errorf("wrong number of networks created: %d", len(lines))
+ }
+}
+
func Test_generateULASubnetFromName(t *testing.T) {
t.Parallel()
cases := []struct {
maybe the name doesn't have to be unique, but the subnet should be
yeah, so AFAICT this only happens when we use the default IPAM (AKA the no-IPv6 case). EDIT: ignore previous note here. see integration test discussion below.
Not asking to include it, just to run it before merge to be sure; it is not working for me... I couldn't spend much time on it though.
picking this back up now.
testing with
I think it may have been in the timestamp parse/comparison. Since dropping that I've not had any failures in 160 iterations.
)

func TestIntegrationEnsureNetworkConcurrent(t *testing.T) {
We can consider reworking the make targets based on this pattern in a follow-up.
to run only "unit": go test -short ...
to run only "integration": go test -run ^TestIntegration ...
/retest
/retitle fix ambiguous networks
Great! This should mitigate the main ambiguous docker network issue 👍
The other remaining possibilities of race conditions seem like weird corner cases which we'll most likely not run into in practice.
SGTM
currently refactoring to make #1831 (comment) a bit cleaner to address. initially I didn't want to expose the inspect details, but upon reflection I think it's worth it to break these up and make them more testable.
/retest
@BenTheElder: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/lgtm
I profiled this before the last commit and improved the performance slightly; the happy path is just as cheap as before now. We're going to need to do something about those two CI jobs, but they're unrelated to this PR.
fixes #1596
Rough algorithm:
This ensures that we will remove the ambiguous extra networks when two kind create cluster invocations "win the race" of docker network create and simultaneously create identically named networks, within the timeframe where docker doesn't catch that the name already exists during its internal non-atomic check for an existing network by that name. This should be safe because we deterministically sort based on "is this network being used somehow", then "when was this network created", and finally, if those are identical, on the UID docker assigns them.
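For reference, a hedged sketch of that deterministic ordering; the networkDetails struct and sortNetworks helper below are hypothetical stand-ins, not the actual code in this PR (and note a comment above mentions the timestamp comparison was later dropped):

```go
package docker

import "sort"

// networkDetails is a hypothetical stand-in for whatever the
// provider learns from inspecting each candidate network.
type networkDetails struct {
	ID      string // UID assigned by docker
	InUse   bool   // "is this network being used somehow"
	Created int64  // creation time (unix seconds)
}

// sortNetworks orders candidates so the network to keep sorts first:
// in-use networks before unused ones, then older before newer, and
// finally by docker-assigned ID as a deterministic tiebreaker.
func sortNetworks(networks []networkDetails) {
	sort.Slice(networks, func(i, j int) bool {
		a, b := networks[i], networks[j]
		if a.InUse != b.InUse {
			return a.InUse // in-use networks first
		}
		if a.Created != b.Created {
			return a.Created < b.Created // older first
		}
		return a.ID < b.ID // identical otherwise: fall back to UID
	})
}
```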
TODO:
[ ] There's still perhaps a race when doing the cleanup, where process A and B win the race, A wins it faster and somehow continues on to creating containers attached to this network while B is going to clean up. I'm not sure if this can actually happen and it's difficult to reproduce, but if it did we might get an error during container creation, in which case we should retry because the cleanup is coming (see the sketch below). I think this isn't feasible, as in all my experiments when I pull out the network logic and "win the race" by running it simultaneously in two goroutines, I see that the following list check observes both networks and handles it, but this is probably still worth adding as a mitigation.

EDIT: The second case should not happen now. If somehow I'm wrong, it will be a simple mitigation in the future, and this already provides a significant improvement (complete?) in avoiding #1596.
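A hypothetical sketch of that retry mitigation, assuming a sentinel error for the vanished network; none of these names are from the PR:

```go
package docker

import (
	"errors"
	"time"
)

// errNetworkGone is a placeholder for the error docker would return if
// the network was deleted by a concurrent cleanup between creation and use.
var errNetworkGone = errors.New("network not found")

// createContainerWithRetry retries container creation a few times, on
// the theory that a losing `kind create cluster` may still be cleaning
// up its duplicate network when we first attach to ours.
func createContainerWithRetry(create func() error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = create(); err == nil || !errors.Is(err, errNetworkGone) {
			return err // success, or an unrelated failure
		}
		time.Sleep(time.Second) // give the concurrent cleanup time to finish
	}
	return err
}
```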