Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

podman network create: race: error reading */conflist: ENOENT #7807

Closed
edsantiago opened this issue Sep 28, 2020 · 12 comments · Fixed by #7943
Closed

podman network create: race: error reading */conflist: ENOENT #7807

edsantiago opened this issue Sep 28, 2020 · 12 comments · Fixed by #7943
Assignees
Labels
In Progress This issue is actively being worked by the assignee, please do not work on this at this time. kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

There seems to be a race condition causing flakes when multiple tests run 'podman network create/rm' at once:

  podman network remove with two networks
...
Running: podman [options] network create net2
Error: in /etc/cni/net.d/testNetTwoCNI2.conflist: error reading /etc/cni/net.d/testNetTwoCNI2.conflist: open /etc/cni/net.d/testNetTwoCNI2.conflist: no such file or directory

Probable cause: podman inspect container two CNI networks test running at just the right time, and running its network rm at just the wrong moment.

Log: from PR 7803

@edsantiago
Copy link
Member Author

(I'm not going to bother trying to reproduce this. I'm gambling that it's trivial to find and fix the race. If it isn't, please LMK and I'll dive into it further).

@mheon
Copy link
Member

mheon commented Sep 28, 2020 via email

@edsantiago
Copy link
Member Author

@mheon by "should be possible" do you mean "we currently have existing functionality, e.g. a --confdir option or envariable, that the tests can use"? Or "it should be possible to add such functionality"?

@mheon
Copy link
Member

mheon commented Sep 28, 2020 via email

@Luap99
Copy link
Member

Luap99 commented Sep 29, 2020

Actually we support this already: podman --cni-config-dir ~/test network ls

@rhatdan
Copy link
Member

rhatdan commented Oct 1, 2020

@edsantiago Is that what you want?

@edsantiago
Copy link
Member Author

@rhatdan I haven't looked into it as closely at it deserves. My impression is that it might help with CI flakes but still leave us exposed to real bugs: I do suspect there's a real race condition in which (perhaps):

  1. process 1 (podman network create) reads the conflist dir; then
  2. something removes one of those files; then
  3. process 1 tries to read one of the conflist files it read in step 1, and fails.

One possibility is for podman network create to lock the cni dir (yuk); another might be to ignore ENOENT. But again, this is just pure guesswork on my part.

@Luap99
Copy link
Member

Luap99 commented Oct 5, 2020

podman network create is indeed very racy.

Just try this:

$ podman network create & podman network create
/etc/cni/net.d/cni-podman18.conflist
/etc/cni/net.d/cni-podman18.conflist

As you can see both calls created the same config. This happens about 1 out of 4 tries on my pc, both rootfull and rootless.

Looking at the code there are definitely some race conditions. The root cause is properly this function:

// LoadCNIConfsFromDir loads all the CNI configurations from a dir
func LoadCNIConfsFromDir(dir string) ([]*libcni.NetworkConfigList, error) {
files, err := libcni.ConfFiles(dir, []string{".conflist"})
if err != nil {
return nil, err
}
sort.Strings(files)
configs := make([]*libcni.NetworkConfigList, 0, len(files))
for _, confFile := range files {
conf, err := libcni.ConfListFromFile(confFile)
if err != nil {
return nil, errors.Wrapf(err, "in %s", confFile)
}
configs = append(configs, conf)
}
return configs, nil
}

For some reason it is called three times when you run podman network create. This alone is very redundant.

I believe this is also the cause of this flake: #7583
If you create two networks at the same time they can get the same interface name assigned which then fails if we try to run it.

Unfortunately I don't think I have enough knowledge and time to properly fix this.

@rhatdan
Copy link
Member

rhatdan commented Oct 6, 2020

Random idea?
Does the 18 matter? Could we just randomize the number, to make the conflict much less likely?

@Luap99
Copy link
Member

Luap99 commented Oct 6, 2020

Well this is also the network interface name. I personally prefer numbered interface names in my ip addr output instead of random suffixes.

@mheon
Copy link
Member

mheon commented Oct 6, 2020

We definitely need a global lock for major CNI operations. I'm hesitant to add one for container network setup due to performance concerns, but creating and deleting CNI networks needs to be locked.

@mheon mheon self-assigned this Oct 6, 2020
@mheon mheon added the kind/bug Categorizes issue or PR as related to a bug. label Oct 6, 2020
@baude
Copy link
Member

baude commented Oct 6, 2020

I'll add this one to my queue ... i have one in front of it and somewhat heavy (compared to other days) meeting day.

@mheon mheon assigned baude and unassigned mheon Oct 6, 2020
@baude baude added the In Progress This issue is actively being worked by the assignee, please do not work on this at this time. label Oct 6, 2020
baude added a commit to baude/podman that referenced this issue Oct 7, 2020
due to a lack of "locking" on cni operations, we could get ourselves in trouble when doing rapid creation or removal of networks.  added a simple file lock to deal with the collision and because it is not considered a performent path, use of the file lock should be ok.  if proven otherwise in the future, some generic shared memory lock should be implemented for libpod and also used here.

moved pkog/network to libpod/network because libpod is now being pulled into the package and it has therefore lost its generic nature. this will make it easier to absorb into libpod as we try to make the network closer to core operations.

Fixes: containers#7807

Signed-off-by: baude <[email protected]>
mheon pushed a commit to mheon/libpod that referenced this issue Oct 14, 2020
due to a lack of "locking" on cni operations, we could get ourselves in trouble when doing rapid creation or removal of networks.  added a simple file lock to deal with the collision and because it is not considered a performent path, use of the file lock should be ok.  if proven otherwise in the future, some generic shared memory lock should be implemented for libpod and also used here.

moved pkog/network to libpod/network because libpod is now being pulled into the package and it has therefore lost its generic nature. this will make it easier to absorb into libpod as we try to make the network closer to core operations.

Fixes: containers#7807

Signed-off-by: baude <[email protected]>
<MH: Fixed cherry pick conflicts>
Signed-off-by: Matthew Heon <[email protected]>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 22, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
In Progress This issue is actively being worked by the assignee, please do not work on this at this time. kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants