Salt master not deployed correctly because another salt master is already running #2840

Closed
MonPote opened this issue Oct 9, 2020 · 2 comments · Fixed by #3003
Labels
kind:bug (Something isn't working), topic:flakiness (Some tests are flaky and cause transient CI failures)

Comments

@MonPote
Contributor

MonPote commented Oct 9, 2020

Component: salt

What happened:
On a fresh install, salt-master is sometimes not deployed correctly because another salt-master is already running and using the port.

[root@boostrap-test centos]# kubectl get pods -n kube-system
NAME                                              READY   STATUS             RESTARTS   AGE
apiserver-proxy-boostrap-test.novalocal           1/1     Running            0          78m
[...]
salt-master-boostrap-test.novalocal               1/2     CrashLoopBackOff   25         2m39s
storage-operator-7b54589795-j89p8                 1/1     Running            0          78m

[root@boostrap-test centos]# kubectl describe pods salt-master-boostrap-test.novalocal -n kube-system
[...]
  Type     Reason   Age                   From                              Message
  ----     ------   ----                  ----                              -------
  Normal   Pulled   46m (x11 over 78m)    kubelet, boostrap-test.novalocal  Container image "metalk8s-registry-from-config.invalid/metalk8s-2.6.0-dev/salt-master:3000.3-1" already present on machine
  Warning  BackOff  103s (x359 over 78m)  kubelet, boostrap-test.novalocal  Back-off restarting failed container
[root@boostrap-test centos]# kubectl logs salt-master-boostrap-test.novalocal salt-master -n kube-system
/usr/lib/python2.7/site-packages/salt/scripts.py:109: DeprecationWarning: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date.  Salt will drop support for Python 2.7 in the Sodium release or later.
[INFO    ] Setting up the Salt Master
[WARNING ] Unable to bind socket 10.200.4.17:4505, error: [Errno 98] Address already in use; Is there another salt-master running?
[INFO    ] The Salt Master is shut down
The salt master is shutdown. The ports are not available to bind
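
For a quick host-side check of what is actually bound to the Salt master port, a command like the following can be used (illustrative only; the output will differ per host):

[root@boostrap-test centos]# ss -tlnp | grep 4505    # shows which process currently holds the salt-master publish port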

After some checking, we figured out that this comes from a rogue salt-master that escaped MetalK8s tracking
(the container is still running, but not shown by kubectl).

[root@boostrap-test centos]# crictl ps -a
CONTAINER ID        IMAGE               CREATED              STATE               NAME                            ATTEMPT             POD ID
20e46184f3143       b0139962ff652       About a minute ago   Exited              salt-master                     28                  7252df6f59ddb
[...]
0b3f2a8cd512f       b0139962ff652       2 hours ago          Running             salt-api                        0                   7252df6f59ddb
cae6b3d0b08f4       b0139962ff652       2 hours ago          Running             salt-master                     0                   7252df6f59ddb
79e8c98f5a32d       f09fe80eb0e75       2 hours ago          Running             repositories                    0                   7039f0ae74cfb

Stopping and removing both salt-master containers solves the issue.
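
For reference, the manual cleanup amounts to something like the following, using the container IDs from the crictl ps -a output above (IDs are host-specific, adjust accordingly; this is the workaround, not the actual fix):

[root@boostrap-test centos]# crictl stop cae6b3d0b08f4    # rogue salt-master: still Running but no longer tracked by kubelet
[root@boostrap-test centos]# crictl rm cae6b3d0b08f4
[root@boostrap-test centos]# crictl rm 20e46184f3143      # exited salt-master attempt from the CrashLoopBackOff pod
[root@boostrap-test centos]# crictl ps -a                 # verify; kubelet will restart the salt-master container it tracks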

What was expected:
When you deploy a fresh bootstrap, salt-master should be deployed correctly.

Steps to reproduce
After some discussions with @slaperche-scality, this flakiness can also happen when you deploy/undeploy a solution.

So the best way to reproduce this is to deploy and undeploy a (complex?) solution.

It may be a bug in the way Salt is restarted.

Resolution proposal (optional):

@MonPote added the kind:bug and topic:flakiness labels on Oct 9, 2020
@slaperche-scality
Contributor

Got a similar error, but this time it's the repositories pod that went rogue:

[root@xcore-bootstrap centos]# kubectl logs  -n kube-system repositories-bootstrap
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: still could not bind()
nginx: [emerg] still could not bind()

and indeed:

[root@xcore-bootstrap centos]# crictl ps -a
CONTAINER ID        IMAGE               CREATED             STATE               NAME                       ATTEMPT             POD ID                          
62330bec79664       f09fe80eb0e75       4 minutes ago       Exited              repositories               176                 d8ba75426f4f4                   
e69b133e7f8a8       e977dc43393e9       15 hours ago        Running             salt-api                   0                   233d615634fe2                   
fdb57d186e14c       e977dc43393e9       15 hours ago        Running             salt-master                0                   233d615634fe2                   
1a550d90c5037       f09fe80eb0e75       15 hours ago        Running             repositories               0                   d8ba75426f4f4  

This happened when I undeployed an old solution and deployed a new one.

@thomasdanan
Contributor

Got this when importing a new MetalK8s ISO.

gdemonet added a commit that referenced this issue Dec 28, 2020
When using `metalk8s.static_pod_managed`, we call `file.managed` behind
the scenes. This state does a lot of magic, including creating a
temporary file with the new contents before replacing the old file.
This temp file gets created **in the same directory** as the managed
file by default, so it gets picked up by `kubelet` as if it were
another static Pod to manage. If the replacement occurs too late,
`kubelet` may have already created another Pod for the temp file, and
may not be able to "remember" the old Pod, hence not cleaning it up.
This results in "rogue containers", which can create issues (e.g.
preventing new containers from binding some ports on the host).

This commit ensures we create the temp files in `/tmp` (unless
specified otherwise), which should prevent the aforementioned situation
from happening.

Fixes: #2840
gdemonet added a commit that referenced this issue Jan 7, 2021
When using `metalk8s.static_pod_managed`, we call `file.managed` behind
the scenes. This state does a lot of magic, including creating a
temporary file with the new contents before replacing the old file.
This temp file gets created **in the same directory** as the managed
file by default, so it gets picked up by `kubelet` as if it were
another static Pod to manage. If the replacement occurs too late,
`kubelet` may have already created another Pod for the temp file, and
may not be able to "remember" the old Pod, hence not cleaning it up.
This results in "rogue containers", which can create issues (e.g.
preventing new containers from binding some ports on the host).

This commit reimplements the 'file.managed' state in a minimal fashion,
to ensure the temporary file used for making an "atomic replace" is
ignored by kubelet. Note that it requires us to also reimplement the
'file.manage_file' execution function, since it always relies on the
existing "atomic copy" operation from `salt.utils.files.copyfile`.

Fixes: #2840
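
For illustration only (not the actual MetalK8s change, which reimplements Salt's file.managed/file.manage_file), the mechanism the fix relies on can be sketched in shell: stage the new manifest under a hidden name that kubelet ignores in its static Pod directory, then rename it over the real manifest so the replacement stays atomic. Paths and variable names below are hypothetical.

MANIFEST=/etc/kubernetes/manifests/salt-master.yaml           # hypothetical static Pod manifest path
TMP="$(dirname "$MANIFEST")/.$(basename "$MANIFEST").tmp"     # dot-prefixed: kubelet skips hidden files, so no rogue Pod
printf '%s' "$NEW_CONTENTS" > "$TMP"                          # stage the new manifest contents (hypothetical variable)
mv -f "$TMP" "$MANIFEST"                                      # rename(2) on the same filesystem: atomic replace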
@bert-e closed this as completed in b799483 on Jan 11, 2021