Salt master not deployed correctly because another salt master is already running #2840

Closed
MonPote opened this issue Oct 9, 2020 · 2 comments · Fixed by #3003
Labels
kind:bug (Something isn't working), topic:flakiness (Some tests are flaky and cause transient CI failures)

Comments

@MonPote
Contributor

MonPote commented Oct 9, 2020

Component: salt

What happened:
On a fresh install, salt-master is sometimes not deployed correctly because another salt-master is already running and using the port.

[root@boostrap-test centos]# kubectl get pods -n kube-system
NAME                                              READY   STATUS             RESTARTS   AGE
apiserver-proxy-boostrap-test.novalocal           1/1     Running            0          78m
[...]
salt-master-boostrap-test.novalocal               1/2     CrashLoopBackOff   25         2m39s
storage-operator-7b54589795-j89p8                 1/1     Running            0          78m

[root@boostrap-test centos]# kubectl describe pods salt-master-boostrap-test.novalocal -n kube-system
[...]
  Type     Reason   Age                   From                              Message
  ----     ------   ----                  ----                              -------
  Normal   Pulled   46m (x11 over 78m)    kubelet, boostrap-test.novalocal  Container image "metalk8s-registry-from-config.invalid/metalk8s-2.6.0-dev/salt-master:3000.3-1" already present on machine
  Warning  BackOff  103s (x359 over 78m)  kubelet, boostrap-test.novalocal  Back-off restarting failed container
[root@boostrap-test centos]# kubectl logs salt-master-boostrap-test.novalocal salt-master -n kube-system
/usr/lib/python2.7/site-packages/salt/scripts.py:109: DeprecationWarning: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date.  Salt will drop support for Python 2.7 in the Sodium release or later.
[INFO    ] Setting up the Salt Master
[WARNING ] Unable to bind socket 10.200.4.17:4505, error: [Errno 98] Address already in use; Is there another salt-master running?
[INFO    ] The Salt Master is shut down
The salt master is shutdown. The ports are not available to bind
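
For a quick host-side check of what is actually bound to the Salt master port, a command like the following can be used (illustrative only; the output will differ per host):

[root@boostrap-test centos]# ss -tlnp | grep 4505    # shows which process currently holds the salt-master publish port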

After some checking, we figured out that this comes from a rogue salt-master that escaped MetalK8s tracking
(the container is still running, but not shown by kubectl).

[root@boostrap-test centos]# crictl ps -a
CONTAINER ID        IMAGE               CREATED              STATE               NAME                            ATTEMPT             POD ID
20e46184f3143       b0139962ff652       About a minute ago   Exited              salt-master                     28                  7252df6f59ddb
[...]
0b3f2a8cd512f       b0139962ff652       2 hours ago          Running             salt-api                        0                   7252df6f59ddb
cae6b3d0b08f4       b0139962ff652       2 hours ago          Running             salt-master                     0                   7252df6f59ddb
79e8c98f5a32d       f09fe80eb0e75       2 hours ago          Running             repositories                    0                   7039f0ae74cfb

Stopping and removing both salt-master containers solves the issue.
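
For reference, the manual cleanup amounts to something like the following, using the container IDs from the crictl ps -a output above (IDs are host-specific, adjust accordingly; this is the workaround, not the actual fix):

[root@boostrap-test centos]# crictl stop cae6b3d0b08f4    # rogue salt-master: still Running but no longer tracked by kubelet
[root@boostrap-test centos]# crictl rm cae6b3d0b08f4
[root@boostrap-test centos]# crictl rm 20e46184f3143      # exited salt-master attempt from the CrashLoopBackOff pod
[root@boostrap-test centos]# crictl ps -a                 # verify; kubelet will restart the salt-master container it tracks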

What was expected:
When you deploy a fresh bootstrap, salt-master should be deployed correctly.

Steps to reproduce
After some discussions with @slaperche-scality, this flakiness can also happen when you deploy/undeploy a solution.

So the best way to reproduce this is to deploy and undeploy a (complex?) solution.

It may be a bug in the way Salt is restarted.

Resolution proposal (optional):

@MonPote added the kind:bug and topic:flakiness labels on Oct 9, 2020
@slaperche-scality
Contributor

Got a similar error, but this time it's the repositories pod that went rogue:

[root@xcore-bootstrap centos]# kubectl logs  -n kube-system repositories-bootstrap
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: bind() to 10.100.4.3:8080 failed (98: Address already in use)
nginx: [emerg] bind() to 10.100.4.3:8080 failed (98: Address already in use)
2020/12/02 08:36:48 [emerg] 1#1: still could not bind()
nginx: [emerg] still could not bind()

and indeed:

[root@xcore-bootstrap centos]# crictl ps -a
CONTAINER ID        IMAGE               CREATED             STATE               NAME                       ATTEMPT             POD ID                          
62330bec79664       f09fe80eb0e75       4 minutes ago       Exited              repositories               176                 d8ba75426f4f4                   
e69b133e7f8a8       e977dc43393e9       15 hours ago        Running             salt-api                   0                   233d615634fe2                   
fdb57d186e14c       e977dc43393e9       15 hours ago        Running             salt-master                0                   233d615634fe2                   
1a550d90c5037       f09fe80eb0e75       15 hours ago        Running             repositories               0                   d8ba75426f4f4  

This happened when I undeployed an old solution and deployed a new one.

@thomasdanan
Contributor

Got this when importing a new MetalK8s ISO.

gdemonet added a commit that referenced this issue Dec 28, 2020
When using `metalk8s.static_pod_managed`, we call `file.managed` behind
the scenes. This state does a lot of magic, including creating a
temporary file with the new contents before replacing the old file.
This temp file gets created **in the same directory** as the managed
file by default, so it gets picked up by `kubelet` as if it were
another static Pod to manage. If the replacement occurs too late,
`kubelet` may have already created another Pod for the temp file, and
may not be able to "remember" the old Pod, hence not cleaning it up.
This results in "rogue containers", which can create issues (e.g.
preventing new containers from binding some ports on the host).

This commit ensures we create the temp files in `/tmp` (unless
specified otherwise), which should prevent the aforementioned situation
from happening.

Fixes: #2840
gdemonet added a commit that referenced this issue Jan 7, 2021
When using `metalk8s.static_pod_managed`, we call `file.managed` behind
the scenes. This state does a lot of magic, including creating a
temporary file with the new contents before replacing the old file.
This temp file gets created **in the same directory** as the managed
file by default, so it gets picked up by `kubelet` as if it were
another static Pod to manage. If the replacement occurs too late,
`kubelet` may have already created another Pod for the temp file, and
may not be able to "remember" the old Pod, hence not cleaning it up.
This results in "rogue containers", which can create issues (e.g.
preventing new containers from binding some ports on the host).

This commit reimplements the 'file.managed' state in a minimal fashion,
to ensure the temporary file used for making an "atomic replace" is
ignored by kubelet. Note that it requires us to also reimplement the
'file.manage_file' execution function, since it always relies on the
existing "atomic copy" operation from `salt.utils.files.copyfile`.

Fixes: #2840
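
For illustration only (not the actual MetalK8s change, which reimplements Salt's file.managed/file.manage_file), the mechanism the fix relies on can be sketched in shell: stage the new manifest under a hidden name that kubelet ignores in its static Pod directory, then rename it over the real manifest so the replacement stays atomic. Paths and variable names below are hypothetical.

MANIFEST=/etc/kubernetes/manifests/salt-master.yaml           # hypothetical static Pod manifest path
TMP="$(dirname "$MANIFEST")/.$(basename "$MANIFEST").tmp"     # dot-prefixed: kubelet skips hidden files, so no rogue Pod
printf '%s' "$NEW_CONTENTS" > "$TMP"                          # stage the new manifest contents (hypothetical variable)
mv -f "$TMP" "$MANIFEST"                                      # rename(2) on the same filesystem: atomic replace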
@bert-e closed this as completed in b799483 on Jan 11, 2021